Extract layers of hidden states (word embeddings) for all character variables in a given dataframe.

textEmbedLayersOutput(
  x,
  contexts = TRUE,
  single_context_embeddings = FALSE,
  decontexts = TRUE,
  model = "bert-base-uncased",
  layers = 11,
  return_tokens = TRUE,
  dim_name = FALSE,
  device = "cpu",
  tokenizer_parallelism = FALSE,
  model_max_length = NULL,
  logging_level = "error"
)

Arguments

x

A character variable or a tibble/dataframe with at least one character variable.

contexts

Provide word embeddings based on word contexts (standard method; default = TRUE).

single_context_embeddings

Aggregated contextualized word embeddings for each token in the text variables (only works for one text variable at a time).

decontexts

Provide decontextualized word embeddings, where each word is embedded on its own as input (these embeddings are used for plotting; default = TRUE).

model

Character string specifying the pre-trained language model (default "bert-base-uncased"). For a full list of options, see the pretrained models at HuggingFace. For example, use "bert-base-multilingual-cased", "openai-gpt", "gpt2", "ctrl", "transfo-xl-wt103", "xlnet-base-cased", "xlm-mlm-enfr-1024", "distilbert-base-cased", "roberta-base", or "xlm-roberta-base".

layers

Specify the layers that should be extracted (default 11). It is more efficient to extract only the layers that you need (e.g., 11). You can also extract several layers (e.g., 11:12), or all of them by setting this parameter to "all". Layer 0 is the decontextualized input layer (i.e., it does not comprise hidden states) and thus should normally not be used. The extracted layers can then be aggregated with the textEmbedLayerAggregation function.
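As a minimal sketch of the extract-then-aggregate workflow described above (assuming the text package and its example dataset Language_based_assessment_data_8 are installed; argument names to textEmbedLayerAggregation may differ between package versions):

```r
# Hedged sketch: assumes the 'text' package and its example data are available.
library(text)

x <- Language_based_assessment_data_8[1:2, 1:2]

# Extract hidden states from layers 11 and 12 (layer 0 is the input
# embedding and is normally skipped).
embeddings_layers <- textEmbedLayersOutput(x, layers = 11:12)

# Aggregate the extracted layers into one embedding per text; here the
# layers are aggregated with the function's defaults (check your
# installed version's documentation for the exact argument names).
embeddings_aggregated <- textEmbedLayerAggregation(
  embeddings_layers$context,
  layers = 11:12
)
```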

return_tokens

If TRUE, also return the tokens used by the specified transformer model.

dim_name

Boolean; if TRUE, append the text-variable name to each dimension name in the output (e.g., Dim1_text_variable_name). This distinguishes the word-embedding dimensions of different text variables.

device

Name of the device to use: "cpu", "gpu", or "gpu:k", where k is a specific device number.

tokenizer_parallelism

If TRUE, turn on tokenizer parallelism (default = FALSE).

model_max_length

The maximum length (in number of tokens) of the inputs to the transformer model (default: the value stored for the associated model).

logging_level

Set the logging level (default: "error", as shown in the usage above). Options, ordered from least to most logging: "critical", "error", "warning", "info", "debug".

Value

A tibble with the tokens, a column specifying the layer, and the word embeddings. Note that layer 0 is the input embedding to the transformer and should normally not be used.

Examples

# \donttest{
# x <- Language_based_assessment_data_8[1:2, 1:2]
# word_embeddings_with_layers <- textEmbedLayersOutput(x, layers = 11:12)
# }