Extract layers of hidden states (word embeddings) for all character variables in a given dataframe.

textEmbedLayersOutput(
  x,
  contexts = TRUE,
  single_context_embeddings = FALSE,
  decontexts = TRUE,
  model = "bert-base-uncased",
  layers = 11,
  return_tokens = TRUE,
  dim_name = FALSE,
  device = "cpu",
  tokenizer_parallelism = FALSE,
  model_max_length = NULL,
  logging_level = "error"
)

Arguments

x

A character variable or a tibble/dataframe with at least one character variable.

contexts

Provide word embeddings based on word contexts (standard method; default = TRUE).

single_context_embeddings

Aggregated contextualized word embeddings for each token in the text variables (only works for one text variable at a time).

decontexts

Provide decontextualized word embeddings, where each word is embedded on its own as input (these embeddings are used for plotting; default = TRUE).

model

Character string specifying the pre-trained language model (default "bert-base-uncased"). For a full list of options, see the pretrained models at HuggingFace. For example, use "bert-base-multilingual-cased", "openai-gpt", "gpt2", "ctrl", "transfo-xl-wt103", "xlnet-base-cased", "xlm-mlm-enfr-1024", "distilbert-base-cased", "roberta-base", or "xlm-roberta-base".

layers

Specify the layers that should be extracted (default 11). It is more efficient to extract only the layers that you need (e.g., 11). You can also extract several layers (e.g., 11:12), or all of them by setting this parameter to "all". Layer 0 is the decontextualized input layer (i.e., it does not comprise hidden states) and thus should normally not be used. The extracted layers can then be aggregated with the textEmbedLayerAggregation function.
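As a minimal sketch of the extract-then-aggregate workflow described above (assuming the text package and its example dataset Language_based_assessment_data_8 are installed; argument names to textEmbedLayerAggregation may differ between package versions):

```r
# Hedged sketch: assumes the 'text' package and its example data are available.
library(text)

x <- Language_based_assessment_data_8[1:2, 1:2]

# Extract hidden states from layers 11 and 12 (layer 0 is the input
# embedding and is normally skipped).
embeddings_layers <- textEmbedLayersOutput(x, layers = 11:12)

# Aggregate the extracted layers into one embedding per text; here the
# layers are aggregated with the function's defaults (check your
# installed version's documentation for the exact argument names).
embeddings_aggregated <- textEmbedLayerAggregation(
  embeddings_layers$context,
  layers = 11:12
)
```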

return_tokens

If TRUE, also return the tokens used by the specified transformer model.

dim_name

Boolean; if TRUE, append the text-variable name to each dimension name in the output (e.g., Dim1_text_variable_name). This distinguishes the word-embedding dimensions of different text variables.

device

Name of the device to use: "cpu", "gpu", or "gpu:k", where k is a specific device number.

tokenizer_parallelism

If TRUE, turn on tokenizer parallelism (default = FALSE).

model_max_length

The maximum length (in number of tokens) of the inputs to the transformer model (default: the value stored for the associated model).

logging_level

Set the logging level (default: "error", as shown in the usage above). Options, ordered from least to most logging: "critical", "error", "warning", "info", "debug".

Value

A tibble with the tokens, a column specifying the layer, and the word embeddings. Note that layer 0 is the input embedding to the transformer and should normally not be used.

Examples

# \donttest{
# x <- Language_based_assessment_data_8[1:2, 1:2]
# word_embeddings_with_layers <- textEmbedLayersOutput(x, layers = 11:12)
# }