Extract layers and aggregate them to word embeddings for all character variables in a given dataframe.

textEmbed(
  texts,
  model = "bert-base-uncased",
  layers = -2,
  dim_name = TRUE,
  aggregation_from_layers_to_tokens = "concatenate",
  aggregation_from_tokens_to_texts = "mean",
  aggregation_from_tokens_to_word_types = NULL,
  keep_token_embeddings = TRUE,
  tokens_select = NULL,
  tokens_deselect = NULL,
  decontextualize = FALSE,
  model_max_length = NULL,
  max_token_to_sentence = 4,
  tokenizer_parallelism = FALSE,
  device = "gpu",
  logging_level = "error"
)



texts: A character variable or a tibble/dataframe with at least one character variable.


model: Character string specifying the pre-trained language model (default "bert-base-uncased"). For a full list of options, see the pretrained models at HuggingFace. For example, use "bert-base-multilingual-cased", "openai-gpt", "gpt2", "ctrl", "transfo-xl-wt103", "xlnet-base-cased", "xlm-mlm-enfr-1024", "distilbert-base-cased", "roberta-base", or "xlm-roberta-base". Only load models that you trust from HuggingFace; loading a malicious model can execute arbitrary code on your computer.


layers: (string or numeric) Specify the layers that should be extracted (default -2, which gives the second-to-last layer). It is more efficient to extract only the layers that you need (e.g., 11). You can also extract several layers (e.g., 11:12), or all layers by setting this parameter to "all". Layer 0 is the decontextualized input layer (i.e., it does not comprise hidden states) and thus should normally not be used. These layers can then be aggregated in the textEmbedLayerAggregation function.


dim_name: (boolean) If TRUE, append the variable name to all dimension names in the output (this differentiates between the word embedding dimension names of different variables; e.g., Dim1_text_variable_name). See textDimName to change names back and forth.
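The naming scheme can be sketched in plain base R; the variable name "harmonywords" below is a made-up example, not something this function produces:

```r
# Toy illustration (base R only): with dim_name = TRUE, the text
# variable's name is appended to each embedding dimension name.
# "harmonywords" is a hypothetical column name used for illustration.
dims <- paste0("Dim", 1:3)
named_dims <- paste0(dims, "_harmonywords")
print(named_dims)
# "Dim1_harmonywords" "Dim2_harmonywords" "Dim3_harmonywords"
```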


aggregation_from_layers_to_tokens: (string) Method to aggregate the contextualized layers for each token: "min", "max", or "mean", which take the minimum, maximum, or mean, respectively, across each column; or "concatenate", which links the layers of each word embedding together into one long row (default "concatenate").
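The difference between the aggregation options can be sketched in base R; the two 3-dimensional vectors below are made-up toy values standing in for one token's embedding from two layers, not real model output:

```r
# Toy illustration: one token's embedding from two extracted layers.
layer_11 <- c(0.1, 0.4, -0.2)
layer_12 <- c(0.3, 0.0, 0.6)
layers <- rbind(layer_11, layer_12)

colMeans(layers)       # "mean": element-wise mean        -> 3 dimensions
apply(layers, 2, min)  # "min":  element-wise minimum     -> 3 dimensions
apply(layers, 2, max)  # "max":  element-wise maximum     -> 3 dimensions
c(layer_11, layer_12)  # "concatenate": layers linked up  -> 6 dimensions
```

Note that "concatenate" grows the embedding length with the number of layers, while "mean", "min", and "max" keep it fixed.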


aggregation_from_tokens_to_texts: (string) Method to aggregate the token embeddings to the individual text (i.e., the aggregation of all tokens/words given to the transformer; default "mean").


aggregation_from_tokens_to_word_types: (string) Method to aggregate the token embeddings to word types (i.e., the individual words) rather than texts (default NULL).
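As a rough sketch of what aggregating to word types means, the base R snippet below averages embeddings for three occurrences of the same word; the numbers are toy values, not model output:

```r
# Toy illustration: three contextualized occurrences of one word type,
# each with a made-up 2-dimensional embedding.
occurrences <- rbind(
  c(0.2, 0.4),
  c(0.4, 0.6),
  c(0.0, 0.2)
)
# Under aggregation_from_tokens_to_word_types = "mean", the word type
# gets a single embedding: the element-wise mean over its occurrences.
word_type_embedding <- colMeans(occurrences)
print(word_type_embedding)
# 0.2 0.4
```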


keep_token_embeddings: (boolean) Whether to also keep token embeddings when using text or word-type aggregation.


tokens_select: Option to select word embeddings linked to specific tokens, such as [CLS] and [SEP], for the context embeddings.


tokens_deselect: Option to deselect word embeddings linked to specific tokens, such as [CLS] and [SEP], for the context embeddings.


decontextualize: (boolean) Provide word embeddings of single words as input to the model (these embeddings are, e.g., used for plotting; default FALSE). If using this, then set single_context_embeddings to FALSE.


model_max_length: The maximum length (in number of tokens) of the inputs to the transformer model (defaults to the value stored for the associated model).


max_token_to_sentence: (numeric) Maximum number of tokens in a string to handle before switching to embedding the text sentence by sentence (default 4).


tokenizer_parallelism: (boolean) If TRUE, turn on tokenizer parallelism (default FALSE).


device: Name of device to use: "cpu", "gpu", or "gpu:k", where k is a specific device number (default "gpu").


logging_level: Set the logging level (default "error"). Options, ordered from less to more logging: critical, error, warning, info, debug.


A tibble with tokens, a column for layer identifier, and word embeddings. Note that layer 0 is the input embedding to the transformer.


# \donttest{
# word_embeddings <- textEmbed(Language_based_assessment_data_8[1:2, 1:2],
#                             layers = 10:11,
#                             aggregation_from_layers_to_tokens = "concatenate",
#                             aggregation_from_tokens_to_texts = "mean",
#                             aggregation_from_tokens_to_word_types = "mean")
## Show information about how the embeddings were constructed
# comment(word_embeddings$texts$satisfactiontexts)
# comment(word_embeddings$word_types)
# comment(word_embeddings$tokens$satisfactiontexts)
# }