Extract layers and aggregate them to word embeddings, for all character variables in a given dataframe.

textEmbed(
  x,
  model = "bert-base-uncased",
  layers = 11:12,
  contexts = TRUE,
  context_layers = layers,
  context_aggregation_layers = "concatenate",
  context_aggregation_tokens = "mean",
  context_tokens_select = NULL,
  context_tokens_deselect = NULL,
  decontexts = TRUE,
  decontext_layers = layers,
  decontext_aggregation_layers = "concatenate",
  decontext_aggregation_tokens = "mean",
  decontext_tokens_select = NULL,
  decontext_tokens_deselect = NULL
)

Arguments

x A character variable or a tibble/dataframe with at least one character variable.
model Character string specifying the pre-trained language model (default 'bert-base-uncased'). For the full list of options, see the pretrained models at HuggingFace. For example, use "bert-base-multilingual-cased", "openai-gpt", "gpt2", "ctrl", "transfo-xl-wt103", "xlnet-base-cased", "xlm-mlm-enfr-1024", "distilbert-base-cased", "roberta-base", or "xlm-roberta-base".
layers The layers that should be extracted (default 11:12). It is more efficient to extract only the layers that you need (e.g., 12). Layer 0 is the decontextualized input layer (i.e., it does not comprise hidden states), so its use is not advised. The extracted layers can then be aggregated in the textEmbedLayerAggregation function. If you want all layers, use 'all'.
contexts Provide word embeddings based on word contexts (standard method; default = TRUE).
context_layers The layers that should be aggregated for the context embeddings (default: the layers extracted above). Layer 0 is the decontextualized input layer (i.e., it does not comprise hidden states), so its use is not advised.
context_aggregation_layers Method to aggregate the contextualized layers: "mean", "min" or "max", which take the mean, minimum or maximum, respectively, across each column; or "concatenate", which links together each word embedding layer to one long row.
context_aggregation_tokens Method to aggregate the contextualized tokens: "mean", "min" or "max", which take the mean, minimum or maximum, respectively, across each column; or "concatenate", which links the embeddings together to one long row.
context_tokens_select Option to select word embeddings linked to specific tokens, such as [CLS] and [SEP], for the context embeddings.
context_tokens_deselect Option to deselect embeddings linked to specific tokens, such as [CLS] and [SEP], for the context embeddings.
decontexts Provide word embeddings of single words as input (embeddings used, e.g., for plotting; default = TRUE).
decontext_layers The layers that should be aggregated for the decontext embeddings (default: the layers extracted above).
decontext_aggregation_layers Method to aggregate the decontextualized layers: "mean", "min" or "max", which take the mean, minimum or maximum, respectively, across each column; or "concatenate", which links together each word embedding layer to one long row.
decontext_aggregation_tokens Method to aggregate the decontextualized tokens: "mean", "min" or "max", which take the mean, minimum or maximum, respectively, across each column; or "concatenate", which links the embeddings together to one long row.
decontext_tokens_select Option to select embeddings linked to specific tokens, such as [CLS] and [SEP], for the decontext embeddings.
decontext_tokens_deselect Option to deselect embeddings linked to specific tokens, such as [CLS] and [SEP], for the decontext embeddings.
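
The two aggregation steps described in the arguments (first across layers, then across tokens) can be illustrated with a small base-R sketch. This is not the package's implementation: the helper names aggregate_layers and aggregate_tokens, the toy token list, and all numeric values are assumptions made up for illustration, and the real function may differ in detail.

```r
# Toy hidden states for one text: 4 tokens, 2 dimensions, 2 layers.
# All values are invented for illustration only.
tokens <- c("[CLS]", "very", "happy", "[SEP]")
layer11 <- matrix(c(0, 0,
                    1, 2,
                    3, 4,
                    9, 9),
                  nrow = 4, byrow = TRUE, dimnames = list(tokens, NULL))
layer12 <- layer11 + 10

# Hypothetical helper: combine several layers into one matrix per text.
aggregate_layers <- function(layers, method = "concatenate") {
  switch(method,
    concatenate = do.call(cbind, layers),        # one long row per token
    mean = Reduce(`+`, layers) / length(layers), # element-wise mean
    min = do.call(pmin, layers),                 # element-wise minimum
    max = do.call(pmax, layers),                 # element-wise maximum
    stop("unknown method: ", method)
  )
}

# Hypothetical helper: drop unwanted tokens, then collapse tokens to one vector.
aggregate_tokens <- function(token_matrix, method = "mean", deselect = NULL) {
  keep <- !(rownames(token_matrix) %in% deselect)
  kept <- token_matrix[keep, , drop = FALSE]
  switch(method,
    mean = colMeans(kept),
    min = apply(kept, 2, min),
    max = apply(kept, 2, max),
    concatenate = as.vector(t(kept)),            # tokens placed side by side
    stop("unknown method: ", method)
  )
}

layers <- list(layer11, layer12)
special <- c("[CLS]", "[SEP]")

# Concatenate layers, then average over the remaining tokens.
embedding_concat <- aggregate_tokens(aggregate_layers(layers, "concatenate"),
                                     method = "mean", deselect = special)

# Average layers, then average over the remaining tokens.
embedding_mean <- aggregate_tokens(aggregate_layers(layers, "mean"),
                                   method = "mean", deselect = special)
```

With "concatenate" the resulting embedding has one block of dimensions per layer (here 2 x 2 = 4 values), whereas "mean", "min" and "max" keep the dimensionality of a single layer (here 2 values).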

Value

A tibble with tokens, a column for the layer identifier, and word embeddings. Note that layer 0 is the input embedding to the transformer.

See textEmbedLayerAggregation and textEmbedLayersOutput.
# \donttest{
comment(wordembeddings$satisfactionwords)
#> [1] "Information about the embeddings. textEmbedLayersOutput: model: bert-base-uncased layers: 9 10 11 . textEmbedLayerAggregation: layers = 11 aggregate_layers = concatenate aggregate_tokens = mean tokens_select = tokens_deselect = "
comment(wordembeddings$singlewords_we)
#> [1] "Information about the embeddings. textEmbedLayersOutput:  bert-base-uncased layers: 9 10 11 . textEmbedLayerAggregation: layers =  9 aggregate_layers =  concatenate aggregate_tokens =  mean tokens_select =   tokens_deselect =  "
comment(wordembeddings)
#> [1] "Duration to embed text: 18.413594 secs; Date created: 2021-02-12 19:00:05"
# Example 2
# }
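
The examples above read settings back with comment(), which is base R's accessor for a free-text attribute attached to an object. As a minimal sketch of that mechanism, the following uses a made-up data frame as a stand-in for an embeddings object; the object name, its values, and the comment string are all hypothetical.

```r
# Hypothetical stand-in for an embeddings object; the values are made up.
embeddings <- data.frame(Dim1 = c(0.1, 0.2), Dim2 = c(0.3, 0.4))

# Attach a free-text description to the object.
comment(embeddings) <- "model: bert-base-uncased; layers: 11 12"

# print() does not display the comment attribute ...
print(embeddings)

# ... so it has to be requested explicitly.
stored <- comment(embeddings)
stored
```

Because the comment is not shown when the object is printed, calling comment() is the way to check which model and layers an embeddings object was built from.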