Extracts layers from a pre-trained language model and aggregates them to word embeddings for all character variables in a given dataframe.

textEmbed(
x,
model = "bert-base-uncased",
layers = 11:12,
contexts = TRUE,
context_layers = layers,
context_aggregation_layers = "concatenate",
context_aggregation_tokens = "mean",
context_tokens_select = NULL,
context_tokens_deselect = NULL,
decontexts = TRUE,
decontext_layers = layers,
decontext_aggregation_layers = "concatenate",
decontext_aggregation_tokens = "mean",
decontext_tokens_select = NULL,
decontext_tokens_deselect = NULL,
device = "cpu",
model_max_length = NULL,
logging_level = "error"
)

## Arguments

x

A character variable or a tibble/dataframe with at least one character variable.

model

Character string specifying the pre-trained language model (default 'bert-base-uncased'). For a full list of options, see the pretrained models at HuggingFace. For example, use "bert-base-multilingual-cased", "openai-gpt", "gpt2", "ctrl", "transfo-xl-wt103", "xlnet-base-cased", "xlm-mlm-enfr-1024", "distilbert-base-cased", "roberta-base", or "xlm-roberta-base".

layers

Specify the layers that should be extracted (default 11:12). It is more efficient to extract only the layers that you need (e.g., 12). Layer 0 is the decontextualized input layer (i.e., not comprising hidden states) and is thus not advised for use. These layers can then be aggregated in the textEmbedLayerAggregation function. If you want all layers, use 'all'.
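As a hedged sketch of the two-step workflow this argument alludes to (extracting layers once, then aggregating them separately), the companion functions textEmbedLayersOutput and textEmbedLayerAggregation from the 'text' package can be used; the example data and the exact parameter names below are assumptions, not verified signatures:

```r
library(text)

# Example dataset shipped with the 'text' package (assumed available)
x <- Language_based_assessment_data_8[1:2, 1:2]

# Step 1: extract the raw hidden-state layers (layers 11 and 12)
emb_layers <- textEmbedLayersOutput(x, layers = 11:12)

# Step 2: aggregate the extracted layers to word embeddings
# (the 'aggregation' parameter name is an assumption)
emb <- textEmbedLayerAggregation(emb_layers$context,
                                 layers = 11:12,
                                 aggregation = "mean")
```

Extracting once and aggregating afterwards avoids re-running the transformer when experimenting with different aggregation settings.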

contexts

Provide word embeddings based on word contexts (standard method; default = TRUE).

context_layers

Specify the layers that should be aggregated (default: the layers extracted above). Layer 0 is the decontextualized input layer (i.e., not comprising hidden states) and is thus not advised for use.

context_aggregation_layers

Method to aggregate the contextualized layers: "mean", "min", or "max", which take the mean, minimum, or maximum, respectively, across each column; or "concatenate", which links each word embedding layer into one long row.

context_aggregation_tokens

Method to aggregate the contextualized tokens: "mean", "min", or "max", which take the mean, minimum, or maximum, respectively, across each column; or "concatenate", which links each token's embedding into one long row.
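The difference between these aggregation methods can be illustrated with a toy example in base R (this is an illustration of the semantics, not the package internals):

```r
# Two "layers" of a 3-dimensional embedding for a single token
layer11 <- c(0.1, 0.2, 0.3)
layer12 <- c(0.3, 0.4, 0.5)

# "mean": element-wise mean across layers -> one 3-dimensional vector
agg_mean <- colMeans(rbind(layer11, layer12))
# agg_mean is c(0.2, 0.3, 0.4)

# "concatenate": link the layers into one long 6-dimensional vector
agg_concat <- c(layer11, layer12)
# length(agg_concat) is 6
```

"mean" keeps the embedding dimension fixed regardless of how many layers are aggregated, whereas "concatenate" multiplies it by the number of layers.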

context_tokens_select

Option to select word embeddings linked to specific tokens such as [CLS] and [SEP] for the context embeddings.

context_tokens_deselect

Option to deselect embeddings linked to specific tokens such as [CLS] and [SEP] for the context embeddings.

decontexts

Provide decontextualized word embeddings, where each single word is given as input on its own (e.g., used for plotting; default = TRUE).

decontext_layers

Layers to aggregate for the decontextualized embeddings (default: the layers extracted above).

decontext_aggregation_layers

Method to aggregate the decontextualized layers: "mean", "min", or "max", which take the mean, minimum, or maximum, respectively, across each column; or "concatenate", which links each word embedding layer into one long row.

decontext_aggregation_tokens

Method to aggregate the decontextualized tokens: "mean", "min", or "max", which take the mean, minimum, or maximum, respectively, across each column; or "concatenate", which links each token's embedding into one long row.

decontext_tokens_select

Option to select embeddings linked to specific tokens such as [CLS] and [SEP] for the decontext embeddings.

decontext_tokens_deselect

Option to deselect embeddings linked to specific tokens such as [CLS] and [SEP] for the decontext embeddings.

device

Name of device to use: 'cpu', 'gpu', or 'gpu:k', where k is a specific device number.

model_max_length

The maximum length (in number of tokens) for the inputs to the transformer model (default the value stored for the associated model).

logging_level

Set the logging level (default: "error", matching the function signature above). Options, ordered from least to most logging: critical, error, warning, info, debug.

## Value

A tibble with tokens, a column identifying the layer, and the word embeddings. Note that layer 0 is the input embedding to the transformer.

## Examples

# comment(word_embeddings$satisfactionwords)
# comment(word_embeddings$singlewords_we)
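A hedged end-to-end sketch of a call, assuming the 'text' package is installed and its Python backend has been initialized; the input data and the column name 'satisfactionwords' are made up for illustration:

```r
library(text)

# Hypothetical input data (one character variable)
texts <- tibble::tibble(
  satisfactionwords = c("happy and relaxed", "stressed but hopeful")
)

word_embeddings <- textEmbed(
  texts,
  model = "bert-base-uncased",
  layers = 11:12,
  context_aggregation_layers = "concatenate",
  context_aggregation_tokens = "mean",
  device = "cpu"
)

# Contextualized embeddings for the character variable:
# comment(word_embeddings$satisfactionwords)
# Decontextualized single-word embeddings (since decontexts = TRUE):
# comment(word_embeddings$singlewords_we)
```

With layers 11:12 concatenated, each contextualized embedding row is twice the model's hidden-state dimension.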