A word embedding comprises values that represent the latent meaning
of a word. The numbers may be seen as coordinates in a space that
comprises several hundred dimensions. The closer two words’ embeddings
are positioned in this embedding space, the more similar the words are
in meaning. Hence, embeddings
reflect the relationships among words, where proximity in the embedding
space represents similarity in latent meaning.
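As a minimal sketch of what proximity in the embedding space can mean, the example below compares toy embeddings with cosine similarity. The choice of cosine similarity, the helper name cosine_similarity, and all values are assumptions made for illustration; real embeddings have several hundred dimensions.

```r
# Toy 4-dimensional embeddings; the values are invented for illustration.
embeddings <- rbind(
  happy  = c(0.9, 0.8, 0.1, 0.0),
  joyful = c(0.8, 0.9, 0.2, 0.1),
  table  = c(0.1, 0.0, 0.9, 0.8)
)

# Cosine similarity: the dot product of two vectors divided by the
# product of their lengths (hypothetical helper, not part of text).
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

sim_happy_joyful <- cosine_similarity(embeddings["happy", ], embeddings["joyful", ])
sim_happy_table  <- cosine_similarity(embeddings["happy", ], embeddings["table", ])

# Words with similar meaning should score higher than unrelated words.
sim_happy_joyful > sim_happy_table
```

In this toy data, "happy" and "joyful" point in nearly the same direction, so their similarity is close to 1, whereas "happy" and "table" share little and score much lower.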
The text package uses already existing language models to map text data to high-quality word embeddings.
To represent several words, sentences and paragraphs, word embeddings of single words may be combined or aggregated into one word embedding. This can be achieved by taking the mean, minimum or maximum value of each dimension of the embeddings.
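The aggregation described above can be sketched in base R. This is an illustration with invented values and hypothetical variable names, not the code that text runs internally: each row is the embedding of one word, and the text-level embedding is computed column by column.

```r
# Toy embeddings for the three words in "I am happy"; one row per word,
# one column per dimension (values invented for illustration).
word_embeddings <- rbind(
  I     = c(0.2, -0.5,  0.1),
  am    = c(0.4,  0.1,  0.3),
  happy = c(0.9,  0.7, -0.2)
)

# Aggregate across words, separately for each dimension (column).
text_mean <- apply(word_embeddings, 2, mean)
text_min  <- apply(word_embeddings, 2, min)
text_max  <- apply(word_embeddings, 2, max)

# One aggregated embedding per method, each with as many values as dimensions.
rbind(mean = text_mean, min = text_min, max = text_max)
```

Whichever method is used, the result has the same number of dimensions as a single word embedding, so texts of different lengths become directly comparable.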
This tutorial focuses on how to retrieve layers and how to
aggregate them to obtain word embeddings in text.
The focus will be on the actual functions.
For more detailed information about word embeddings and the language
models in regard to text, please see text: An R-package
for Analyzing and Visualizing Human Language Using Natural Language
Processing and Deep Learning. For more comprehensive
information about the inner workings of the language models, such as
BERT, see the references given in Table 1.
Table 1 shows some of the more common language models; for more detailed information, see HuggingFace.
|Model|Reference|Layers|Dimensions|Language|
|---|---|---|---|---|
|‘bert-base-uncased’|Devlin et al. 2019|12|768|English|
|‘roberta-base’|Liu et al. 2019|12|768|English|
|‘distilbert-base-cased’|Sanh et al. 2019|6|768|English|
|‘bert-base-multilingual-cased’|Devlin et al. 2019|12|768|Top 104 languages on Wikipedia|
|‘xlm-roberta-large’|Conneau et al. 2020|24|1024|100 languages|
The main function to transform text to word embeddings is
textEmbed(). First, provide a tibble containing the
text variable(s) that you want to transform (note that it is OK to
submit other variables too; the function will only grab the character
variables). Second, set the language model; using a setting among the
options for model ensures that you use a model that has been tested
with text.
Setting the advanced options pretrained_weights (e.g., to
pretrained_weights = 'bert-base-uncased'), tokenizer_class (e.g., to
tokenizer_class = BertTokenizer) and model_class (e.g., to
model_class = BertModel, with model = NULL) allows you to set a model
directly with the HuggingFace interface. Make sure that the
pretrained_weights, tokenizer_class, and model_class fit together
(otherwise you will get an error).
Third, decide whether you want contextualized and/or decontextualized
word embeddings by setting the contexts and decontexts parameters to
TRUE/FALSE. Contextualized word embeddings are standard and return word
embeddings that take into account the context in which the word was
used; the decontextualized word embeddings do not take into account the
context in which the word was used (and are used in the plot functions).
Last, select the number of layers you want to use and the way you want to aggregate them.
    library(text)

    # Transform the text data to BERT word embeddings
    wordembeddings <- textEmbed(x = Language_based_assessment_data_8,
                                model = 'bert-base-uncased',
                                contexts = TRUE,
                                layers = 11:12,
                                context_aggregation = "mean",
                                decontexts = TRUE,
                                decontext_layers = 11:12,
                                decontext_aggregation = "mean")

    # Save the word embeddings to avoid having to import the text every time
    # saveRDS(wordembeddings, "_YOURPATH_/wordembeddings.rds")

    # Get the word embeddings again
    # wordembeddings <- readRDS("_YOURPATH_/wordembeddings.rds")

    # See how word embeddings are structured
    wordembeddings
The textEmbed() function is suitable when you are just
interested in getting good word embeddings to test some research
hypothesis with. That is, the defaults are based on general experience
of what works. Under the hood, textEmbed() uses one function
for retrieving the layers (textEmbedLayersOutput()) and
another function for aggregating them
(textEmbedLayerAggreation()). So, if you are interested in
examining different layers and different aggregation methods, it is
better to split up the workflow so that you first retrieve all layers
(which takes the most time) and then test different aggregation methods.
The textEmbedLayersOutput() function is used to retrieve
the layers of hidden states.
    library(text)

    # Transform the text data to BERT word embeddings
    x <- Language_based_assessment_data_8[1:2, 1:2]
    wordembeddings_tokens_layers <- textEmbedLayersOutput(x,
                                                          contexts = TRUE,
                                                          decontexts = FALSE,
                                                          model = 'bert-base-uncased',
                                                          layers = 'all',
                                                          return_tokens = TRUE)
    wordembeddings_tokens_layers
The output from the aggregation step is the same as that of
textEmbed(), but now you have the
possibility to test different ways to aggregate the layers without
having to retrieve them from the language model again. In
textEmbedLayerAggreation(), you can select any combination
of the layers that you want to aggregate, and then you can choose to
aggregate them using the mean, the minimum, or the maximum value of
each dimension.
    library(text)

    # Aggregating layers 11 and 12 by taking the mean of each dimension.
    we_11_12_mean <- textEmbedLayerAggreation(word_embeddings_layers = wordembeddings_tokens_layers,
                                              layers = 11:12,
                                              aggregation = "mean")

    # Aggregating layers 11 and 12 by taking the minimum of each dimension across the two layers.
    we_11_12_min <- textEmbedLayerAggreation(word_embeddings_layers = wordembeddings_tokens_layers,
                                             layers = 11:12,
                                             aggregation = "min")

    # Aggregating layers 1 to 12 by taking the max value of each dimension across the 12 layers.
    we_1_12_max <- textEmbedLayerAggreation(word_embeddings_layers = wordembeddings_tokens_layers,
                                            layers = 1:12,
                                            aggregation = "max")
    we_1_12_max
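To make the idea of aggregating across layers concrete, here is a conceptual sketch in base R. It is not the implementation used by text: the array layout, the helper aggregate_layers, and all values are assumptions for illustration. The hidden states are stored once and then aggregated in several ways without touching a language model again.

```r
# Toy hidden states: 3 layers x 2 tokens x 2 dimensions (values invented).
# array() fills the first index (layer) fastest.
hidden_states <- array(
  c(1, 2, 3,  4, 5, 6,     # dimension d1: "hello" (layers 1-3), then "world"
    7, 8, 9,  10, 11, 12), # dimension d2: "hello" (layers 1-3), then "world"
  dim = c(3, 2, 2),
  dimnames = list(paste0("layer", 1:3), c("hello", "world"), c("d1", "d2"))
)

# Hypothetical helper: collapse the selected layers into one value per
# token and dimension using the mean, minimum, or maximum.
aggregate_layers <- function(states, layers, aggregation = c("mean", "min", "max")) {
  aggregation <- match.arg(aggregation)
  fun <- switch(aggregation, mean = mean, min = min, max = max)
  selected <- states[layers, , , drop = FALSE]
  # Margins 2 and 3 (tokens and dimensions) are kept; the layer axis is collapsed.
  apply(selected, c(2, 3), fun)
}

toy_2_3_mean <- aggregate_layers(hidden_states, layers = 2:3, aggregation = "mean")
toy_2_3_min  <- aggregate_layers(hidden_states, layers = 2:3, aggregation = "min")
toy_1_3_max  <- aggregate_layers(hidden_states, layers = 1:3, aggregation = "max")

toy_2_3_mean
```

Each result is a tokens-by-dimensions matrix, so changing the layer selection or the aggregation method only repeats the inexpensive final step.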
Now the word embeddings are ready to be used in downstream tasks, such as predicting numeric variables or plotting words according to different dimensions.