A word embedding is a vector of numeric values that represents the latent
meaning of a word. The numbers may be seen as coordinates in a space that
comprises several hundred dimensions. The closer two words’ embeddings are
positioned in this embedding space, the more similar the words are in
meaning. Hence, embeddings reflect the relationships among words, where
proximity in the embedding space represents similarity in latent meaning. The
text-package enables you to use existing Transformers (language models
from Hugging Face) to map text data to high-quality word embeddings.
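As a small illustration of proximity in an embedding space, the sketch below compares toy vectors with cosine similarity. The vectors, their four dimensions and the helper name `cosine_similarity` are invented for this example, and cosine similarity is only one common way to measure closeness; none of this is taken from the text-package itself.

```r
# Toy 4-dimensional "embeddings" with invented values.
# Real language models produce several hundred dimensions.
happy  <- c(0.9, 0.1, 0.8, 0.2)
joyful <- c(0.8, 0.2, 0.9, 0.1)
chair  <- c(0.1, 0.9, 0.2, 0.8)

# Hypothetical helper: cosine of the angle between two vectors.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

sim_close <- cosine_similarity(happy, joyful)
sim_far   <- cosine_similarity(happy, chair)

# Vectors that point in similar directions get the higher score.
sim_close > sim_far
```

With these toy values, the pair with similar coordinates scores higher than the dissimilar pair, which is the sense in which proximity stands in for similarity in meaning.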
To represent several words, sentences or paragraphs, the word embeddings of single words may be combined or aggregated into one word embedding. This can be achieved by taking the mean, minimum or maximum value of each dimension across the embeddings.
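The aggregation described above can be sketched in base R. The token embeddings here are invented three-dimensional toy values, and the object names are chosen only for this illustration; the text-package performs the corresponding step internally.

```r
# Rows are tokens, columns are embedding dimensions (toy values).
token_embeddings <- rbind(
  i     = c(0.2, -0.5, 0.1),
  am    = c(0.4,  0.3, 0.0),
  happy = c(0.9,  0.1, 0.8)
)

# Collapse the three token vectors into one vector per method,
# working dimension by dimension (that is, column by column).
mean_embedding <- colMeans(token_embeddings)
min_embedding  <- apply(token_embeddings, 2, min)
max_embedding  <- apply(token_embeddings, 2, max)

mean_embedding
min_embedding
max_embedding
```

Each result has the same number of dimensions as a single token embedding, so texts of different lengths end up with comparable vectors.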
This tutorial focuses on how to retrieve layers and how to
aggregate them to obtain word embeddings in text.
The focus will be on the actual functions.
For more detailed information about word embeddings and the language models in
text, please see text: An R-package for Analyzing and
Visualizing Human Language Using Natural Language Processing and Deep
Learning. For more comprehensive information about the inner
workings of the language models, see, for example, Illustrated BERT
or the references given in Table 1.
Table 1 shows some of the more common language models; for more detailed information, see HuggingFace.
|Model|Reference|Layers|Dimensions|Language|
|---|---|---|---|---|
|'bert-base-uncased'|Devlin et al. 2019|12|768|English|
|'roberta-base'|Liu et al. 2019|12|768|English|
|'distilbert-base-cased'|Sanh et al. 2019|6|768|English|
|'bert-base-multilingual-cased'|Devlin et al. 2019|12|768|The 104 languages with the largest Wikipedias|
|'xlm-roberta-large'|Conneau et al. 2020|24|1024|100 languages|
The text-package has three functions for mapping text to
word embeddings.
textEmbed() is the high-level
function, which encompasses
textEmbedRawLayers() and textEmbedLayerAggregation().
textEmbedRawLayers() retrieves layers and hidden states
from a given language model, and
textEmbedLayerAggregation() aggregates these layers in
order to form word embeddings.
textEmbed() selects character variables in a given
dataset (a dataframe/tibble) and transforms these to word embeddings. It
can output contextualized (and decontextualized) embeddings for both
tokens and texts.
Set the language model that you want using the
model parameter. The text-package automatically downloads
the model from HuggingFace the first time it is called. The
layers parameter controls the layer(s) to extract
(default is the second to last layer). The
function also provides parameters to aggregate the layers in various
ways. The aggregation_from_layers_to_tokens parameter
controls how to aggregate layers representing the same token (default is
“concatenate”). The aggregation_from_tokens_to_texts
parameter controls how embeddings from different tokens should be
aggregated to represent a text (default = “mean”). There is also an
aggregation_from_tokens_to_word_types parameter, which
controls how the word type embeddings are aggregated.
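A call that sets these parameters explicitly might look like the sketch below. The wrapper name `embed_texts` is hypothetical and exists only so that the block can be defined without downloading a model; the argument names follow the parameters described above, and the values shown for layers and the first two aggregation parameters are the defaults as described in this tutorial (check `?textEmbed` for your installed version).

```r
# Hypothetical wrapper around text::textEmbed(); the model is only
# downloaded and run when the wrapper is actually called.
embed_texts <- function(texts, model = "bert-base-uncased") {
  text::textEmbed(
    texts = texts,
    model = model,
    # Second to last layer (the default described above).
    layers = -2,
    # How layers for the same token are combined.
    aggregation_from_layers_to_tokens = "concatenate",
    # How token embeddings are combined into one text embedding.
    aggregation_from_tokens_to_texts = "mean",
    # Also request word-type embeddings, aggregated by the mean.
    aggregation_from_tokens_to_word_types = "mean"
  )
}

# Usage (requires the text-package and its Python backend):
# embeddings <- embed_texts(c("I feel great today", "This was a sad story"))
```

Wrapping the call in a function is only a convenience for this sketch; calling `textEmbed()` directly with the same arguments is equivalent.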
Note that it is also possible to submit an entire dataset to
textEmbed(), as well as to retrieve only text-level and
word-type-level embeddings. Word-type-level embeddings are obtained by setting
aggregation_from_tokens_to_word_types to, for example,
“mean”. Word-type-level embeddings can be used for plotting words in the
embedding space.
The textEmbed() function is suitable when you are just
interested in getting good word embeddings to test some research
hypothesis. That is, the defaults are based on general experience
of what works. Under the hood,
textEmbed() uses one function
for retrieving the layers
(textEmbedRawLayers()) and another
function for aggregating them
(textEmbedLayerAggregation()).
So, if you are interested in examining different layers and different
aggregation methods, it is better to split up the workflow so that you
first retrieve all layers (which takes the most time) and then test
different aggregation methods.
The textEmbedRawLayers() function is used to retrieve the
layers of hidden states.
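A retrieval step could be sketched as follows. The wrapper name `retrieve_layers`, the choice of layers 10 to 12 and the example sentence are assumptions made for this illustration; the arguments `texts`, `model`, `layers` and `return_tokens` are believed to match the package's interface, but please verify them against `?textEmbedRawLayers`.

```r
# Hypothetical wrapper around text::textEmbedRawLayers(); nothing is
# downloaded or computed until the wrapper is called.
retrieve_layers <- function(texts,
                            model = "bert-base-uncased",
                            layers = 10:12) {
  text::textEmbedRawLayers(
    texts = texts,
    model = model,
    # Retrieve several layers at once so they can be aggregated
    # in different ways later without re-running the model.
    layers = layers,
    # Keep the tokens alongside their hidden states.
    return_tokens = TRUE
  )
}

# Usage (requires the text-package and its Python backend):
# wordembeddings_tokens_layers <- retrieve_layers("I am feeling happy today")
```

The object name in the usage comment matches the one used in the aggregation example in this tutorial, on the assumption that it was created in this way.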
The textEmbedLayerAggregation() function lets you
aggregate the layers in different ways (without having to
retrieve them from the language model several times). In
textEmbedLayerAggregation(), you can select any combination
of the layers that you want to aggregate, and then aggregate
them using the mean, minimum or maximum value of each dimension.
```r
library(text)

# `wordembeddings_tokens_layers` is assumed to hold the output of
# textEmbedRawLayers() and to include layers 10 to 12.

# Layers 11 and 12: concatenate the two layers for each token, then take
# the mean of each dimension across tokens.
we_11_12_mean <- textEmbedLayerAggregation(
  word_embeddings_layers = wordembeddings_tokens_layers$context_tokens$texts,
  layers = 11:12,
  aggregation_from_layers_to_tokens = "concatenate",
  aggregation_from_tokens_to_texts = "mean"
)
we_11_12_mean

# Layers 10 and 11: concatenate the two layers for each token, then take
# the minimum of each dimension across tokens.
we_10_11_min <- textEmbedLayerAggregation(
  word_embeddings_layers = wordembeddings_tokens_layers$context_tokens$texts,
  layers = 10:11,
  aggregation_from_layers_to_tokens = "concatenate",
  aggregation_from_tokens_to_texts = "min"
)
we_10_11_min

# Layer 11 only: take the maximum of each dimension across tokens.
we_11_max <- textEmbedLayerAggregation(
  word_embeddings_layers = wordembeddings_tokens_layers$context_tokens$texts,
  layers = 11,
  aggregation_from_tokens_to_texts = "max"
)
we_11_max
```
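To make the two aggregation stages more concrete, the base R sketch below mimics them on invented toy matrices. It is a conceptual illustration of what “concatenate” and “mean” across layers are understood to do, followed by a mean across tokens; it is not the package's internal code, and all object names are made up for this example.

```r
# Two toy layers for the same 2 tokens (rows), each with 3 dimensions.
layer_11 <- rbind(c(1, 2, 3), c(4, 5, 6))
layer_12 <- rbind(c(3, 2, 1), c(6, 5, 4))

# Layers to tokens, "concatenate": place the layers side by side,
# so each token gets 3 + 3 = 6 dimensions.
tokens_concat <- cbind(layer_11, layer_12)

# Layers to tokens, "mean": average the layers element-wise,
# so each token keeps 3 dimensions.
tokens_mean <- (layer_11 + layer_12) / 2

# Tokens to text, "mean": average each dimension across the tokens.
text_from_concat <- colMeans(tokens_concat)
text_from_mean   <- colMeans(tokens_mean)

length(text_from_concat)  # 6 dimensions
length(text_from_mean)    # 3 dimensions
```

Note that concatenating layers multiplies the number of dimensions by the number of layers, whereas the mean keeps the dimensionality of a single layer.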
Now the word embeddings are ready to be used in downstream tasks, such as predicting numeric variables, or to be plotted according to different dimensions.