# HuggingFace in R and Computer Capacity
Many NLP analyses require a lot of computational resources; however, standard-sized datasets in psychology can be analysed with the `text` package on a standard laptop. It is rather the pre-training of the language models themselves that requires a huge amount of computational resources. Wolf et al. (2019) exemplify this by pointing out that the RoBERTa language model:
> “was trained on 160 GB of text using 1024 32GB V100 GPUs. On Amazon-Web-Services cloud computing (AWS), such a pretraining would cost approximately 100K USD.” (p. 2)
Hence, there are a lot of computations behind the word embeddings. In the `text` package, the most computationally heavy and time-consuming element is the process of retrieving word embeddings using `textEmbedLayersOutput` (which is also used in `textEmbed`). Retrieving word embeddings for a standard dataset with a few hundred participants may take between 15 minutes and an hour, so it is worth planning these analyses. A few pieces of time- and resource-management advice follow; for example, start by timing the embedding of a small subset of the data:
```{r, eval = FALSE}
library(text)

# Save starting time
T1 <- Sys.time()

textEmbed(Language_based_assessment_data_8_10[1, 1],
  layers = 12,
  decontexts = FALSE
)

# Save stopping time
T2 <- Sys.time()

# Compute the time taken to run the function above
T2 - T1
```
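If embedding time scales roughly linearly with the number of texts, the timing above can be extrapolated to estimate how long the full dataset will take. This is a minimal sketch under that linearity assumption, reusing `T1` and `T2` from the chunk above:

```{r, eval = FALSE}
# Rough extrapolation, assuming embedding time grows roughly linearly
# with the number of text rows (a simplification).
time_per_row <- as.numeric(difftime(T2, T1, units = "mins"))
n_rows <- nrow(Language_based_assessment_data_8_10)

# Estimated minutes to embed the whole column
time_per_row * n_rows
```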
Thinking about your computer’s memory capacity may become important if you have a lot of data and use multiple layers with many dimensions. For example, consider that one sentence of 10 words/tokens, where each token is represented by 12 layers of 768 dimensions, results in 92,160 values (i.e., 10 x 12 x 768; a quick way to check such numbers is sketched after the list below). To avoid running out of memory and to get analyses to run faster, consider the following:
- Select fewer layers (e.g., `layers = 11:12`) rather than retrieving all layers (i.e., `layers = 'all'`).
- Set `return_tokens = FALSE` unless you need to inspect the individual tokens.
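To get a feel for these numbers, the calculation above can be reproduced in base R and compared with retrieving only the last two layers; the sentence length of 10 tokens is just an illustrative assumption:

```{r, eval = FALSE}
n_tokens <- 10   # illustrative sentence length (tokens)
n_dims   <- 768  # dimensions per layer in, e.g., BERT-base

# All 12 layers: 10 x 12 x 768 = 92,160 values
n_tokens * 12 * n_dims

# Only the last two layers (layers = 11:12): 10 x 2 x 768 = 15,360 values
n_tokens * 2 * n_dims
```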
## References

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., & Funtowicz, M. (2019). HuggingFace’s Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.