How to best manage computationally heavy analyses
Many NLP analyses require substantial computational resources, but standard-sized datasets in psychology can be analysed with text on a standard laptop. It is rather the pre-training of the language models that requires a huge amount of computational resources. Wolf et al. (2020) exemplify this by pointing out that the RoBERTa language model:
“was trained on 160 GB of text using 1024 32GB V100. On Amazon-Web-Services cloud computing (AWS), such a pretraining would cost approximately 100K USD.” (p. 2)
Hence, there are a lot of computations behind the word embeddings. In text, the most computationally heavy and time-consuming step is retrieving word embeddings using textEmbedLayersOutput (which is also used in textEmbed). Retrieving word embeddings for a standard dataset with a few hundred participants may take between 15 minutes and an hour. Hence, it is worth planning analyses. Some advice on managing time and resources:
- Testing: Before you run the analyses on your entire dataset, ensure that everything runs smoothly on a small part of it (e.g., 20 rows of your data). Note that this should not be a process of testing different settings, but only a check that everything works.
- Timekeeping: When running different analyses, have the computer record the time (see code example below), so that you get a better understanding of how long different analyses take.
- Scheduling: Run the analyses that take longer over a coffee break or overnight. For example, it might be worth retrieving all word embeddings at a separate time.
library(text)
# Save starting time
T1 <- Sys.time()
textEmbed(Language_based_assessment_data_8_10[1, 1],
          layers = 12,
          decontexts = FALSE)
# Save stopping time
T2 <- Sys.time()
# Compute time taken to run the above function
T2 - T1
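The testing and timekeeping advice can be combined: time a step on a small subset and extrapolate to the full dataset. The following is a minimal sketch in base R, not part of the text package. The names `estimate_runtime`, `slow_step` and `example_data` are hypothetical, and the linear extrapolation is an assumption; real run times also depend on text length and on one-off costs such as loading the language model.

```r
# Hypothetical helper (not part of the text package): time a function on the
# first n_test rows and extrapolate linearly to the full data set.
estimate_runtime <- function(data, fun, n_test = 20) {
  n_test <- min(n_test, nrow(data))
  subset_rows <- data[seq_len(n_test), , drop = FALSE]

  # system.time() reports elapsed wall-clock time in seconds
  elapsed <- system.time(fun(subset_rows))[["elapsed"]]

  # Assumption: run time grows roughly linearly with the number of rows
  list(
    test_rows = n_test,
    test_seconds = elapsed,
    estimated_total_seconds = elapsed * nrow(data) / n_test
  )
}

# Stand-in for a slow analysis step; replace with your own call,
# for example a function that runs textEmbed() on the rows it receives.
slow_step <- function(d) {
  Sys.sleep(0.01 * nrow(d))
  invisible(NULL)
}

example_data <- data.frame(text = rep("some text", 200))
est <- estimate_runtime(example_data, slow_step, n_test = 20)
est$estimated_total_seconds
```

Treat the estimate as a rough guide for deciding whether an analysis fits into a coffee break or should run overnight.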
Your system’s capacity
Thinking about your computer’s memory capacity may become important if you have a lot of data and use multiple layers with many dimensions. For example, one sentence of 10 words/tokens, each represented by 12 layers of 768 dimensions, results in 92,160 values (i.e., 10 x 12 x 768). To avoid running out of memory and to get analyses to run faster, consider the following:
- Only retrieve the layers that you plan to use (e.g., layers = 11:12) rather than retrieving all layers (i.e., layers = 'all').
- Only retrieve tokens if you are planning to use them; otherwise set return_tokens = FALSE.
- Do not ask for decontextualized word embeddings if you are not going to use them (e.g., in plotting).
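The arithmetic above can be turned into a rough memory estimate before retrieving embeddings. This is a back-of-the-envelope sketch in base R: the helper names `embedding_values` and `embedding_megabytes` are illustrative, and it assumes each value is stored as an R double (8 bytes) while ignoring any object overhead, so actual memory use will differ.

```r
# Number of values needed to store token-level embeddings
embedding_values <- function(n_tokens, n_layers, n_dims) {
  n_tokens * n_layers * n_dims
}

# Assumption: each value is stored as an R double (8 bytes); overhead ignored
embedding_megabytes <- function(n_tokens, n_layers, n_dims) {
  embedding_values(n_tokens, n_layers, n_dims) * 8 / 1024^2
}

# One sentence of 10 tokens, 12 layers, 768 dimensions
embedding_values(10, 12, 768)   # 92160

# The same sentence when only layers 11:12 are retrieved
embedding_values(10, 2, 768)    # 15360

# Hypothetical data set: 500 texts of 100 tokens each
embedding_megabytes(500 * 100, 12, 768)   # all 12 layers
embedding_megabytes(500 * 100, 2, 768)    # only 2 layers
```

Under these assumptions, the hypothetical dataset needs roughly 3.4 GB for all 12 layers but under 0.6 GB for two layers, which illustrates why retrieving only the layers you plan to use matters.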
References
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., & Funtowicz, M. (2019). Huggingface’s transformers: State-of-the-art natural language processing.