How to best manage computationally heavy analyses • text

Many NLP analyses require a lot of computational resources; but standard sized datasets in psychology can be analysed with text on a standard laptop. It is rather the computation of the pre-trained language models that require a huge amount of computational resources. Wolf et al., (2020) exemplify this by pointing out that the RoBERTa language model:

“was trained on 160 GB of text using 1024 32GB V100. On Amazon-Web-Services cloud computing (AWS), such a pretraining would cost approximately 100K USD.” (p. 2)

Hence, there are a lot of computations behind the word embeddings. In text, the most computationally heavy and time consuming elements are the process of retrieving word embeddings using textEmbedLayersOutput (which is also used in textEmbed). Retrieving word embeddings for a standard dataset with a few hundred participants may take between 15 minutes to an hour. Hence, it is worth planning analyses. A few time and resource management advice include:

Testing: Before you run the analyses on your entire dataset, ensure that everything first runs smoothly on a small part of the data set (e.g., 20 rows of your data). Note that this should not be a process of testing different setting, but only to see that everything works.
Timekeeping: When running different analyses have the computer take time (see code example below), so that you get a better understanding of how long time different analyses take.
Scheduling: Run those analyses that take longer time over a coffee break or over night. So for example, it might be worth retrieving all word embeddings at a separate time.


library(text)
# Save starting time
T1 <- Sys.time()
textEmbed(Language_based_assessment_data_8_10[1,1], 
          layers = 12, 
          decontexts = FALSE)
# Save stoping time
T2 <- Sys.time()

# Compute time taken to run above function
T2-T1

Your system’s capacity

Thinking about your computer’s memory capacity may become important if you have a lot of data and use multiple layers with many dimensions. For example consider that one sentence of 10 words/tokens, which are each represented by 12 layers a 768 dimensions results in 92 160 values (i.e., 10 x 12 x 768). To avoid running out of memory and get analyses to run faster, consider to:

Only retrieve the layers that you plan to use (e.g., layers = 11:12) rather than retrieving all layers (i.e., layers = 'all').
Only retrieve tokens if you are planing to use them; otherwise set return_tokens = FALSE.
Do not ask for decontextualized word embeddings if you are not going to us them (e.g., in plotting).

References

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., & Funtowicz, M. (2019). Huggingface’s transformers: State-of-the-art natural language processing.