The text-package uses natural language processing and machine learning methods to examine text and numerical variables.
Central text functions are described below. The data and methods come from Kjell et al. (2019, pre-print), who show how individuals' open-ended text answers can be used to measure, describe, and differentiate psychological constructs.
In short, the workflow first transforms text variables into word embeddings. These word embeddings are then used to, for example, predict numerical variables, compute semantic similarity scores, statistically test differences in meaning between two sets of texts, and plot words in the word embedding space.
The textEmbed() function automatically transforms character variables in a given tibble into word embeddings. The example data used in this tutorial come from participants who described their harmony in life and satisfaction with life using a text response, 10 descriptive words, or rating scales. For a more detailed description, please see the word embedding tutorial.
    library(text)

    # Get example data including both text and numerical variables
    sq_data <- Language_based_assessment_data_8

    # Transform the text data to BERT word embeddings
    wordembeddings <- textEmbed(sq_data)

    # See how word embeddings are structured
    wordembeddings

    # Save the word embeddings to avoid having to import the text every time
    saveRDS(wordembeddings, "wordembeddings.rds")

    # Get the word embeddings again
    wordembeddings <- readRDS("_YOURPATH_/wordembeddings.rds")
textTrain() examines how well the word embeddings from a text can predict a numeric variable. It does so by training a ridge regression model on the word embeddings with 10-fold cross-validation (the word embeddings are first pre-processed using PCA). In the example below, we examine how well the harmony text responses can predict the rating scale scores from the Harmony in life scale.
    library(text)
    library(rio)

    # Load data that has already gone through textEmbed
    # The previous example only imported 10 participants;
    # whereas below we load data from 100 participants
    wordembeddings <- rio::import("https://r-text.org/text_data_examples/wordembeddings4_100.rda")

    # Load corresponding numeric variables
    numeric_data <- rio::import("https://r-text.org/text_data_examples/Language_based_assessment_data_8_100.rda")

    # Examine the relationship between harmonytext and the corresponding rating scale
    model_htext_hils <- textTrain(wordembeddings$harmonytexts,
                                  numeric_data$hilstotal,
                                  penalty = 1)

    # Examine the correlation between predicted and observed Harmony in life scale scores
    model_htext_hils$correlation
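To make the training procedure more concrete, the following base-R sketch walks through the three steps named above (PCA, ridge regression, and 10-fold cross-validation) on simulated data. It is only an illustration under stated assumptions and not the internals of textTrain(): the simulated embeddings, the number of retained components (`n_comp`), and the helpers `ridge_fit()` and `ridge_predict()` are all hypothetical.

```r
set.seed(1)

# Simulated "embeddings" (assumption, not package data):
# 100 participants x 50 dimensions driven by 5 latent factors
n <- 100
p <- 50
latent <- matrix(rnorm(n * 5), nrow = n)
loadings <- matrix(rnorm(5 * p), nrow = 5)
x <- latent %*% loadings + matrix(rnorm(n * p, sd = 0.5), nrow = n)

# Simulated rating-scale score that depends on the latent factors plus noise
y <- as.numeric(latent %*% rep(1, 5) + rnorm(n))

# Hypothetical helper: closed-form ridge regression on centred predictors
ridge_fit <- function(x, y, penalty) {
  x_mean <- colMeans(x)
  y_mean <- mean(y)
  xc <- sweep(x, 2, x_mean)
  beta <- solve(crossprod(xc) + penalty * diag(ncol(xc)),
                crossprod(xc, y - y_mean))
  list(beta = beta, x_mean = x_mean, y_mean = y_mean)
}

# Hypothetical helper: predict from a fit returned by ridge_fit()
ridge_predict <- function(fit, x_new) {
  as.numeric(sweep(x_new, 2, fit$x_mean) %*% fit$beta + fit$y_mean)
}

n_folds <- 10
n_comp <- 10 # number of principal components to keep (assumption)
folds <- sample(rep(seq_len(n_folds), length.out = n))
predicted <- numeric(n)

for (k in seq_len(n_folds)) {
  test <- folds == k
  # Fit the PCA on the training fold only, then project both folds
  pca <- prcomp(x[!test, , drop = FALSE], center = TRUE, scale. = FALSE)
  train_pc <- pca$x[, seq_len(n_comp), drop = FALSE]
  test_pc <- predict(pca, x[test, , drop = FALSE])[, seq_len(n_comp), drop = FALSE]
  fit <- ridge_fit(train_pc, y[!test], penalty = 1)
  predicted[test] <- ridge_predict(fit, test_pc)
}

# Correlation between out-of-fold predictions and observed scores
cv_correlation <- cor(predicted, y)
cv_correlation
```

Fitting the PCA inside each fold keeps information from the held-out participants out of the pre-processing step, so the resulting correlation is an out-of-sample estimate.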
The textSimilarityTest() function provides a permutation-based test of whether two sets of texts differ significantly in meaning. It produces a p-value and an estimate that serves as an effect size. Below we examine whether the harmony text and satisfaction text responses differ in meaning.
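The logic of a permutation-based test of difference in meaning can be sketched in base R. The example below runs on simulated embeddings and is not the implementation of textSimilarityTest(): the test statistic (cosine distance between the mean embeddings of the two sets), the number of permutations, and all object and function names are assumptions chosen for illustration.

```r
set.seed(42)

# Simulated embeddings for two sets of texts (assumption, not package data):
# 40 texts x 20 dimensions each, with the second set shifted in 5 dimensions
n_texts <- 40
n_dims <- 20
set_a <- matrix(rnorm(n_texts * n_dims, mean = 1), nrow = n_texts)
set_b <- matrix(rnorm(n_texts * n_dims, mean = 1), nrow = n_texts)
set_b[, 1:5] <- set_b[, 1:5] + 1

cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Test statistic: cosine distance between the mean embeddings of the two sets
meaning_distance <- function(emb, in_first_set) {
  1 - cosine_similarity(
    colMeans(emb[in_first_set, , drop = FALSE]),
    colMeans(emb[!in_first_set, , drop = FALSE])
  )
}

all_embeddings <- rbind(set_a, set_b)
in_first_set <- rep(c(TRUE, FALSE), each = n_texts)
observed <- meaning_distance(all_embeddings, in_first_set)

# Null distribution: shuffle the set labels and recompute the statistic
n_permutations <- 1000
permuted <- replicate(
  n_permutations,
  meaning_distance(all_embeddings, sample(in_first_set))
)

# Permutation p-value with a +1 correction so that it is never exactly zero
p_value <- (sum(permuted >= observed) + 1) / (n_permutations + 1)

c(estimate = observed, p_value = p_value)
```

Shuffling the labels breaks any link between a text and its set, so the permuted statistics show how large the distance would typically be if the two sets did not differ in meaning.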
Plotting is done in two steps. First, the textProjection() function pre-processes the data, including computing statistics for each word to be plotted. Second, textProjectionPlot() visualizes the words, with many options for setting the color, font, etc. of the figure. Dividing the procedure into two steps makes the process more transparent (since the user gets to see the output that the words are plotted according to) and quicker: the heavier computations are made in the first step, so the last step runs fast and one can try different design settings.
    library(text)

    # Pre-process word data to be plotted with the textProjectionPlot() function
    # wordembeddings4 and Language_based_assessment_data_8 contain example data provided with the package.

    # Pre-process data
    df_for_plotting <- textProjection(
      Language_based_assessment_data_8$harmonywords,
      wordembeddings4$harmonywords,
      wordembeddings4$singlewords_we,
      Language_based_assessment_data_8$hilstotal,
      Language_based_assessment_data_8$swlstotal
    )
    df_for_plotting
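For intuition about what a projection score per word might look like, here is a minimal base-R sketch of projecting word embeddings onto the direction between two group centroids with a dot product. This is an assumption about the general idea suggested by the output column names (for example `dot.x`), not the package's actual computation; the words, the grouping, and the simulated embeddings are all hypothetical.

```r
set.seed(7)

# Simulated word embeddings (assumption): 6 words x 4 dimensions
words <- c("peace", "calm", "balance", "stress", "conflict", "worry")
word_embeddings <- matrix(
  rnorm(length(words) * 4),
  nrow = length(words),
  dimnames = list(words, NULL)
)

# Make the first dimension separate the two hypothetical groups
word_embeddings[1:3, 1] <- word_embeddings[1:3, 1] + 3
word_embeddings[4:6, 1] <- word_embeddings[4:6, 1] - 3

# Hypothetical grouping: words used by high versus low scorers
high_group <- c("peace", "calm", "balance")
low_group <- c("stress", "conflict", "worry")

# Direction from the low-group centroid to the high-group centroid
centroid_high <- colMeans(word_embeddings[high_group, , drop = FALSE])
centroid_low <- colMeans(word_embeddings[low_group, , drop = FALSE])
direction <- centroid_high - centroid_low

# Centre the words at the midpoint between the centroids,
# then project each word onto the direction with a dot product
midpoint <- (centroid_high + centroid_low) / 2
dot_x <- as.numeric(sweep(word_embeddings, 2, midpoint) %*% direction)
names(dot_x) <- words

# Words sorted from the low end to the high end of the projection
sort(dot_x)
```

A second dimension (such as a `dot.y` score) could be obtained in the same way from a second grouping variable, and a permutation scheme over group labels could supply p-values for each word.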
    library(text)

    # The data used (DP_projections_HILS_SWLS_100) has
    # been pre-processed with the textProjection function
    plot_projection <- textProjectionPlot(
      word_data = DP_projections_HILS_SWLS_100,
      k_n_words_to_test = FALSE,
      plot_n_words_square = 5,
      plot_n_words_p = 5,
      plot_n_word_extreme = 1,
      plot_n_word_frequency = 1,
      plot_n_words_middle = 1,
      y_axes = TRUE,
      p_alpha = 0.05,
      title_top = " Supervised Bicentroid Projection of Harmony in life words",
      x_axes_label = "Low vs. High HILS score",
      y_axes_label = "Low vs. High SWLS score",
      p_adjust_method = "bonferroni",
      points_without_words_size = 0.4,
      points_without_words_alpha = 0.4
    )
    plot_projection
    #> $final_plot
    #>
    #> $description
    #> "INFORMATION ABOUT THE PROJECTION INFORMATION ABOUT THE PLOT word_data = DP_projections_HILS_SWLS_100 k_n_words_to_test = FALSE min_freq_words_test = 1 min_freq_words_plot = 1 plot_n_words_square = 5 plot_n_words_p = 5 plot_n_word_extreme = 1 plot_n_word_frequency = 1 plot_n_words_middle = 1 y_axes = TRUE p_alpha = 0.05 p_adjust_method = bonferroni bivariate_color_codes = #398CF9 #60A1F7 #5dc688 #e07f6a #EAEAEA #40DD52 #FF0000 #EA7467 #85DB8E word_size_range = 3 - 8 position_jitter_hight = 0 position_jitter_width = 0.03 point_size = 0.5 arrow_transparency = 0.5 points_without_words_size = 0.4 points_without_words_alpha = 0.4 legend_x_position = 0.02 legend_y_position = 0.02 legend_h_size = 0.2 legend_w_size = 0.2 legend_title_size = 7 legend_number_size = 2"
    #>
    #> $processed_word_data
    #> # A tibble: 583 x 32
    #>    words    dot.x p_values_dot.x n_g1.x n_g2.x  dot.y p_values_dot.y n_g1.y
    #>    <chr>    <dbl>          <dbl>  <dbl>  <dbl>  <dbl>          <dbl>  <dbl>
    #>  1 able   6.86e-1          0.194     NA      1   2.31         0.0123     NA
    #>  2 acce…  1.52e+0         0.0272     -1      2   1.15         0.0620     -1
    #>  3 acco…  2.14e+0        0.00856     NA      1   3.51        0.00273     NA
    #>  4 acti…  1.23e+0         0.0503     NA      1   1.56         0.0361     NA
    #>  5 adap… -3.87e-4          0.969     -1     NA  0.331          0.476     -1
    #>  6 admi…  5.14e-1          0.315     NA      1   1.52         0.0398     NA
    #>  7 adri… -3.79e+0     0.00000100     -1     NA  -3.60     0.00000100     -1
    #>  8 affi…  7.49e-1          0.150     NA      1   2.04         0.0184     NA
    #>  9 agre…  2.23e+0        0.00626     NA      1   1.69         0.0312     NA
    #> 10 alco… -5.51e-1          0.318     -1     NA  -1.07         0.0605     -1
    #> # … with 573 more rows, and 24 more variables: n_g2.y <dbl>, n <int>,
    #> #   n.percent <dbl>, N_participant_responses <int>, adjusted_p_values.x <dbl>,
    #> #   adjusted_p_values.y <dbl>, square_categories <dbl>, check_p_square <dbl>,
    #> #   check_p_x_neg <dbl>, check_p_x_pos <dbl>, check_extreme_max_x <dbl>,
    #> #   check_extreme_min_x <dbl>, check_extreme_frequency_x <dbl>,
    #> #   check_middle_x <dbl>, extremes_all_x <dbl>, check_p_y_pos <dbl>,
    #> #   check_p_y_neg <dbl>, check_extreme_max_y <dbl>, check_extreme_min_y <dbl>,
    #> #   check_extreme_frequency_y <dbl>, check_middle_y <dbl>,
    #> #   extremes_all_y <dbl>, extremes_all <dbl>, colour_categories <chr>
The text package is new and has not yet been used in a publication. Therefore, the list below consists of papers that analyze human language in ways that are possible with text.
Gaining insights from social media language: Methodologies and challenges.
Kern et al., (2016). Psychological Methods.
Semantic measures: Using natural language processing to measure, differentiate, and describe psychological constructs. Pre-print
Kjell et al., (2019). Psychological Methods.
Facebook language predicts depression in medical records
Eichstaedt, J. C., … & Schwartz, H. A. (2018). PNAS.
Social and Personality Psychology
Personality, gender, and age in the language of social media: The open-vocabulary approach
Schwartz, H. A., … & Seligman, M. E. (2013). PLoS ONE.
Automatic Personality Assessment Through Social Media Language
Park, G., Schwartz, H. A., … & Seligman, M. E. P. (2014). Journal of Personality and Social Psychology.
Psychological language on Twitter predicts county-level heart disease mortality
Eichstaedt, J. C., Schwartz, et al. (2015). Psychological Science.
The Harmony in Life Scale Complements the Satisfaction with Life Scale: Expanding the Conceptualization of the Cognitive Component of Subjective Well-Being
Kjell, et al., (2016). Social Indicators Research
Computer Science: Python Software
DLATK: Differential language analysis toolkit
Schwartz, H. A., Giorgi, et al., (2017). In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations