Compute Supervised Dimension Projection and related variables for plotting words.

textProjection(
  words,
  wordembeddings,
  single_wordembeddings = single_wordembeddings_df,
  x,
  y = NULL,
  pca = NULL,
  aggregation = "mean",
  split = "quartile",
  word_weight_power = 1,
  min_freq_words_test = 0,
  Npermutations = 10000,
  n_per_split = 50000,
  seed = 1003
)

## Arguments

words  Word or text variable to be plotted.

wordembeddings  Word embeddings from textEmbed for the words to be plotted (i.e., the aggregated word embeddings for the "words" parameter).

single_wordembeddings  Word embeddings from textEmbed for individual words (i.e., decontextualized embeddings).

x  Numeric variable according to which the words are plotted on the x-axis.

y  Numeric variable according to which the words are plotted on the y-axis (default = NULL).

pca  Number of PCA dimensions applied to the word embeddings at the beginning of the function. A number below 1 is interpreted as the proportion of variance to extract; an integer specifies the number of components to extract (default = NULL, as this setting has not yet been evaluated).

aggregation  Method used to aggregate the word embeddings (default = "mean"; see also "min", "max", and "[CLS]").

split  Method used to split the axes (default = "quartile", which selects the lower and upper quartiles; see also "mean"). If the variable contains only two distinct values (i.e., is dichotomous), a mean split is used.

word_weight_power  Power to which the word frequencies are raised; the result is multiplied with the word embeddings when computing the aggregated word embeddings for group low (1) and group high (2). This increases the weight of more frequent words.

min_freq_words_test  Option to select words that have occurred a specified number of times (default = 0) when creating the Supervised Dimension Projection line (i.e., single words receive a Supervised Dimension Projection and p-value).

Npermutations  Number of permutations used to create the null distribution.

n_per_split  Setting that splits Npermutations to avoid reaching computer memory limits; higher values are faster, but too high a value may cause the computation to abort.

seed  Seed for random number generation; set a different value to change it.
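To make the split and word_weight_power arguments more concrete, the following base R sketch walks through the general idea on toy data. It is an illustration under stated assumptions, not the internal code of textProjection: the scores, words, embeddings, frequencies, and the helper aggregate_weighted are all hypothetical, and the exact quartile boundaries and projection steps used by the package may differ.

```r
# Illustrative sketch only; this is NOT the internal code of textProjection().
set.seed(1003)

# split: assign scores to a low (1) and a high (2) group.
x <- c(10, 12, 15, 18, 20, 22, 25, 30)   # hypothetical numeric scores
q <- quantile(x, probs = c(0.25, 0.75))

# Quartile split (assumed boundaries): keep only the lower and upper quartiles.
group_quartile <- ifelse(x <= q[[1]], 1, ifelse(x >= q[[2]], 2, NA))
# Mean split: every score falls into one of the two groups.
group_mean <- ifelse(x < mean(x), 1, 2)

# Hypothetical single-word embeddings (6 words, 3 dimensions).
words <- c("calm", "peace", "balance", "stress", "alone", "tired")
emb <- matrix(rnorm(length(words) * 3), nrow = length(words),
              dimnames = list(words, paste0("Dim", 1:3)))

# Hypothetical word frequencies in the low (1) and high (2) groups.
n_g1 <- c(0, 1, 0, 3, 2, 4)
n_g2 <- c(4, 3, 2, 0, 1, 0)

# Frequency-weighted mean embedding: each word's embedding is multiplied by
# its frequency raised to word_weight_power before averaging.
aggregate_weighted <- function(emb, freq, word_weight_power = 1) {
  w <- freq^word_weight_power
  colSums(emb * w) / sum(w)
}

g1 <- aggregate_weighted(emb, n_g1)   # aggregated embedding, group low
g2 <- aggregate_weighted(emb, n_g2)   # aggregated embedding, group high

# Simplified projection (an assumption): the direction from the low to the
# high group, and each word's dot product with that direction.
direction <- g2 - g1
dot <- drop(emb %*% direction)
```

With word_weight_power = 1 each word contributes in proportion to its raw frequency; a larger power shifts even more weight toward frequent words.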

## Value

A data frame with variables (e.g., Supervised Dimension Projection, frequencies, and p-values) for the individual words, which is used for plotting in the textProjectionPlot function.

## Examples

# Data
wordembeddings <- wordembeddings4
raw_data <- Language_based_assessment_data_8
# Pre-processing data for plotting
df_for_plotting <- textProjection(
  words = raw_data$harmonywords, wordembeddings = wordembeddings$harmonywords,
  single_wordembeddings = wordembeddings$singlewords_we, x = raw_data$hilstotal,
  split = "mean",
  Npermutations = 10,
  n_per_split = 1
)
df_for_plotting
#> $background
#> $background[[1]]
#> $background[[1]]$Aggregated_word_embedding_group1.x
#> # A tibble: 1 x 8
#>     Dim1   Dim2   Dim3   Dim4   Dim5   Dim6   Dim7  Dim8
#>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>
#> 1 -0.414 -0.209 -0.359 -0.221 0.0705 -0.147 -0.138 0.268
#>
#> $background[[1]]$Aggregated_word_embedding_group2.x
#> # A tibble: 1 x 8
#>     Dim1   Dim2   Dim3   Dim4   Dim5   Dim6   Dim7  Dim8
#>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>
#> 1 -0.378 -0.235 -0.197 -0.231 0.0142 -0.191 -0.119 0.293
#>
#> $background[[1]]$dot_null_distribution.x
#> # A tibble: 2 x 1
#>     value
#>     <dbl>
#> 1 -0.0941
#> 2 -0.0712
#>
#>
#>
#> $word_data
#> # A tibble: 298 x 8
#>    words      n    dot.x p_values_dot.x n_g1.x n_g2.x n.percent N_participant_r…
#>    <chr>  <dbl>    <dbl>          <dbl>  <dbl>  <dbl>     <dbl>            <int>
#>  1 Group…     0 -0.0582           0.333      0      0   0                     40
#>  2 Group…     0 -0.0241           0.333      0      0   0                     40
#>  3 proje…     0  0.0341           0.333      0      0   0                     40
#>  4 accep…     2 NA               NA         -1      1   0.00504               40
#>  5 agree…     1  0.0365           0.333      0      1   0.00252               40
#>  6 alcoh…     1 -0.0546           0.333     -1      0   0.00252               40
#>  7 amazed     1  0.0372           0.333      0      1   0.00252               40
#>  8 amica…     1  0.0641           0.333      0      1   0.00252               40
#>  9 amity      1  0.0271           0.333      0      1   0.00252               40
#> 10 amused     1 -0.00307          0.333      0      1   0.00252               40
#> # … with 288 more rows
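
The word_data element printed above can be handled like any other data frame. As a minimal sketch, the block below rebuilds three of the printed rows by hand (only the words shown untruncated) so that it runs on its own; in a real session one would presumably start from df_for_plotting$word_data instead. The threshold min_n is a hypothetical name used only for this illustration.

```r
# Rows copied from the printed word_data (only words shown untruncated).
word_data <- data.frame(
  words = c("amazed", "amity", "amused"),
  n = c(1, 1, 1),
  dot.x = c(0.0372, 0.0271, -0.00307),
  p_values_dot.x = c(0.333, 0.333, 0.333),
  stringsAsFactors = FALSE
)

# Keep words that occur at least min_n times (hypothetical threshold) and
# order them from the highest to the lowest projection value.
min_n <- 1
kept <- word_data[word_data$n >= min_n, ]
kept <- kept[order(kept$dot.x, decreasing = TRUE), ]
kept$words
```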