Test whether there is a significant difference in meaning between two sets of texts (i.e., between their word embeddings).

textSimilarityTest(
  x,
  y,
  similarity_method = "cosine",
  Npermutations = 10000,
  method = "paired",
  alternative = c("two_sided", "less", "greater"),
  output.permutations = TRUE,
  N_cluster_nodes = 1,
  seed = 1001
)
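
The test compares the observed similarity between the two sets of embeddings against a null distribution built by permutation. Below is a conceptual sketch of a paired permutation test on toy embedding matrices; the mean-vector aggregation, within-pair swapping, and p-value formula here are illustrative assumptions, not the package's internal implementation:

set.seed(1001)
# toy 5-dimensional "embeddings" for 20 paired texts
x_emb <- matrix(rnorm(100), nrow = 20)
y_emb <- matrix(rnorm(100), nrow = 20)

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# observed statistic: cosine between the aggregated (mean) embeddings
observed <- cosine(colMeans(x_emb), colMeans(y_emb))

# null distribution: randomly swap rows within pairs, then recompute
null_estimates <- replicate(1000, {
  swap <- runif(nrow(x_emb)) < 0.5
  x_perm <- x_emb
  y_perm <- y_emb
  x_perm[swap, ] <- y_emb[swap, ]
  y_perm[swap, ] <- x_emb[swap, ]
  cosine(colMeans(x_perm), colMeans(y_perm))
})

# two-sided permutation p-value
p_value <- mean(abs(null_estimates - mean(null_estimates)) >=
  abs(observed - mean(null_estimates)))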

Arguments

x

Set of word embeddings from textEmbed.

y

Set of word embeddings from textEmbed.

similarity_method

Character string specifying the similarity measure; the default is "cosine". The distance measures from textDistance() are also available, computed here as 1 - textDistance(): "euclidean", "maximum", "manhattan", "canberra", "binary" and "minkowski" (a sketch using "euclidean" follows this argument list).

Npermutations

Number of permutations (default 10000).

method

Compute a "paired" or an "unpaired" test.

alternative

Whether to use a two-sided or a one-sided test (select one of: "two_sided", "less", "greater").

output.permutations

If TRUE, the permuted similarity values are returned in the output.

N_cluster_nodes

Number of cluster nodes to use (more nodes make computation faster; see the parallel package).

seed

Random seed for reproducibility (default 1001).
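
The distance-based options of similarity_method are converted to similarities as 1 - textDistance(). A minimal sketch requesting "euclidean" instead of the default, assuming the word_embeddings_4 example data that ships with the package (as in the Examples below):

textSimilarityTest(
  word_embeddings_4$harmonywords,
  word_embeddings_4$satisfactionwords,
  similarity_method = "euclidean",
  Npermutations = 100,
  method = "paired",
  alternative = "two_sided"
)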

Value

A list with the p-value, the similarity score estimate, and, if output.permutations = TRUE, the permuted values from the null distribution.

Examples

# word_embeddings_4 contains example word embeddings shipped with the package
x <- word_embeddings_4$harmonywords
y <- word_embeddings_4$satisfactionwords
textSimilarityTest(x,
  y,
  method = "paired",
  Npermutations = 100,
  N_cluster_nodes = 1,
  alternative = "two_sided"
)
#> $random.estimates.4.null
#>   [1] 0.4983119 0.5576852 0.5302024 0.5523947 0.5192838 0.5069734 0.5426046
#>   [8] 0.5364955 0.5186254 0.5659261 0.5444499 0.5176275 0.5928729 0.5448588
#>  [15] 0.5665786 0.6010373 0.5209911 0.5354354 0.5387865 0.5636173 0.5109059
#>  [22] 0.5056216 0.5477581 0.5350840 0.5805850 0.5226741 0.5113583 0.5649385
#>  [29] 0.5452155 0.5408461 0.5535737 0.5185919 0.5333117 0.5490415 0.5398603
#>  [36] 0.5469504 0.5531048 0.5666717 0.5231832 0.5411863 0.5483832 0.5433614
#>  [43] 0.5440919 0.5554494 0.5259646 0.5156109 0.6094675 0.5571161 0.5698051
#>  [50] 0.5995209 0.5773271 0.5335415 0.5134631 0.5458661 0.5353202 0.5241088
#>  [57] 0.4621846 0.5288316 0.5209879 0.5505337 0.5361928 0.5515760 0.5260797
#>  [64] 0.5307437 0.5245074 0.5504686 0.5401110 0.5514473 0.5652146 0.5123559
#>  [71] 0.5927479 0.5063602 0.5396702 0.5426171 0.5244020 0.5392873 0.5670804
#>  [78] 0.5725231 0.4902566 0.5310120 0.5305490 0.5199153 0.5569171 0.5913419
#>  [85] 0.5827711 0.5493309 0.5374054 0.5789101 0.5505858 0.5584466 0.4953797
#>  [92] 0.4909416 0.5565927 0.5279803 0.5654465 0.5408370 0.4599622 0.5716092
#>  [99] 0.5198856 0.5441615
#> 
#> $embedding_x
#> [1] "x : Information about the embeddings. textEmbedLayersOutput:  model: bert-base-uncased ;  layers: 11 12 . Warnings from python:  Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']\n- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n\n textEmbedLayerAggregation: layers =  11 12 aggregate_layers =  concatenate aggregate_tokens =  mean tokens_select =   tokens_deselect =   "
#> 
#> $embedding_y
#> [1] "y : Information about the embeddings. textEmbedLayersOutput:  model: bert-base-uncased ;  layers: 11 12 . Warnings from python:  Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']\n- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n\n textEmbedLayerAggregation: layers =  11 12 aggregate_layers =  concatenate aggregate_tokens =  mean tokens_select =   tokens_deselect =   "
#> 
#> $test_description
#> [1] "permutations =  100 similarity_method =  cosine method =  paired alternative =  two_sided"
#> 
#> $time_date
#> [1] "Duration to run the test: 0.729598 secs; Date created: 2022-07-22 15:09:01"
#> 
#> $cosine_estimate
#> [1] 0.6069307
#> 
#> $p.value
#> [1] 0.02
#>
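
The components shown above can be pulled from the returned list directly. A brief usage sketch, reusing x and y from the example (component names follow the output above; hist() is base R):

res <- textSimilarityTest(x, y,
  method = "paired",
  Npermutations = 100,
  alternative = "two_sided"
)
res$p.value          # permutation p-value
res$cosine_estimate  # observed similarity between the two sets
hist(res$random.estimates.4.null,
  main = "Permutation null distribution",
  xlab = "Cosine similarity"
)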