Test whether there is a significant difference in meaning between two sets of texts (i.e., between their word embeddings).

textSimilarityTest(
  x,
  y,
  similarity_method = "cosine",
  Npermutations = 10000,
  method = "paired",
  alternative = c("two_sided", "less", "greater"),
  output.permutations = TRUE,
  N_cluster_nodes = 1,
  seed = 1001
)

Arguments

x

Set of word embeddings from textEmbed.

y

Set of word embeddings from textEmbed.

similarity_method

Character string describing the type of measure to be computed; the default is "cosine". The distance measures from textDistance — "euclidean", "maximum", "manhattan", "canberra", "binary" and "minkowski" — can also be used, and are here computed as 1 - textDistance().
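
The relationship to textDistance described above can be sketched as follows (a sketch, not the package internals; it assumes x and y are embedding sets from textEmbed and that textDistance accepts a method argument as documented in the text package):

```r
# Sketch (assumption, not the package's internal code): a distance-based
# similarity such as "euclidean" is the complement of the distance.
distance   <- textDistance(x, y, method = "euclidean")
similarity <- 1 - distance
```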

Npermutations

Number of permutations (default 10000).

method

Compute a "paired" or an "unpaired" test; "paired" is appropriate when x and y contain embeddings from the same cases (e.g., the same participants).

alternative

Use a two-sided or a one-sided test (select one of: "two_sided", "less", "greater").

output.permutations

If TRUE, the permuted values are returned in the output.

N_cluster_nodes

Number of cluster nodes to use (more nodes make the computation faster; see the parallel package).

seed

Set a seed for reproducibility (default 1001).

Value

A list with the p-value, the similarity score estimate and, if output.permutations = TRUE, the permuted values.
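
Using the list components named in the example output below (random.estimates.4.null and cosine_estimate), the p-value can be re-derived by hand; this is a sketch of one common two-sided permutation definition, which may differ slightly from the package's exact computation:

```r
# Sketch (assumed, not the package's exact rule): compare the observed
# estimate against the permuted null distribution.
res  <- textSimilarityTest(x, y, Npermutations = 1000)
null <- res$random.estimates.4.null
obs  <- res$cosine_estimate
# Two-sided: proportion of permuted values at least as extreme as observed.
p <- mean(abs(null - mean(null)) >= abs(obs - mean(null)))
```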

Examples

x <- word_embeddings_4$harmonywords
y <- word_embeddings_4$satisfactionwords
textSimilarityTest(
  x,
  y,
  method = "paired",
  Npermutations = 100,
  N_cluster_nodes = 1,
  alternative = "two_sided"
)
#> $random.estimates.4.null
#> [1] 0.4983119 0.5576852 0.5302024 0.5523947 0.5192838 0.5069734 0.5426046
#> [8] 0.5364955 0.5186254 0.5659261 0.5444499 0.5176275 0.5928729 0.5448588
#> [15] 0.5665786 0.6010373 0.5209911 0.5354354 0.5387865 0.5636173 0.5109059
#> [22] 0.5056216 0.5477581 0.5350840 0.5805850 0.5226741 0.5113583 0.5649385
#> [29] 0.5452155 0.5408461 0.5535737 0.5185919 0.5333117 0.5490415 0.5398603
#> [36] 0.5469504 0.5531048 0.5666717 0.5231832 0.5411863 0.5483832 0.5433614
#> [43] 0.5440919 0.5554494 0.5259646 0.5156109 0.6094675 0.5571161 0.5698051
#> [50] 0.5995209 0.5773271 0.5335415 0.5134631 0.5458661 0.5353202 0.5241088
#> [57] 0.4621846 0.5288316 0.5209879 0.5505337 0.5361928 0.5515760 0.5260797
#> [64] 0.5307437 0.5245074 0.5504686 0.5401110 0.5514473 0.5652146 0.5123559
#> [71] 0.5927479 0.5063602 0.5396702 0.5426171 0.5244020 0.5392873 0.5670804
#> [78] 0.5725231 0.4902566 0.5310120 0.5305490 0.5199153 0.5569171 0.5913419
#> [85] 0.5827711 0.5493309 0.5374054 0.5789101 0.5505858 0.5584466 0.4953797
#> [92] 0.4909416 0.5565927 0.5279803 0.5654465 0.5408370 0.4599622 0.5716092
#> [99] 0.5198856 0.5441615
#> 
#> $embedding_x
#> [1] "x : Information about the embeddings. textEmbedLayersOutput: model: bert-base-uncased ; layers: 11 12 . Warnings from python: Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']\n- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n\n textEmbedLayerAggregation: layers = 11 12 aggregate_layers = concatenate aggregate_tokens = mean tokens_select = tokens_deselect = "
#> 
#> $embedding_y
#> [1] "y : Information about the embeddings. textEmbedLayersOutput: model: bert-base-uncased ; layers: 11 12 . Warnings from python: Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']\n- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n\n textEmbedLayerAggregation: layers = 11 12 aggregate_layers = concatenate aggregate_tokens = mean tokens_select = tokens_deselect = "
#> 
#> $test_description
#> [1] "permutations = 100 similarity_method = cosine method = paired alternative = two_sided"
#> 
#> $time_date
#> [1] "Duration to run the test: 0.971579 secs; Date created: 2022-05-18 20:25:14"
#> 
#> $cosine_estimate
#> [1] 0.6069307
#> 
#> $p.value
#> [1] 0.02
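
The call can be varied over the other arguments documented above; for example, a sketch of an unpaired, one-sided test with a distance-based similarity measure (illustrative, not run; argument names as documented above):

```r
# Sketch: unpaired test, "greater" alternative, euclidean-based similarity
# (computed as 1 - textDistance()).
textSimilarityTest(
  x,
  y,
  similarity_method = "euclidean",
  method = "unpaired",
  Npermutations = 100,
  alternative = "greater"
)
```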