textPredict, textAssess and textClassify
Source:R/2_4_0_textPredict_Assess_Classify.R
textPredict.Rd
Trained models created by, e.g., textTrain() or stored on, e.g., GitHub or Hugging Face can be used to predict scores or classes from embeddings or text using one of these function aliases.
Usage
textPredict(
model_info = "valence_facebook_mxbai23_eijsbroek2024",
texts = NULL,
model_type = "detect",
lbam_update = TRUE,
word_embeddings = NULL,
x_append = NULL,
append_first = NULL,
dim_names = TRUE,
language_distribution = NULL,
language_distribution_min_words = "trained_distribution_min_words",
save_model = TRUE,
threshold = NULL,
show_texts = FALSE,
device = "cpu",
participant_id = NULL,
save_embeddings = TRUE,
save_dir = "wd",
save_name = "textPredict",
story_id = NULL,
dataset_to_merge_assessments = NULL,
previous_sentence = FALSE,
tokenizer_parallelism = FALSE,
logging_level = "error",
force_return_results = TRUE,
return_all_scores = FALSE,
function_to_apply = NULL,
set_seed = 202208,
...
)
textAssess(
model_info = "valence_facebook_mxbai23_eijsbroek2024",
texts = NULL,
model_type = "detect",
lbam_update = TRUE,
word_embeddings = NULL,
x_append = NULL,
append_first = NULL,
dim_names = TRUE,
language_distribution = NULL,
language_distribution_min_words = "trained_distribution_min_words",
save_model = TRUE,
threshold = NULL,
show_texts = FALSE,
device = "cpu",
participant_id = NULL,
save_embeddings = TRUE,
save_dir = "wd",
save_name = "textPredict",
story_id = NULL,
dataset_to_merge_assessments = NULL,
previous_sentence = FALSE,
tokenizer_parallelism = FALSE,
logging_level = "error",
force_return_results = TRUE,
return_all_scores = FALSE,
function_to_apply = NULL,
set_seed = 202208,
...
)
textClassify(
model_info = "valence_facebook_mxbai23_eijsbroek2024",
texts = NULL,
model_type = "detect",
lbam_update = TRUE,
word_embeddings = NULL,
x_append = NULL,
append_first = NULL,
dim_names = TRUE,
language_distribution = NULL,
language_distribution_min_words = "trained_distribution_min_words",
save_model = TRUE,
threshold = NULL,
show_texts = FALSE,
device = "cpu",
participant_id = NULL,
save_embeddings = TRUE,
save_dir = "wd",
save_name = "textPredict",
story_id = NULL,
dataset_to_merge_assessments = NULL,
previous_sentence = FALSE,
tokenizer_parallelism = FALSE,
logging_level = "error",
force_return_results = TRUE,
return_all_scores = FALSE,
function_to_apply = NULL,
set_seed = 202208,
...
)
Arguments
- model_info
(character or R object) model_info accepts five kinds of input: 1: An R model (e.g., saved output from one of the textTrain() functions). 2: The name specified in the L-BAM Documentation. For the following options, remember to also set the model_type parameter: 3: A link to a text-trained model online, either in a GitHub repo (e.g., "https://github.com/CarlViggo/pretrained_swls_model/raw/main/trained_github_model_logistic.RDS") or on OSF (e.g., "https://osf.io/8fp7v"). 4: The name of, or a link to, a fine-tuned model from Hugging Face (e.g., "distilbert-base-uncased-finetuned-sst-2-english"). 5: The path to a model stored locally (e.g., "path/to/your/model/model_name.rds").
- texts
(character) Text to predict. If this argument is specified, then arguments "word_embeddings" and "premade embeddings" cannot be defined (default = NULL).
- model_type
(character) Specify how the function should handle the model argument. The default is "detect", where the function tries to detect the model type automatically. Setting it to "fine-tuned" or "text-trained" will apply their respective default behaviors, while setting it to "implicit motives" will trigger specific steps tailored to these models.
- lbam_update
(boolean) Update the L-BAM file by automatically downloading it from Google Sheets.
- word_embeddings
(tibble; only for "text-trained"-model_type) Embeddings from e.g., textEmbed(). If you're using a pre-trained model, then texts and embeddings cannot be submitted simultaneously (default = NULL).
- x_append
(tibble; only for "text-trained"-model_type) Variables to be appended with the word embeddings (x).
- append_first
(boolean; only for "text-trained" models) If TRUE, x_append is added before the word embeddings.
- dim_names
(boolean; only for "text-trained"-models) Account for specific dimension names from textEmbed() (rather than generic names including Dim1, Dim2 etc.). If FALSE the models need to have been trained on word embeddings created with dim_names FALSE, so that embeddings were only called Dim1, Dim2 etc.
- language_distribution
(character column; only for "text-trained" models) If you provide the raw language data used for making the embeddings used for assessment, the language distribution (i.e., a word and frequency table) will be compared with the one saved in the model object (if one exists). This enables calculating similarity scores.
- language_distribution_min_words
(string or numeric; only for "text-trained" models) Default is to use the removal threshold used when creating the distribution in the training set ("trained_distribution_min_words"). You can set it yourself with a numeric value.
- save_model
(boolean; only for "text-trained"-models) The model will by default be saved in your work-directory (default = TRUE). If the model already exists in your work-directory, it will automatically be loaded from there.
- threshold
(numeric; only for "text-trained"-models) Determine threshold if you are using a logistic model (default = 0.5).
- show_texts
(boolean; only for "implicit-motives"-models) Show texts together with predictions (default = FALSE).
- device
Name of device to use: 'cpu', 'gpu', 'gpu:k' or 'mps'/'mps:k' for MacOS, where k is a specific device number such as 'mps:1'.
- participant_id
(list; only for "implicit-motives"-models) Vector of participant-ids. Specify this for getting person level scores (i.e., summed sentence probabilities to the person level corrected for word count). (default = NULL)
- save_embeddings
(boolean; only for "text-trained"-models) If set to TRUE, embeddings will be saved with a unique identifier, and will be automatically opened next time textPredict is run with the same text. (default = TRUE)
- save_dir
(character; only for "text-trained"-models) Directory to save embeddings. (default = "wd" (i.e, work-directory))
- save_name
(character; only for "text-trained"-models) Name of the saved embeddings (will be combined with a unique identifier). (default = ""). Note: If no save_name is provided, and model_info is a character, then save_name will be set to model_info.
- story_id
(vector; only for "implicit-motives"-models) Vector of story-ids. Specify this to get story level scores (i.e., summed sentence probabilities corrected for word count). When there is both story_id and participant_id indicated, the function returns a list including both story level and person level prediction corrected for word count. (default = NULL)
- dataset_to_merge_assessments
(R-object, tibble; only for "implicit-motives"-models) Insert your data here to merge the predictions into your dataset (default = NULL).
- previous_sentence
(boolean; only for "implicit-motives"-models) If set to TRUE, word-embeddings will be averaged over the current and previous sentence per story-id. For this, both participant-id and story-id must be specified.
- tokenizer_parallelism
(boolean; only for "fine-tuned"-models) If TRUE this will turn on tokenizer parallelism.
- logging_level
(string; only for "fine-tuned"-models) Set the logging level. Options (ordered from less logging to more logging): critical, error, warning, info, debug
- force_return_results
(boolean; only for "fine-tuned"-models) Stop returning some incorrectly formatted/structured results. This setting does NOT evaluate the actual results (whether or not they make sense, exist, etc.). All it does is ensure that the returned results are formatted correctly (e.g., does the question-answering dictionary contain the key "answer", do the sentiments from textClassify contain the labels "positive" and "negative").
- return_all_scores
(boolean; only for "fine-tuned"-models) Whether to return all prediction scores or only the score of the predicted class.
- function_to_apply
(string; only for "fine-tuned"-models) The function to apply to the model outputs to retrieve the scores.
- set_seed
(Integer; only for "fine-tuned" models) Set seed.
- ...
Settings from stats::predict can be passed.
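The interplay of x_append and append_first can be pictured as a column-binding step. The following is only a base-R sketch of that idea under stated assumptions, not the package's internal code: the helper combine_predictors() and the toy columns are hypothetical, and the real function works on tibbles such as those returned by textEmbed().

```r
# Hypothetical helper illustrating the column order implied by `append_first`.
# This is NOT the text package's implementation; it only mimics the idea.
combine_predictors <- function(embeddings, x_append = NULL, append_first = FALSE) {
  if (is.null(x_append)) {
    return(embeddings)
  }
  # Both inputs must describe the same rows (one row per text).
  stopifnot(nrow(embeddings) == nrow(x_append))
  if (isTRUE(append_first)) {
    cbind(x_append, embeddings) # extra variables first
  } else {
    cbind(embeddings, x_append) # extra variables last
  }
}

# Toy data: two texts, three generic embedding dimensions, one extra variable.
toy_embeddings <- data.frame(
  Dim1 = c(0.1, 0.4),
  Dim2 = c(0.2, 0.5),
  Dim3 = c(0.3, 0.6)
)
toy_append <- data.frame(age = c(31, 45))

first <- combine_predictors(toy_embeddings, toy_append, append_first = TRUE)
last <- combine_predictors(toy_embeddings, toy_append, append_first = FALSE)

names(first)
names(last)
```

Because a model is applied to predictors in a fixed layout, the column order at prediction time presumably needs to match the order used when the model was trained.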
See also
See textTrain, textTrainLists and textTrainRandomForest.
Examples
if (FALSE) { # \dontrun{
# Text data from Language_based_assessment_data_8
text_to_predict <- "I am not in harmony in my life as much as I would like to be."
# Example 1: (predict using pre-made embeddings and an R model-object)
prediction1 <- textPredict(
model_info = trained_model,
word_embeddings = word_embeddings_4$texts$satisfactiontexts
)
# Example 2: (predict using a pretrained github model)
prediction2 <- textPredict(
texts = text_to_predict,
model_info = "https://github.com/CarlViggo/pretrained-models/raw/main/trained_hils_model.RDS"
)
# Example 3: (predict using a pretrained logistic github model and return
# probabilities and classifications)
prediction3 <- textPredict(
texts = text_to_predict,
model_info = "https://github.com/CarlViggo/pretrained-models/raw/main/trained_github_model_logistic.RDS",
type = "class_prob",
threshold = 0.7
)
# Example 4: (predict from texts using a pretrained model stored in an osf project)
prediction4 <- textPredict(
texts = text_to_predict,
model_info = "https://osf.io/8fp7v"
)
##### Automatic implicit motive coding section ######
# Create example dataset
implicit_motive_data <- dplyr::mutate(.data = Language_based_assessment_data_8,
participant_id = dplyr::row_number())
# Code implicit motives.
implicit_motives <- textPredict(
texts = implicit_motive_data$satisfactiontexts,
model_info = "implicit_power_roberta_large_L23_v1",
participant_id = implicit_motive_data$participant_id,
dataset_to_merge_assessments = implicit_motive_data
)
# Examine results
implicit_motives$sentence_predictions
implicit_motives$person_predictions
} # }
if (FALSE) { # \dontrun{
# Examine the correlation between the predicted values and
# the Satisfaction with life scale score (pre-included in text).
psych::corr.test(
prediction1$word_embeddings__ypred,
Language_based_assessment_data_8$swlstotal
)
} # }
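As a rough illustration of what threshold does for a logistic model (as in Example 3), the sketch below turns predicted probabilities into class labels. It uses base R only and rests on assumptions: the probabilities and class labels are made up, the helper classify_by_threshold() is hypothetical, and this documentation does not state how the package treats a probability exactly equal to the threshold.

```r
# Hypothetical helper: map positive-class probabilities to class labels.
# This is NOT the text package's implementation.
classify_by_threshold <- function(prob_positive, threshold = 0.5,
                                  labels = c("negative", "positive")) {
  stopifnot(is.numeric(prob_positive), threshold >= 0, threshold <= 1)
  # Assumption: probabilities at or above the threshold count as positive.
  ifelse(prob_positive >= threshold, labels[2], labels[1])
}

# Made-up probabilities for three texts.
toy_probs <- c(0.35, 0.62, 0.81)

default_classes <- classify_by_threshold(toy_probs)
strict_classes <- classify_by_threshold(toy_probs, threshold = 0.7)

default_classes
strict_classes
```

Raising the threshold from 0.5 to 0.7 moves the middle text from the positive to the negative class, which is the kind of trade-off the threshold argument controls.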