Train word embeddings to a numeric variable.

textTrainRegression(
x,
y,
x_append = NULL,
cv_method = "validation_split",
outside_folds = 10,
outside_strata_y = "y",
outside_breaks = 4,
inside_folds = 3/4,
inside_strata_y = "y",
inside_breaks = 4,
model = "regression",
eval_measure = "default",
preprocess_step_center = TRUE,
preprocess_step_scale = TRUE,
preprocess_PCA = NA,
penalty = 10^seq(-16, 16),
mixture = c(0),
first_n_predictors = NA,
impute_missing = FALSE,
method_cor = "pearson",
model_description = "Consider writing a description of your model here",
multi_cores = "multi_cores_sys_default",
save_output = "all",
seed = 2020,
...
)

## Arguments

x

Word embeddings from textEmbed (or textEmbedLayerAggregation). If several word embeddings are provided in a list, they are concatenated.

y

Numeric variable to predict.

x_append

Variables to be appended after the word embeddings (x). To place them before the word embeddings instead, use the option first = TRUE. To train without word embeddings, set x = NULL.

cv_method

Cross-validation method to use within a pipeline of nested outer and inner loops of folds (see nested_cv in rsample). The default, "validation_split", uses cv_folds in the outer loop and rsample::validation_split in the inner loop to create a development and an assessment set (note that for "validation_split", inside_folds should be a proportion, e.g., inside_folds = 3/4). In contrast, "cv_folds" uses rsample::vfold_cv to create n folds in both the outer and inner loops.
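
The nested structure described for "validation_split" can be illustrated with a minimal base R sketch. It is not the package's implementation (which, per the description above, relies on rsample), it is unstratified, and the toy sample size n = 40 is an assumption chosen for illustration.

```r
set.seed(2020)

n <- 40                # assumed toy sample size
row_ids <- seq_len(n)
outside_folds <- 10    # number of outer folds
inside_folds <- 3 / 4  # proportion of outer training rows used for analysis

# Outer loop: assign each row to exactly one of the outer folds.
fold_id <- sample(rep(seq_len(outside_folds), length.out = n))

splits <- lapply(seq_len(outside_folds), function(k) {
  outer_train <- row_ids[fold_id != k]
  outer_test <- row_ids[fold_id == k]

  # Inner loop: a single split of the outer training rows into an
  # analysis set (for fitting) and an assessment set (for tuning).
  n_analysis <- floor(length(outer_train) * inside_folds)
  analysis <- sample(outer_train, n_analysis)
  assessment <- setdiff(outer_train, analysis)

  list(analysis = analysis, assessment = assessment, test = outer_test)
})

# Number of rows in each part of the first outer fold.
lengths(splits[[1]])
```

Hyperparameters would be chosen on the assessment rows, and the held-out test rows of each outer fold would only be used to evaluate the tuned model.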

outside_folds

Number of folds for the outer folds (default = 10).

outside_strata_y

Variable to stratify by in the outer loop (default "y"; can be set to NULL).

outside_breaks

The number of bins wanted to stratify a numeric stratification variable in the outer cross-validation loop.
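
As a rough illustration of what binning a numeric stratification variable means, the base R sketch below cuts a simulated y into quantile-based bins. The simulated data are an assumption, and the exact binning rules used by rsample may differ in detail.

```r
set.seed(2020)

y <- rnorm(200, mean = 20, sd = 7)  # simulated numeric outcome (assumption)
outside_breaks <- 4                 # number of bins requested

# Quantile edges that divide y into bins of roughly equal size.
edges <- quantile(y, probs = seq(0, 1, length.out = outside_breaks + 1))
strata <- cut(y, breaks = edges, include.lowest = TRUE)

# Observations per bin; folds are then sampled within each bin.
table(strata)
```

With small data sets the bins become too sparse to stratify on, which is what the warnings in the Examples output report.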

inside_folds

The proportion of data to be used for modeling/analysis (default = 3/4). For more information, see validation_split in rsample.

inside_strata_y

Variable to stratify by in the inner loop (default "y"; can be set to NULL).

inside_breaks

The number of bins wanted to stratify a numeric stratification variable in the inner cross-validation loop.

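model

Type of model to train (default "regression"). The eval_measure entry below also refers to a "logistic" model for classification.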

eval_measure

Type of evaluation measure used to select models. With "default", "rmse" is used for regression and "bal_accuracy" for logistic. For regression, use "rsq" or "rmse"; for classification, use "accuracy", "bal_accuracy", "sens", "spec", "precision", "kappa", "f_measure", or "roc_auc" (for more details, see the yardstick package).

preprocess_step_center

Normalizes dimensions to have a mean of zero; default is TRUE. For more information, see step_center in recipes.

preprocess_step_scale

Normalizes dimensions to have a standard deviation of one; default is TRUE. For more information, see step_scale in recipes.
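
The effect of centering and scaling can be checked with base R's scale(). This is only a sketch on simulated data with hypothetical column names (Dim1 to Dim3); in a recipes pipeline the means and standard deviations are estimated on the training data and then applied to new data, which this snippet does not show.

```r
set.seed(2020)

# Simulated stand-in for three embedding dimensions (assumption).
embeddings <- matrix(
  rnorm(50 * 3, mean = 5, sd = 2),
  ncol = 3,
  dimnames = list(NULL, c("Dim1", "Dim2", "Dim3"))
)

# center = TRUE subtracts each column mean; scale = TRUE divides by each column sd.
normalized <- scale(embeddings, center = TRUE, scale = TRUE)

round(colMeans(normalized), 10)  # column means after centering
apply(normalized, 2, sd)         # column standard deviations after scaling
```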

preprocess_PCA

Pre-processing threshold for PCA (to skip this step, set it to NA). Can be the amount of variance to retain (e.g., .90, or a grid such as c(0.80, 0.90)) or the number of components to select (e.g., 10). The setting "min_halving" is a function that selects the number of PCA components based on the number of participants and features (word embedding dimensions) in the data. The formula is: preprocess_PCA = round(max(min(number_features/2, number_participants/2), min(50, number_features))).
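
To make the "min_halving" rule concrete, here is a small worked sketch. It assumes the formula is read with balanced parentheses as round(max(min(number_features/2, number_participants/2), min(50, number_features))); the helper name min_halving_components is hypothetical and not part of the package.

```r
# Hypothetical helper implementing the formula as read above (an assumption).
min_halving_components <- function(number_features, number_participants) {
  round(max(
    min(number_features / 2, number_participants / 2),
    min(50, number_features)
  ))
}

# 768 embedding dimensions, 40 participants:
# min(384, 20) = 20; min(50, 768) = 50; max(20, 50) = 50
min_halving_components(768, 40)

# 768 embedding dimensions, 400 participants:
# min(384, 200) = 200; min(50, 768) = 50; max(200, 50) = 200
min_halving_components(768, 400)
```

Under this reading, the rule never selects fewer than 50 components unless the data have fewer than 50 features.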

penalty

Hyperparameter that is tuned (default = 10^seq(-16, 16)).

mixture

Hyperparameter that is tuned; default = 0 (hence a pure ridge regression).
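
The default values of penalty and mixture span the tuning grid sketched below in base R. How the package builds its grid internally is not shown here; this only illustrates how many candidate combinations the defaults imply.

```r
penalty <- 10^seq(-16, 16)  # default: 33 values from 1e-16 to 1e+16
mixture <- c(0)             # default: 0, i.e. pure ridge regression

# Every combination of the two hyperparameters.
grid <- expand.grid(penalty = penalty, mixture = mixture)

nrow(grid)           # number of candidate models per resample
range(grid$penalty)  # smallest and largest penalty
```

Supplying, for example, mixture = c(0, 0.5, 1) would triple the grid to 99 combinations, so run time grows with each added value.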

first_n_predictors

By default this setting is turned off (i.e., NA). To use this method, set it to the highest number of predictors you want to test. The first X dimensions are then used in training, where X follows the sequence from the Kjell et al. (2019) paper in Psychological Methods: start at 1, then repeatedly add 1, multiply by 1.3, and round to the nearest integer (e.g., 1, 3, 5, 8). This option is currently only possible for one embedding at a time.
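
The sequence described above (add 1, multiply by 1.3, round) can be generated with a few lines of base R. The helper name first_n_sequence is hypothetical, and whether the package also appends the maximum itself as a final step is not stated here, so this sketch simply stops at the last value that does not exceed the maximum.

```r
# Hypothetical helper: candidate numbers of leading dimensions to test.
first_n_sequence <- function(max_predictors) {
  stopifnot(is.numeric(max_predictors), max_predictors >= 1)
  steps <- 1
  repeat {
    # Add 1, multiply by 1.3, round to the nearest integer.
    nxt <- round((steps[length(steps)] + 1) * 1.3)
    if (nxt > max_predictors) break
    steps <- c(steps, nxt)
  }
  steps
}

first_n_sequence(20)
```

The steps grow roughly geometrically, so a large maximum is covered with relatively few candidate models.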

impute_missing

Default is FALSE (can be set to TRUE if variables other than word embeddings are used in training).

method_cor

Type of correlation used in evaluation (default "pearson"; can set to "spearman" or "kendall").
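
The three correlation types can be compared with base R's cor.test(), as in the sketch below. The data are simulated, and alternative = "greater" is an assumption about the direction of the one-sided test mentioned under Value.

```r
set.seed(2020)

# Simulated observed scores and noisy predictions (assumption).
observed <- rnorm(40)
predicted <- 0.6 * observed + rnorm(40, sd = 0.8)

# One-sided test of a positive association for each correlation type.
estimates <- sapply(c("pearson", "spearman", "kendall"), function(m) {
  test <- cor.test(predicted, observed, method = m, alternative = "greater")
  unname(test$estimate)
})

round(estimates, 3)
```

The rank-based "spearman" and "kendall" options are less sensitive to outliers and to non-linear but monotonic relationships than "pearson".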

model_description

Text to describe your model (optional; good when sharing the model with others).

multi_cores

If TRUE, enables the use of multiple cores if the computer system allows for it (i.e., only on Unix, not Windows), which makes the analyses considerably faster to run. The default, "multi_cores_sys_default", automatically uses TRUE for Mac and Linux and FALSE for Windows.

save_output

Option to save only part of the output; default is "all". See also "only_results" and "only_results_predictions".

seed

Set a different seed (default = 2020).

...

Additional settings; for example, settings in yardstick::accuracy such as event_level (e.g., event_level = "second").

## Value

A (one-sided) correlation test between predicted and observed values; a tibble of predicted values; and information about the model (preprossing_recipe, final_model and model_description).

## See also

textEmbedLayerAggregation, textTrainLists, textTrainRandomForest, textSimilarityTest

## Examples

# \donttest{
results <- textTrainRegression(
x = word_embeddings_4$harmonytext, y = Language_based_assessment_data_8$hilstotal,
multi_cores = FALSE # This is FALSE due to CRAN testing and Windows machines.
)
#> Warning: The number of observations in each quantile is below the recommended threshold of 20.
#> • Stratification will use 2 breaks instead.
#> Warning: The number of observations in each quantile is below the recommended threshold of 20.
#> • Stratification will use 1 breaks instead.
#> Warning: Too little data to stratify.
#> • Resampling will be unstratified.
#> Warning: The number of observations in each quantile is below the recommended threshold of 20.
#> • Stratification will use 1 breaks instead.
#> Warning: Too little data to stratify.
#> • Resampling will be unstratified.
#> Warning: The number of observations in each quantile is below the recommended threshold of 20.
#> • Stratification will use 1 breaks instead.
#> Warning: Too little data to stratify.
#> • Resampling will be unstratified.
#> Warning: The number of observations in each quantile is below the recommended threshold of 20.
#> • Stratification will use 1 breaks instead.
#> Warning: Too little data to stratify.
#> • Resampling will be unstratified.
#> Warning: The number of observations in each quantile is below the recommended threshold of 20.
#> • Stratification will use 1 breaks instead.
#> Warning: Too little data to stratify.
#> • Resampling will be unstratified.
#> Warning: The number of observations in each quantile is below the recommended threshold of 20.
#> • Stratification will use 1 breaks instead.
#> Warning: Too little data to stratify.
#> • Resampling will be unstratified.
#> Warning: The number of observations in each quantile is below the recommended threshold of 20.
#> • Stratification will use 1 breaks instead.
#> Warning: Too little data to stratify.
#> • Resampling will be unstratified.
#> Warning: The number of observations in each quantile is below the recommended threshold of 20.
#> • Stratification will use 1 breaks instead.
#> Warning: Too little data to stratify.
#> • Resampling will be unstratified.
#> Warning: The number of observations in each quantile is below the recommended threshold of 20.
#> • Stratification will use 1 breaks instead.
#> Warning: Too little data to stratify.
#> • Resampling will be unstratified.
#> Warning: The number of observations in each quantile is below the recommended threshold of 20.
#> • Stratification will use 1 breaks instead.
#> Warning: Too little data to stratify.
#> • Resampling will be unstratified.
#> Fold: rmse 7.171 (duration: 9.46 secs).
#> Fold: rmse 7.033 (duration: 9.31 secs).
#> Fold: rmse 7.643 (duration: 9.46 secs).
#> Fold: rmse 5.882 (duration: 8.84 secs).
#> Fold: rmse 8.595 (duration: 8.49 secs).
#> Fold: rmse 7.078 (duration: 8.89 secs).
#> Fold: rmse 8.562 (duration: 8.6 secs).
#> Fold: rmse 8.717 (duration: 9.14 secs).
#> Fold: rmse 8.48 (duration: 8.4 secs).
#> Fold: rmse 8.764 (duration: 8.86 secs).
#>
# }