R/2_5_textTrainN.R
textTrainN.Rd
(experimental) Compute cross-validated correlations for different sample sizes of a data set. The cross-validation process can be repeated several times to make the evaluation more reliable.
textTrainN(
x = word_embeddings_4$texts$harmonytext,
y = Language_based_assessment_data_8$hilstotal,
sample_percents = c(25, 50, 75, 100),
n_cross_val = 1,
seed = 2023
)
Word embeddings from textEmbed (or textEmbedLayerAggregation). If several word embeddings are provided in a list, they will be concatenated.
Numeric variable to predict.
(numeric) Numeric vector that specifies the percentages of the total number of data points to include in each sample (default = c(25, 50, 75, 100), i.e., correlations are evaluated for 25%, 50%, 75% and 100% of the datapoints). The datapoints in each sample are chosen randomly for each new sample.
(numeric) Value that determines the number of times to repeat the cross-validation (default = 1, i.e., cross-validation is only performed once). Warning: training time grows in proportion to the number of cross-validations, so n repetitions take roughly n times as long as a single one.
(numeric) Seed to set for the run (default = 2023).
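The interplay of sample_percents and seed can be pictured with a minimal base-R sketch. The helper `sample_by_percent` and the value of `n_total` are hypothetical and not part of the package, and textTrainN may select rows differently internally; the sketch only illustrates drawing a separate random subset for each percentage under a fixed seed.

```r
# Hypothetical helper (not part of the text package): draw a random subset
# of row indices covering `percent` percent of `n` data points.
sample_by_percent <- function(n, percent) {
  size <- round(n * percent / 100)
  sort(sample.int(n, size = size))
}

set.seed(2023)                         # a fixed seed makes the draws reproducible
n_total <- 40                          # assumed number of data points
sample_percents <- c(25, 50, 75, 100)

# One independent random draw per percentage
samples <- lapply(sample_percents, function(p) sample_by_percent(n_total, p))
names(samples) <- paste0(sample_percents, "%")

# Number of data points in each sample
lengths(samples)
```

Re-running the block with the same seed reproduces the same subsets; changing the seed changes which data points end up in each sample.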
A tibble containing correlations for each sample. If n_cross_val > 1, the correlations for each cross-validation, along with their standard deviation and mean, are included in the tibble. The information in the tibble can be visualised with the textTrainNPlot function.
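To make the shape of such a result concrete, the following sketch repeats a simple cross-validation on simulated data and summarises the correlations per sample size. Everything in it is an assumption for illustration: the data are simulated, a plain linear model (`lm`) stands in for the package's training routine, a new random sample is drawn for every repetition, and the column names (`percent`, `cv_1`, ..., `mean`, `sd`) are invented rather than taken from the tibble that textTrainN returns.

```r
set.seed(2023)

# Simulated stand-ins for embeddings (three predictors) and a numeric outcome
n <- 200
x <- matrix(rnorm(n * 3), ncol = 3)
y <- as.numeric(x %*% c(0.5, -0.3, 0.2) + rnorm(n))
dat <- data.frame(y = y, x1 = x[, 1], x2 = x[, 2], x3 = x[, 3])

# Correlation between observed values and out-of-fold predictions from
# k-fold cross-validation; lm() is an assumed stand-in for the real model.
cv_correlation <- function(d, k = 5) {
  fold <- sample(rep_len(seq_len(k), nrow(d)))
  pred <- numeric(nrow(d))
  for (i in seq_len(k)) {
    fit <- lm(y ~ ., data = d[fold != i, ])
    pred[fold == i] <- predict(fit, newdata = d[fold == i, ])
  }
  cor(d$y, pred)
}

sample_percents <- c(25, 50, 75, 100)
n_cross_val <- 3

# One row per percentage, one column per repetition
results <- t(sapply(sample_percents, function(p) {
  replicate(n_cross_val, {
    rows <- sample.int(n, size = round(n * p / 100))  # assumed: new sample per repetition
    cv_correlation(dat[rows, ])
  })
}))
colnames(results) <- paste0("cv_", seq_len(n_cross_val))

# Per-sample summary: individual correlations plus their mean and SD
summary_table <- data.frame(
  percent = sample_percents,
  results,
  mean = rowMeans(results),
  sd = apply(results, 1, sd)
)
summary_table
```

With n_cross_val set to 1 the standard deviation is undefined, which is consistent with the summary columns only being reported when the cross-validation is repeated.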
See textTrainNPlot.
# Compute correlations for 25%, 50%, 75% and 100% of the data in word_embeddings and perform
# cross-validation thrice.
if (FALSE) {
tibble_to_plot <- textTrainN(
x = word_embeddings_4$texts$harmonytext,
y = Language_based_assessment_data_8$hilstotal,
sample_percents = c(25, 50, 75, 100),
n_cross_val = 3
)
# tibble_to_plot contains correlation coefficients for each cross-validation and
# the standard deviation and mean value for each sample. The tibble can be plotted
# using the textTrainNPlot function.
# Examine tibble
tibble_to_plot
}