Skip to content

Compute descriptive statistics of character variables.

Usage

textDescriptives(
  words,
  compute_total = TRUE,
  entropy_unit = "log2",
  na.rm = TRUE,
  locale = "en_US"
)

Arguments

words

One or several character variables; if its a tibble or dataframe, all the character variables will be selected.

compute_total

Boolean. If the input (words) is a tibble/dataframe with several character variables, a total variable is computed.

entropy_unit

The unit entropy is measured in. The default is to used bits (i.e., log2; see also, "log", "log10"). If a total score for several varaibles is computed,the text columns are combined using the dplyr unite function. For more information about the entropy see the entropy package and specifically its entropy.plugin function.

na.rm

Option to remove NAs when computing mean, median etc (see under return).

locale

(character string) Locale Identifiers for example in US-English ('en_US') and Australian-English ('en_AU'); see help(about_locale) in the stringi package

Value

A tibble with descriptive statistics, including variable = the variable names of input "words"; w_total = total number of words in the variable; w_mean = mean number of words in each row of the variable; w_median = median number of words in each row of the variable; w_range_min = smallest number of words of all rows; w_range_max = largest number of words of all rows; w_sd = the standard deviation of the number of words of all rows; unique_tokens = the unique number of tokens (using the word_tokenize function from python package nltk) n_token = number of tokens in the variable (using the word_tokenize function from python package nltk) entropy = the entropy of the variable. It is computed as the Shannon entropy H of a discrete random variable from the specified bin frequencies. (see library entropy and specifically the entropy.plugin function)

See also

Examples

if (FALSE) { # \dontrun{
textDescriptives(Language_based_assessment_data_8[1:2])
} # }

GitHub