textTopics creates and trains a BERTopic model (based on the bertopic Python package) on a text variable in a tibble/data.frame. (EXPERIMENTAL)
textTopics(
data,
variable_name,
embedding_model = "distilroberta",
umap_model = "default",
hdbscan_model = "default",
vectorizer_model = "default",
representation_model = "mmr",
num_top_words = 10,
n_gram_range = c(1, 3),
stopwords = "english",
min_df = 5,
bm25_weighting = FALSE,
reduce_frequent_words = TRUE,
set_seed = 8,
save_dir
)
data: (tibble/data.frame) A tibble with a text variable to be analysed, plus optional numeric/categorical variables that can be used in later analyses testing the significance of topics in relation to these variables.
variable_name: (string) Name of the text variable in the data tibble to perform topic modeling on.
embedding_model: (string) Name of the embedding model to use, such as "miniLM", "mpnet", "multi-mpnet", or "distilroberta".
umap_model: (string) The dimension reduction algorithm; currently only "default" is supported.
hdbscan_model: (string) The clustering algorithm to use; currently only "default" is supported.
vectorizer_model: (string) Name of the vectorizer model; currently only "default" is supported.
representation_model: (string) Name of the representation model used for topics, such as "keybert" or "mmr".
num_top_words: (integer) The number of top words presented for each topic.
n_gram_range: (vector) Two-element vector giving the n-gram range used by the vectorizer model.
stopwords: (string) Name of the stopword dictionary to use.
min_df: (integer) The minimum document frequency of terms.
bm25_weighting: (boolean) Whether BM25 weighting is used in ClassTfidfTransformer.
reduce_frequent_words: (boolean) Whether frequent words are reduced by ClassTfidfTransformer.
set_seed: (integer) The random seed for initialization of the UMAP model.
save_dir: (string) The directory for saving results.
A folder containing the model, the data, a folder with terms and values for each topic,
and the document-topic matrix. The model itself is also returned, formatted as a data.frame,
together with metadata.
See textTopicsReduce, textTopicsTest, and textTopicsWordcloud.
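
The following is a minimal sketch of how the arguments in the usage signature might be supplied; it is not an official example. The data frame `docs`, its columns `response` and `age`, the `topic_args` list, the `run_model` switch, and the temporary `save_dir` path are all hypothetical names chosen for illustration. The toy corpus is synthetic and only exists so that the call has enough documents to satisfy `min_df`; real analyses would use actual free-text responses.

```r
# Hypothetical toy corpus: 120 short documents built from three themes.
themes <- c(
  "I enjoy hiking in the mountains with friends",
  "My work deadlines make me feel stressed",
  "Cooking dinner for my family relaxes me"
)
docs <- data.frame(
  response = paste(rep(themes, times = 40), seq_len(120)),
  age = rep(c(25, 40, 60), length.out = 120),  # optional variable for later tests
  stringsAsFactors = FALSE
)

# Arguments mirror the usage signature; save_dir is an assumed location.
topic_args <- list(
  data = docs,
  variable_name = "response",
  embedding_model = "distilroberta",
  representation_model = "mmr",
  num_top_words = 10,
  n_gram_range = c(1, 3),
  min_df = 5,
  set_seed = 8,
  save_dir = file.path(tempdir(), "bertopic_results")
)

# Set to TRUE once the text package and its Python backend are installed.
run_model <- FALSE
if (run_model && requireNamespace("text", quietly = TRUE)) {
  model <- do.call(text::textTopics, topic_args)
}
```

Keeping the arguments in a list makes it easy to inspect or vary settings such as `representation_model` before the (potentially slow) model fit is triggered.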