textTopics() trains a BERTopic model (via the bertopic Python package) on a
text variable in a tibble/data.frame. The function embeds documents, reduces
dimensionality (UMAP), clusters documents (HDBSCAN), and extracts topic representations
using c-TF-IDF with optional KeyBERT/MMR-based representation. (EXPERIMENTAL)
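The class-based TF-IDF scoring mentioned above can be illustrated with a small pure-Python sketch. This is a simplified version of the c-TF-IDF idea (score a term by its within-topic frequency times log(1 + A / f_t), where A is the average number of words per topic "class document" and f_t is the term's total frequency across classes); the actual implementation used by BERTopic lives in its ClassTfidfTransformer, and the toy corpus below is purely illustrative:

```python
import math
from collections import Counter

def c_tf_idf(docs_per_topic):
    """Simplified c-TF-IDF: all documents in a topic are joined into one
    'class document'; terms are scored by within-class frequency times an
    IDF-like factor computed across classes."""
    # Term frequencies per class (topic)
    class_tf = [Counter(" ".join(docs).split()) for docs in docs_per_topic.values()]
    # f_t: total frequency of each term across all classes
    total_freq = Counter()
    for tf in class_tf:
        total_freq.update(tf)
    # A: average number of words per class document
    avg_words = sum(sum(tf.values()) for tf in class_tf) / len(class_tf)
    scores = []
    for tf in class_tf:
        n_words = sum(tf.values())
        scores.append({
            term: (count / n_words) * math.log(1 + avg_words / total_freq[term])
            for term, count in tf.items()
        })
    return dict(zip(docs_per_topic.keys(), scores))

# Hypothetical two-topic corpus
topics = {
    0: ["dog barks loud", "dog runs fast"],
    1: ["cat sleeps all day", "cat purrs"],
}
scores = c_tf_idf(topics)
top_words = sorted(scores[0], key=scores[0].get, reverse=True)
```

Terms that are frequent within one topic but rare elsewhere receive the highest scores, which is what makes the top terms per topic distinctive.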
Usage
textTopics(
data,
variable_name,
embedding_model = "distilroberta",
representation_model = c("mmr", "keybert"),
umap_n_neighbors = 15L,
umap_n_components = 5L,
umap_min_dist = 0,
umap_metric = "cosine",
hdbscan_min_cluster_size = 5L,
hdbscan_min_samples = NULL,
hdbscan_metric = "euclidean",
hdbscan_cluster_selection_method = "eom",
hdbscan_prediction_data = TRUE,
num_top_words = 10L,
n_gram_range = c(1L, 3L),
stopwords = "english",
min_df = 5L,
bm25_weighting = FALSE,
reduce_frequent_words = TRUE,
set_seed = 8L,
save_dir
)

Arguments
- data
A tibble/data.frame containing a text variable to analyze and, optionally, additional numeric/categorical variables that can be used in later analyses (e.g., testing topic prevalence differences across groups).
- variable_name
A character string giving the name of the text variable in data to perform topic modeling on.
- embedding_model
A character string specifying which embedding model to use. Common options include "miniLM", "mpnet", "multi-mpnet", and "distilroberta". The choice affects topic quality, speed, and memory usage.
- representation_model
A character string specifying the topic representation method. Must be one of "mmr" or "keybert". "keybert" uses embedding similarity to select representative words/phrases; "mmr" (Maximal Marginal Relevance) promotes diversity among selected terms.
- umap_n_neighbors
Integer. Number of neighbors used by UMAP to balance local versus global structure. Smaller values emphasize local clusters; larger values emphasize global structure.
- umap_n_components
Integer. Number of dimensions to reduce to with UMAP (the embedding space used for clustering).
- umap_min_dist
Numeric. Minimum distance between embedded points in UMAP. Smaller values typically yield tighter clusters.
- umap_metric
Character string specifying the distance metric used by UMAP, e.g. "cosine".
- hdbscan_min_cluster_size
Integer. The minimum cluster size for HDBSCAN. Larger values yield fewer, broader topics; smaller values yield more, finer-grained topics.
- hdbscan_min_samples
Integer or NULL. Controls how conservative the clustering is. If NULL, HDBSCAN chooses a default.
- hdbscan_metric
Character string specifying the metric used by HDBSCAN, typically "euclidean" when clustering in the reduced UMAP space.
- hdbscan_cluster_selection_method
Character string specifying the cluster selection strategy: either "eom" (excess of mass; often yields more stable clusters) or "leaf" (can yield more fine-grained clusters).
- hdbscan_prediction_data
Logical. If TRUE, stores additional information enabling approximate topic prediction for new documents (when supported by the underlying pipeline).
- num_top_words
Integer. Number of top terms to return per topic.
- n_gram_range
Integer vector of length 2 giving the minimum and maximum n-gram lengths used by the vectorizer (e.g., c(1L, 3L)).
- stopwords
Character string naming the stopword dictionary to use (e.g., "english").
- min_df
Integer. Minimum document frequency for terms included in the vectorizer.
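The effect of a document-frequency cutoff can be shown with a short sketch (the same idea as the min_df parameter of scikit-learn's CountVectorizer, which BERTopic's vectorizer builds on; the corpus and threshold below are illustrative):

```python
from collections import Counter

def vocabulary(docs, min_df=2):
    """Keep only terms appearing in at least min_df distinct documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # count each term at most once per document
    return sorted(term for term, n in df.items() if n >= min_df)

docs = ["the cat sat", "the dog sat", "a dog ran"]
vocab = vocabulary(docs, min_df=2)
```

Raising min_df trims rare (often noisy) terms from the topic vocabulary; lowering it to 1 keeps every term.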
- bm25_weighting
Logical. If TRUE, uses BM25 weighting in the class-based TF-IDF transformer (can improve term weighting in some corpora).
- reduce_frequent_words
Logical. If TRUE, down-weights very frequent words using the class-based TF-IDF transformer.
- set_seed
Integer. Random seed used to initialize UMAP (and other stochastic components) for reproducibility.
- save_dir
Character string specifying the directory where outputs should be saved. A folder will be created (or reused) to store the fitted model and derived outputs.
Value
A named list containing:
- train_data
The training data used to fit the model (or loaded from disk if available).
- preds
A document-by-topic matrix of normalized topic mixtures (LDA-like). Rows typically sum to 1; rows of zeros can occur if no topic mass was assigned.
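The shape of preds can be sketched as row-normalized topic mass, with all-zero rows preserved for documents that received no topic mass (the raw values below are hypothetical; the real matrix comes from BERTopic's document-topic distributions):

```python
def normalize_rows(matrix):
    """Row-normalize a document-by-topic matrix; leave all-zero rows as zeros."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([v / total for v in row] if total > 0 else list(row))
    return out

raw = [
    [0.2, 0.6, 0.2],  # mixed document, already sums to 1
    [0.0, 0.0, 0.0],  # no topic mass assigned: stays a zero row
    [3.0, 1.0, 0.0],  # unnormalized scores, rescaled to sum to 1
]
preds = normalize_rows(raw)
```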
- doc_info
Document-level outputs including hard topic labels (-1 indicates outliers).
- topic_info
Topic-level outputs including topic sizes and top terms.
- model
The fitted BERTopic model object (Python-backed).
- model_type
Model identifier (currently "bert_topic").
- seed
Random seed used.
- save_dir
Directory where artifacts were saved.
Details
Typical tuning levers:
- More topics / finer clusters: decrease hdbscan_min_cluster_size, decrease umap_n_neighbors, and/or increase umap_n_components.
- Fewer topics / broader clusters: increase hdbscan_min_cluster_size and/or increase umap_n_neighbors.
- More phrase-like terms: increase the n_gram_range maximum (e.g., up to 3).
- Cleaner vocabulary: increase min_df and use reduce_frequent_words = TRUE.

