Tokenize text according to different Hugging Face transformers.
textTokenize(
texts,
model = "bert-base-uncased",
max_token_to_sentence = 4,
device = "cpu",
tokenizer_parallelism = FALSE,
model_max_length = NULL,
logging_level = "error"
)
Arguments:

texts  A character variable or a tibble/dataframe with at least one character variable.

model  Character string specifying the pre-trained language model (default "bert-base-uncased"). For the full list of options, see the pretrained models at Hugging Face. Examples: "bert-base-multilingual-cased", "openai-gpt", "gpt2", "ctrl", "transfo-xl-wt103", "xlnet-base-cased", "xlm-mlm-enfr-1024", "distilbert-base-cased", "roberta-base", "xlm-roberta-base".

max_token_to_sentence  (numeric) Maximum number of tokens in a string to handle before switching to embedding the text sentence by sentence (default 4).

device  Name of the device to use: "cpu", "gpu", "gpu:k", or "mps"/"mps:k" for macOS, where k is a specific device number.

tokenizer_parallelism  If TRUE, turns on tokenizer parallelism (default FALSE).

model_max_length  The maximum length (in number of tokens) of the inputs to the transformer model (default: the value stored for the associated model).

logging_level  Sets the logging level (default "error"). Options, ordered from least to most logging: critical, error, warning, info, debug.
Value:

Returns tokens according to the specified Hugging Face transformer.
See Also: textEmbed
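As a usage sketch (assuming the text package and its Python backend are installed, e.g. via textrpp_install() and textrpp_initialize()), tokenization returns the model's subword tokens rather than plain words; BERT-style models also add special tokens around the sequence:

```r
library(text)

# Tokenize a short string with the default model.
# The tokenizer is downloaded from Hugging Face on first use.
tokens <- textTokenize("hello are you?",
                       model = "bert-base-uncased")

# The result is a tibble of tokens; for BERT-style models this
# typically includes special tokens such as [CLS] and [SEP].
tokens
```

Note that different models use different subword schemes (WordPiece for BERT, byte-level BPE for GPT-2, and so on), so token counts for the same input vary across models.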
Examples:

tokens <- textTokenize("hello are you?")