textTokenize() tokenizes texts according to different Hugging Face transformer tokenizers.
textTokenize(
texts,
model = "bert-base-uncased",
max_token_to_sentence = 4,
device = "cpu",
tokenizer_parallelism = FALSE,
model_max_length = NULL,
hg_gated = FALSE,
hg_token = Sys.getenv("HUGGINGFACE_TOKEN", unset = ""),
trust_remote_code = FALSE,
logging_level = "error"
)
texts: A character variable or a tibble/dataframe with at least one character variable.
model: Character string specifying the pre-trained language model (default "bert-base-uncased"). For the full list of options, see the pretrained models at Hugging Face. For example, use "bert-base-multilingual-cased", "openai-gpt", "gpt2", "ctrl", "transfo-xl-wt103", "xlnet-base-cased", "xlm-mlm-enfr-1024", "distilbert-base-cased", "roberta-base", or "xlm-roberta-base".
max_token_to_sentence: (numeric) Maximum number of tokens in a string to handle before switching to embedding the text sentence by sentence.
device: Name of the device to use: "cpu", "gpu", "gpu:k", or "mps"/"mps:k" for macOS, where k is a specific device number.
tokenizer_parallelism: If TRUE, turns on tokenizer parallelism. Default FALSE.
model_max_length: The maximum length (in number of tokens) for the inputs to the transformer model (default: the value stored for the associated model).
hg_gated: Set to TRUE if the accessed model is gated.
hg_token: The token needed to access the gated model. Create a token from the ['Settings' page](https://huggingface.co/settings/tokens) of the Hugging Face website. An environment variable HUGGINGFACE_TOKEN can be set to avoid the need to enter the token each time.
trust_remote_code: Set to TRUE to allow a model with custom code on the Hugging Face Hub to run. Default FALSE.
logging_level: Set the logging level. Default: "error". Options (ordered from less logging to more logging): critical, error, warning, info, debug.
Returns tokens according to the specified Hugging Face transformer.
See also: textEmbed().
\donttest{
tokens <- textTokenize("hello are you?")
}
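A minimal usage sketch, assuming the text package and its Python backend are installed (model weights for the named tokenizer are downloaded on first use):

```r
# Load the text package (assumed installed and initialized).
library(text)

# Tokenize two strings with the default BERT tokenizer on the CPU.
tokens <- textTokenize(
  c("hello are you?", "I am fine."),
  model = "bert-base-uncased",
  device = "cpu"
)

# For a gated model, set hg_gated = TRUE; the access token is read from the
# HUGGINGFACE_TOKEN environment variable via the hg_token default.
```

The returned object holds the tokens produced by the specified tokenizer (for BERT-style models this includes subword pieces and special tokens such as [CLS] and [SEP]).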