Task Adapted Pre-Training (EXPERIMENTAL - under development)

textFineTuneTask(
  text_outcome_data,
  model_name_or_path = "bert-base-uncased",
  output_dir = "./runs",
  validation_proportion = 0.1,
  evaluation_proportion = 0.1,
  is_regression = TRUE,
  config_name = NULL,
  tokenizer_name = NULL,
  max_seq_length = 128L,
  evaluation_strategy = "epoch",
  eval_accumulation_steps = NULL,
  num_train_epochs = 3,
  past_index = -1,
  set_seed = 2022,
  label_names = NULL,
  pytorch_mps_high_watermark_ratio = FALSE,
  tokenizer_parallelism = FALSE,
  ...
)

Arguments

text_outcome_data

A dataframe where the first column contains text data and the second column contains the variable to be predicted (numeric or categorical); see the sketch at the end of this Arguments section.

model_name_or_path

(String) Path to a foundation/pretrained model or a model identifier from huggingface.co/models.

output_dir

(string) Path to the output directory.

validation_proportion

(Numeric) Proportion of the text_outcome_data to be used for validation.

evaluation_proportion

(Numeric) Proportion of the text_outcome_data to be used for evaluation.

is_regression

(Boolean) TRUE for regression tasks, FALSE for classification.

config_name

(String) Pretrained config name or path if not the same as model_name.

tokenizer_name

(String) Pretrained tokenizer name or path if not the same as model_name.

max_seq_length

(Numeric) The maximum total input sequence length after tokenization. Sequences longer than this will be truncated; shorter sequences will be padded.

evaluation_strategy

(String or IntervalStrategy) The evaluation strategy to adopt during training. Possible values are: "no" (no evaluation is done during training), "steps" (evaluation is done and logged every eval_steps), and "epoch" (evaluation is done at the end of each epoch).

eval_accumulation_steps

(Integer) Number of prediction steps to accumulate the output tensors for before moving the results to the CPU. If left unset, all predictions are accumulated on the GPU/TPU before being moved to the CPU (faster but requires more memory).

num_train_epochs

(Numeric) Total number of training epochs to perform (if not an integer, the decimal part determines the fraction of the last epoch performed before stopping training).

past_index

(Numeric, defaults to -1) Some models like TransformerXL or XLNet can make use of the past hidden states for their predictions. If this argument is set to a positive int, the Trainer will use the corresponding output (usually index 2) as the past state and feed it to the model at the next training step under the keyword argument mems.

set_seed

(Numeric) Set the random seed (for reproducibility).

label_names

(Character vector) Label names in case of classification; e.g., label_names = c("female", "male").

pytorch_mps_high_watermark_ratio

(Boolean) Set to TRUE to address the error "RuntimeError: MPS backend out of memory. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure)." If you adjust this setting, closely monitor your system's resource usage to ensure it does not become unstable.

tokenizer_parallelism

(Boolean) If TRUE, turns on tokenizer parallelism. Default FALSE.

...

Parameters related to the fine-tuning, which can be seen in the text-package file inst/python/args2.json.
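
For reference, a minimal sketch of the expected text_outcome_data format (the column names and values below are hypothetical):

text_outcome_data <- data.frame(
  text = c("I feel calm and content today", "This has been a very difficult week"),
  outcome = c(7.5, 2.0)
)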

Value

A folder containing the fine-tuned model and output data. The model can then be used, for example, by textEmbed() by providing the model parameter with the path to the output folder, as in the sketch below.
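
For example, a sketch of reusing the output (assuming the model was saved to the default output_dir, "./runs"):

library(text)
embeddings <- textEmbed(
  texts = c("A new text to represent with the fine-tuned model"),
  model = "./runs" # path to the folder produced by textFineTuneTask()
)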

Details

For information about additional parameters, see inst/python/args2.json (https://github.com/OscarKjell/text/tree/master/inst/python/args2.json). Descriptions of the settings can be found in inst/python/task_finetune.py under "class ModelArguments" and "class DataTrainingArguments", as well as online at https://huggingface.co/docs/transformers/main_classes/trainer.

See also

Examples

if (FALSE) { # \dontrun{
textFineTuneTask(text_outcome_data)
} # }
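
A fuller call specifying some common options might look like this (a sketch; text_outcome_data and the output directory are placeholders):

if (FALSE) { # \dontrun{
textFineTuneTask(
  text_outcome_data,
  model_name_or_path = "bert-base-uncased",
  output_dir = "./runs",
  is_regression = TRUE,
  num_train_epochs = 3,
  max_seq_length = 128L,
  evaluation_strategy = "epoch"
)
} # }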