Fine-tuning LLMs

MOSTLY AI supports integration with LLMs hosted on Hugging Face to generate privacy-safe synthetic text. If you need to use another LLM, contact MOSTLY AI Support.

When included in a tabular dataset, synthetic text data is intelligently aligned with the rest of the data, ensuring high correlation while safeguarding the privacy of real-world individuals.

Train a generator to fine-tune an LLM

Prerequisites

  • Hardware. The minimum recommended requirements to fine-tune an LLM with MOSTLY AI are:
    • GPU: Nvidia A10g or better (for example, A100, H100, RTX 4090)
    • GPU VRAM: at least 24GB
  • Dataset with unstructured text in a supported format. The dataset you use for fine-tuning must be stored in CSV, Parquet, or TSV format, where one or more of the columns contains unstructured text.
  • Size of the training dataset. Depending on the size of the dataset and the compute resources you use for training, generator training can fail with out-of-memory errors. For more information, see Troubleshooting. A quick pre-flight check of your dataset and GPU is sketched after this list.
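
Before uploading, it can help to confirm that a suitable GPU is visible and to inspect the length of the text column. The following is a minimal pre-flight check in Python, assuming pandas and PyTorch are installed; the file and column names are placeholders.

```python
import pandas as pd
import torch

# Placeholder file and column names -- replace with your own.
SOURCE_FILE = "support_tickets.csv"
TEXT_COLUMN = "ticket_text"

# Confirm that a CUDA-capable GPU with sufficient VRAM is visible.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No GPU detected. Fine-tuning an LLM on CPU is likely to produce _INVALID_ values.")

# Inspect the distribution of text lengths before uploading.
df = pd.read_csv(SOURCE_FILE)
print(df[TEXT_COLUMN].str.len().describe())

# Parquet preserves column types and is one of the supported formats.
df.to_parquet("support_tickets.parquet", index=False)
```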

Steps

  1. Create a new generator and upload a table file or add a table from a data source that contains unstructured text.

  2. On the Data configuration page, for the columns containing unstructured text, make sure that Language/Text is selected as Encoding type.

  3. Click Configure models in the upper right.

  4. On the Model configuration page, expand the language model and configure it.

    • For Model, select one of the available language models for training. To add to the list of available models, see Models.

      📑

      The list of available models includes Hugging Face text-generation models and the MOSTLY AI LSTM model, which is not pre-trained.

    • For Compute, select an available compute. We recommend using GPUs for language models. To add to the list of available compute resources, see Computes.

      📑

      From Compute, you select from a list of compute resources configured for MOSTLY AI. The compute is based on the resources available in the compute cluster where MOSTLY AI is running.

        • CPU-based computes offer a specific number of CPU cores and memory. They perform better for tabular data computations.
        • GPU-based computes include a specific number of GPU cores and GPU memory. For language model fine-tuning and language generation, it is typically faster to use GPUs.

  5. Click Start training in the upper right. If you prefer to script this workflow, see the SDK sketch after these steps.
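
The same workflow can also be scripted. The following is a hedged sketch using the MOSTLY AI Python SDK; the configuration keys, encoding-type values, model name, and file names are illustrative assumptions, so verify them against the Python SDK reference for your version.

```python
from mostlyai.sdk import MostlyAI

# Connect to your MOSTLY AI instance (URL and API key are placeholders).
mostly = MostlyAI(base_url="https://app.mostly.ai", api_key="YOUR_API_KEY")

# Assumed configuration shape: one table with an unstructured-text column
# encoded as language text and fine-tuned on a Hugging Face model.
# NOTE: key names and encoding-type values are assumptions -- check the SDK reference.
config = {
    "name": "Support tickets with synthetic text",
    "tables": [
        {
            "name": "tickets",
            "data": "support_tickets.parquet",
            "columns": [
                {"name": "category", "model_encoding_type": "TABULAR_CATEGORICAL"},
                {"name": "ticket_text", "model_encoding_type": "LANGUAGE_TEXT"},
            ],
            "language_model_configuration": {
                "model": "microsoft/phi-1_5",  # any model from the available list
                "max_training_time": 10,       # minutes
            },
        }
    ],
}

# Start training and wait for fine-tuning to finish.
g = mostly.train(config=config, wait=True)
```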

Generate text

  1. Open the trained generator and click Generate data in the upper right.
  2. Configure the generation.
    1. For Tabular compute, select the compute to use for tabular data. It is usually best to use a CPU-based compute.
    2. For Language compute, select the compute to use for language data. GPUs tend to be faster for language generation.
    3. (Optional) If needed, adjust the rest of the generation options. For more information about these parameters, see Synthetic data.
  3. Click Start generation in the upper right. A scripted equivalent is sketched below.
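
Generation can be scripted as well. This continues the SDK sketch above (the client `mostly` and the trained generator `g`); the method and argument names follow the SDK's documented shape but should be verified against the Python SDK reference.

```python
# Generate a synthetic dataset from the trained generator.
# NOTE: argument names are assumptions -- check the SDK reference.
sd = mostly.generate(generator=g, size=10_000)

# Retrieve the synthetic data as a pandas DataFrame and inspect the text column.
df_syn = sd.data()
print(df_syn["ticket_text"].head())
```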

Troubleshooting

The model samples of your trained generator might include _INVALID_ text values, or generator training might fail for particularly long texts. This section explains your troubleshooting options.

_INVALID_ values

If you encounter _INVALID_ values in your model samples or generated synthetic text data, this is likely due to the use of the less efficient CPU compute for fine-tuning, insufficient data, or insufficient training time.
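
To gauge how widespread the issue is, you can count the affected rows in the generated output. A minimal check, assuming a pandas DataFrame `df_syn` with a hypothetical text column named `ticket_text`:

```python
# Share of generated texts that came back as _INVALID_.
invalid_share = (df_syn["ticket_text"] == "_INVALID_").mean()
print(f"{invalid_share:.1%} of generated texts are _INVALID_")
```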

Use GPUs for language model fine-tuning

GPUs are necessary for LLM fine-tuning. If you fine-tuned on a CPU compute, train a new generator with a GPU compute. To further reduce the risk of _INVALID_ values, apply the best practices listed below.

Increase Max training time

If you are already using a GPU-enabled compute and still see _INVALID_ values, the next step is to train a new generator with an increased Max training time.

  • Start by training a new generator with Max training time increased to 20 min. The default is 10 min.
  • If you still see _INVALID_ values, increase Max training time to 30 min.

For details, see Increase max training time.

Use a training dataset with shorter texts

Using a dataset with shorter texts, or creating one by trimming the original texts, can also help to avoid _INVALID_ values.
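
A minimal trimming sketch in Python, assuming pandas and placeholder file and column names:

```python
import pandas as pd

# Placeholder column name and character limit -- adjust to your data.
TEXT_COLUMN = "ticket_text"
MAX_CHARS = 1000

df = pd.read_parquet("support_tickets.parquet")

# Trim overly long texts so fine-tuning fits into GPU memory
# and finishes within the configured training time.
df[TEXT_COLUMN] = df[TEXT_COLUMN].str.slice(0, MAX_CHARS)

df.to_parquet("support_tickets_trimmed.parquet", index=False)
```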

Generator training failures

Depending on the LLM you use, the length of your texts, the size of your dataset, and the compute resources available for fine-tuning, generator training might fail with out-of-memory errors. To troubleshoot, try the suggestions below in the order listed.

  1. Set Batch size to 2 or 4; a configuration sketch follows this list. For details on how to set batch size, see Increase batch size.
  2. Use a training dataset with shorter texts.
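
If you script training with the SDK sketch from earlier, the reduced batch size can be set in the language model configuration. The key name is an assumption; consult the Python SDK reference for the exact schema.

```python
# Continuing the training sketch above; `config` and `mostly` are defined there.
# NOTE: "batch_size" is an assumed key name -- check the SDK reference.
config["tables"][0]["language_model_configuration"]["batch_size"] = 2  # try 2 or 4
g = mostly.train(config=config, wait=True)
```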