Quick start: Generators

Generators are models trained on your data that create synthetic data based on your requirements. For more information about how generators are trained and the models offered by MOSTLY AI, see Generators. For instructions on using your own large language model, see Fine-tuning LLMs.

Follow these instructions to create a new generator. You can then transfer the generator to an organization or make it public so that others can use it to create synthetic data.

Step 1: Train a generator

On the MOSTLY AI platform, open Generators from the left-side navigation menu.
There are four ways to create a new generator:

Method	Description
Start from a connector	Use an existing connector to train a new generator.
Upload your data	Provide a CSV, Parquet, or TSV file from your local file system to train a new generator.
Use the SDK	Navigate to the Synthetic Data SDK repository.
Import a generator	Upload a configured generator file.

After selecting your training method and uploading any required files, click Configure models.
Each connected or uploaded table supports its own configuration. Expand each table description to customize model behavior.

Setting	Description
Model	The model your generator uses to create synthetic data.
Compute	The compute resources used to train the generator.
Training parameters	The model-level parameters which control the training process. Each parameter is defined by a tooltip in the platform.
Differential privacy	Use differential privacy when you need a mathematical guarantee of privacy, with epsilon quantifying the upper bound on an individual’s impact on the trained model.
Flexible generation	Enabled by default. Flexible generation lets you apply smart imputation, data rebalancing, seeded generation, and fairness when you generate synthetic datasets with the model.
Value protection	Value protection prevents membership inference by replacing rare categories and removing extreme values from your dataset.
Model report	Enabled by default. The Model report provides metrics and charts to gauge the quality of a model. The metrics cover accuracy, similarity, and distances between original and synthetic samples. The charts compare correlations and univariate and bivariate distributions between the original and synthetic data.
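
To make the Value protection setting more concrete, the following is a minimal illustration of the general idea: rare categories are replaced and extreme numeric values are limited before a model sees them. It is not MOSTLY AI's implementation. The function names, the `min_count` threshold, and the `_RARE_` placeholder are all hypothetical, and this sketch clips extreme values to the nearest remaining value rather than removing them.

```python
from collections import Counter


def protect_categories(values, min_count=5, placeholder="_RARE_"):
    """Replace categories seen fewer than min_count times with a placeholder."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else placeholder for v in values]


def clip_extremes(values, n_extreme=1):
    """Clip the n_extreme smallest and largest values to the nearest remaining value."""
    ordered = sorted(values)
    if len(ordered) <= 2 * n_extreme:
        # Too few values to clip meaningfully; return them unchanged.
        return list(values)
    low, high = ordered[n_extreme], ordered[-n_extreme - 1]
    return [min(max(v, low), high) for v in values]


# Hypothetical sample data: one city appears only once, and two incomes are outliers.
cities = ["Vienna"] * 6 + ["Graz"] * 5 + ["Hallstatt"]
incomes = [10, 42_000, 45_000, 47_000, 51_000, 9_000_000]

protected_cities = protect_categories(cities)
protected_incomes = clip_extremes(incomes)
```

In this sketch, the single "Hallstatt" entry can no longer be traced back to one individual, and the two outlying incomes no longer stand out in the training data.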

MOSTLY AI offers three training Presets in the Model configuration section header if you don’t want to configure individual parameters: Accuracy, Speed, and Turbo.

In the Model configuration section header, you can optionally set Random State, a seed value that makes training results reproducible. If left empty, a random seed is used each time.
After completing configuration, click Start training to begin the training process.
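
The effect of a fixed Random State can be illustrated with Python's standard `random` module. This is a general sketch of seeded pseudo-randomness, not the platform's training code, and `shuffled_rows` is a hypothetical helper.

```python
import random


def shuffled_rows(n_rows, random_state=None):
    """Return row indices in a pseudo-random order.

    With random_state=None the generator is seeded from system entropy,
    so the order differs between runs. With a fixed integer it repeats.
    """
    rng = random.Random(random_state)
    rows = list(range(n_rows))
    rng.shuffle(rows)
    return rows


# Two runs with the same seed produce the same order.
first = shuffled_rows(10, random_state=42)
second = shuffled_rows(10, random_state=42)
print(first == second)
```

The same principle applies to training: any step that relies on random numbers, such as shuffling or weight initialization, repeats exactly only when the seed is fixed.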

Follow progress in the Training status section on the generator page.

Alternatively, to train a generator with the Assistant, start a new chat by clicking New chat in the left-side navigation menu.
Prompt the Assistant to connect to a configured dataset or upload a dataset file into the Assistant workspace.

Connect to the Berka dataset and briefly describe this resource.

Prompt the Assistant to create a generator with the defined resource.

Configure a generator that will produce data which follows the statistical patterns of the least active accounts in the dataset.

To train a generator with the SDK instead, install the MOSTLY AI Synthetic Data SDK.

You can install and use the SDK in Local or Client mode.
- In Local mode, you use the SDK with the compute resources on your local machine (or any Python environment) to train generators and create synthetic datasets.
- In Client mode, you connect to a remote MOSTLY AI Platform instance and use its available compute resources. For details, see Local and Client modes.
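
One way to switch between the two modes is to derive the constructor arguments from the environment. The sketch below assumes that the `MostlyAI` constructor accepts `base_url` and `api_key` for Client mode (only `local=True` is confirmed by this page). The helper `sdk_kwargs`, the environment variable names, and the default URL are conventions chosen for this example. Check them against the SDK reference before relying on them.

```python
import os


def sdk_kwargs(env=None):
    """Return constructor keyword arguments for Local or Client mode.

    Client mode is chosen when an API key is present (assumed variable
    name: MOSTLY_API_KEY). Otherwise the sketch falls back to Local mode.
    """
    env = os.environ if env is None else env
    api_key = env.get("MOSTLY_API_KEY")
    if api_key:
        return {
            # Assumed default URL; replace with your own platform instance.
            "base_url": env.get("MOSTLY_BASE_URL", "https://app.mostly.ai"),
            "api_key": api_key,
        }
    return {"local": True}


# Usage (requires the SDK to be installed):
# from mostlyai.sdk import MostlyAI
# mostly = MostlyAI(**sdk_kwargs())
```

Keeping the decision in one helper means the rest of a script stays identical in both modes.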
Create your first generator using the US Census Income dataset, start its training, and wait for it to finish.

python

# 1. Load the original data
import pandas as pd
df = pd.read_csv("https://docs.mostly.ai/public-datasets/us-census-income.csv.gz")

# 2. Instantiate the SDK in Local mode
from mostlyai.sdk import MostlyAI
mostly = MostlyAI(local=True)

# 3. Create a generator, start its training, and wait for it to finish
g = mostly.train(data=df, start=True, wait=True)
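
The per-table settings described for the platform (training parameters, value protection, flexible generation, model report) can, as far as we understand the SDK, also be passed as a configuration dictionary instead of a bare DataFrame. The sketch below only builds such a dictionary. The key names (`tabular_model_configuration`, `max_training_time`, `value_protection`, `enable_flexible_generation`, `enable_model_report`) and the `config=` argument of `mostly.train` are assumptions to verify against the SDK reference, and `build_generator_config` is a hypothetical helper.

```python
def build_generator_config(data, name="US Census Income",
                           max_training_minutes=10, value_protection=True):
    """Build a generator configuration for a single table.

    All key names are assumptions about the SDK's configuration schema.
    """
    return {
        "name": name,
        "tables": [
            {
                "name": "census",
                "data": data,  # the DataFrame to train on
                "tabular_model_configuration": {
                    "max_training_time": max_training_minutes,  # assumed unit: minutes
                    "value_protection": value_protection,
                    "enable_flexible_generation": True,
                    "enable_model_report": True,
                },
            }
        ],
    }


# None is a placeholder so that the sketch runs without downloading data;
# pass the DataFrame (df) in practice.
config = build_generator_config(data=None)

# Usage (requires the SDK and the objects created in the example above):
# g = mostly.train(config=build_generator_config(df), start=True, wait=True)
```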

What’s next

After the generator training completes, you can generate synthetic data. You can also transfer the generator to an organization so that data consumers from your organization can use it to generate synthetic data.

Step 2: Share the generator with your organization

Open the generator from the Generators page.
Click Share in the upper-right corner.
Use the Owner dropdown to select the organization to share the resource with, then click Save.

The generator is now available to all members of your organization, who can use it to generate synthetic data.

What’s next

You can make the generator public so that it is available to all logged-in users in the Platform.

Step 3: Make the generator public

Generators capture only the statistical characteristics of the original data and they never memorize data points. It is perfectly safe to make a generator public.

Open the generator from the Generators page.
Click Share.
Select Public from the Visibility dropdown and click Save.

Result

Public generators of organizations are listed on the organization’s profile page.

They are also available on the Generators page for all logged-in users.

What’s next

To generate synthetic data, see Quickstart: Synthetic data.
