Synthetic datasets

To generate synthetic data in MOSTLY AI, you create a new synthetic dataset. The Synthetic datasets page lists all of your synthetic datasets, whether finished, canceled, failed, or in progress.

What is a synthetic dataset?

A synthetic dataset contains the generated single- or multi-table data along with several additional artifacts:

  • Generated synthetic data (available to download in CSV, Parquet, or XLSX format)
  • Usage statistics
    • Generated data points
    • Credits used
  • Data insights
    • Generator quality - Overall, Univariate, Bivariate, Coherence
    • Distances
    • Model report for the quality of the generator
    • Data report for the quality of the synthetic dataset
  • Data samples - 10 samples from the generated data (you can resample as needed)
  • Configuration
    • JSON dictionary of the synthetic dataset configuration
    • Synthetic Data SDK code to access the synthetic data via Python or Jupyter Notebook

Create a synthetic dataset

For more information, see Generate single- and multi-table synthetic datasets.

Configure a synthetic dataset

  • Select a compute environment
  • Set sample size and temperature
  • Rebalance columns
  • Impute data
  • Generate fair synthetic data
  • Use a seed dataset for conditional simulation
  • Evaluate quality
  • Deliver to databases and cloud buckets
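
To make the idea of a JSON configuration dictionary more concrete, the following is a minimal sketch of how two of the options above (sample size and temperature) might be collected into a dictionary and serialized to JSON. The helper `build_config`, the placeholder generator ID, the table name, and every key name in the dictionary are assumptions made for illustration only; they are not taken from this page, so check the JSON shown under Configuration for the actual schema.

```python
import json


def build_config(generator_id, table_name, sample_size, temperature=1.0):
    """Assemble a hypothetical synthetic dataset configuration.

    All key names below are illustrative assumptions, not a documented schema.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be a positive integer")
    if temperature <= 0:
        raise ValueError("temperature must be greater than zero")
    return {
        "name": f"Synthetic {table_name}",   # assumed key
        "generator_id": generator_id,        # assumed key
        "tables": [                          # assumed key: one entry per table
            {
                "name": table_name,
                "configuration": {
                    "sample_size": sample_size,            # number of generated records
                    "sampling_temperature": temperature,   # assumed key for temperature
                },
            }
        ],
    }


# Placeholder values; replace them with your own generator ID and table name.
config = build_config("<generator-id>", "customers", sample_size=1000)
config_json = json.dumps(config, indent=2)
print(config_json)
```

Keeping the configuration as a plain dictionary makes it easy to store alongside the generated data and to compare between runs.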