How does the synthetic data generation tool work?

Overview

The Synthetic Data Generator allows you to create custom training datasets by automatically generating examples based on your goals and seed inputs. This tool is especially useful if you don’t have enough real-world data or if you want to expand your dataset quickly while still maintaining control over structure and quality.

The generation process has four main steps:

- Define your goals
- Configure your dataset
- Provide seed examples
- Generate your dataset

How to Use

Step 1: Define your goals

Start by clearly stating what you are trying to accomplish with your dataset.
Be specific about what type of behavior you want your model to learn, and how the examples should look.
This information is shared with the underlying AI models that collaborate to generate synthetic examples tailored to your needs.

Step 2: Configure your dataset

Set the high-level parameters: how many examples to generate and any restrictions to apply.
Add a description of the dataset’s purpose. This serves two functions:
- Helps you remember why you created the dataset.
- Provides context to others if you choose to share it in the Minibase marketplace.
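
The settings in this step can be pictured as a small record. The sketch below is illustrative only: `DatasetConfig`, its field names, and the example restrictions are hypothetical and are not part of any Minibase API. It simply captures the three things this step asks for (example count, restrictions, description) and rejects obviously incomplete values.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetConfig:
    """Hypothetical container for the Step 2 settings; not a Minibase API."""
    num_examples: int
    description: str
    restrictions: List[str] = field(default_factory=list)

    def validate(self) -> None:
        # A dataset needs a positive size and a non-empty purpose statement.
        if self.num_examples <= 0:
            raise ValueError("num_examples must be positive")
        if not self.description.strip():
            raise ValueError("description should explain the dataset's purpose")


# Example values are assumptions chosen for illustration.
config = DatasetConfig(
    num_examples=10_000,
    description="Sentiment labels for short product reviews.",
    restrictions=["English only", "no personal data"],
)
config.validate()  # raises ValueError if a setting is missing or invalid
```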

Step 3: Provide seed examples

This is the most important part of the process. You’ll provide 5–20 seed examples that guide the generator. For the best results:

- Use realistic Instructions and Inputs that reflect what your model will see in production.
- Write each Response to show exactly the output you want for that Input.
- Aim for 20 examples if possible: the more seeds, the better.
- If you’re building a single-task model, reuse the same Instruction across all examples for clarity and consistency.
- Vary your Inputs if the real-world data will be varied, mimicking the diversity you expect in production.

For training size, we recommend generating at least 10,000 examples as a starting point. You can increase this number later if needed.
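
As a rough sketch of the guidelines above, a seed set can be written down and sanity-checked locally before it is entered in the tool. The keys `instruction`, `input`, and `response` mirror the Instruction, Input, and Response fields described in this step; the dictionary layout, the `check_seeds` helper, and the sentiment examples are assumptions made for illustration, not a format the generator is known to require.

```python
from typing import Dict, List

Seed = Dict[str, str]

REQUIRED_FIELDS = ("instruction", "input", "response")
MIN_SEEDS, MAX_SEEDS = 5, 20  # range recommended in Step 3


def check_seeds(seeds: List[Seed], single_task: bool = True) -> List[str]:
    """Return a list of human-readable problems; an empty list means the seeds pass."""
    problems = []
    if not MIN_SEEDS <= len(seeds) <= MAX_SEEDS:
        problems.append(f"expected {MIN_SEEDS}-{MAX_SEEDS} seeds, got {len(seeds)}")
    for i, seed in enumerate(seeds):
        for name in REQUIRED_FIELDS:
            if not seed.get(name, "").strip():
                problems.append(f"seed {i}: missing or empty '{name}'")
    if single_task:
        # A single-task model should reuse one Instruction across all seeds.
        instructions = {s.get("instruction", "").strip() for s in seeds}
        if len(instructions) > 1:
            problems.append(
                f"single-task model but {len(instructions)} distinct instructions"
            )
    return problems


# Hypothetical seeds for a single-task sentiment classifier.
seeds = [
    {"instruction": "Classify the sentiment of the review.",
     "input": text, "response": label}
    for text, label in [
        ("The battery lasts all day.", "positive"),
        ("Stopped working after a week.", "negative"),
        ("It does what the box says.", "neutral"),
        ("Fantastic screen, terrible speakers.", "mixed"),
        ("Would not buy again.", "negative"),
    ]
]

print(check_seeds(seeds))  # prints []
```

Passing `single_task=False` skips the shared-Instruction check for models that are meant to handle several tasks.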

Step 4: Generate your dataset

After providing goals, configuration, and seeds, run the generator to create the synthetic dataset.
Once generation completes, the dataset is available in your account for training, fine-tuning, or sharing in the marketplace.

Tips & Best Practices

- Clarity is key: The clearer your goals and seeds, the higher the quality of generated examples.
- Match production reality: Make seed examples resemble real-world inputs as closely as possible.
- Use consistent Instructions: If your model has one primary task, reuse the same Instruction rather than rewording it in each seed.
- Scale thoughtfully: Start with ~10k examples and expand once you see results.

Troubleshooting

My generated dataset doesn't look right.

Check your seed examples. If they’re inconsistent or unclear, the generated data will be too. Refine your seeds for better results.
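
One concrete kind of inconsistency is the same Instruction and Input appearing with two different Responses. The sketch below flags such pairs. It assumes seeds are held as dictionaries with `instruction`, `input`, and `response` keys, which is an illustrative layout rather than a documented format, and `find_conflicts` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find_conflicts(seeds: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (instruction, input) pairs that appear with more than one response."""
    responses = defaultdict(set)
    for seed in seeds:
        key = (seed["instruction"].strip(), seed["input"].strip())
        responses[key].add(seed["response"].strip())
    return [key for key, seen in responses.items() if len(seen) > 1]


# Hypothetical seeds: the first and third disagree on the same Input.
seeds = [
    {"instruction": "Classify the sentiment.", "input": "It works.", "response": "positive"},
    {"instruction": "Classify the sentiment.", "input": "It broke.", "response": "negative"},
    {"instruction": "Classify the sentiment.", "input": "It works.", "response": "neutral"},
]

print(find_conflicts(seeds))  # prints [('Classify the sentiment.', 'It works.')]
```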

My generated dataset doesn't have much variety.

Increase the variety of your Inputs in the seed examples. The generator mirrors the diversity it sees.
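
To get a quick read on how varied your seed Inputs are before regenerating, a few simple statistics can help. The sketch below is a heuristic, not something the generator itself is known to compute: `input_diversity` and its metric names are hypothetical, and what counts as a "low" ratio is a judgment call for your own data.

```python
from typing import Dict, List


def input_diversity(inputs: List[str]) -> Dict[str, float]:
    """Summarize how varied a list of seed Inputs is."""
    if not inputs:
        raise ValueError("inputs must not be empty")
    # Lowercase and collapse whitespace so trivial variants count as duplicates.
    normalized = [" ".join(text.lower().split()) for text in inputs]
    tokens = [tok for text in normalized for tok in text.split()]
    lengths = [len(text.split()) for text in normalized]
    return {
        "unique_input_ratio": len(set(normalized)) / len(normalized),
        "distinct_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
        "min_words": min(lengths),
        "max_words": max(lengths),
    }


# These three Inputs differ only in case and spacing, so variety is low.
stats = input_diversity(["Great phone.", "great  phone.", "GREAT PHONE."])
print(stats)
```

Ratios close to 1.0 and a wide gap between `min_words` and `max_words` suggest more varied seeds; values near 0 suggest the Inputs are near-duplicates of each other.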

The generated dataset is too small.

Increase the number of examples in the configuration step. Aim for 10k+ examples for robust training.