
How many examples should I use for fine-tuning my model?

The size of your dataset matters for the quality and performance of your model. Find out how many examples you should use for training.

Written by Michael McCarty
Updated over a week ago

Overview

When training a custom model on Minibase.ai, the number of training examples you provide has a direct impact on performance, accuracy, and reliability. While it is technically possible to train with very small datasets, results will be limited. For most applications, you should aim for a dataset size that gives your model enough coverage to generalize properly.

At a high level:

  • 500 examples is the absolute minimum dataset size. Training on fewer examples typically leads to underfitting and poor results.

  • 3,000 examples is the recommended target for most common applications, balancing effort to collect data with solid, reliable performance.

  • 15,000 examples or more is ideal if you need optimal behavior and the highest-quality results. Large, high-quality datasets are especially important for complex or nuanced tasks and for applications with critical reliability requirements.


How to Use

Step 1: Decide on your project requirements

  • If your use case is experimental or just for proof-of-concept, you can start with around 500–1,000 examples to quickly test feasibility.

  • If your use case is production-oriented and you need consistent accuracy, target 3,000 examples.

  • If your use case has high performance requirements (e.g., customer-facing AI, specialized workflows, or regulated industries), aim for 15,000 or more examples.

Step 2: Prepare your dataset

  • Ensure you follow the Minibase schema (instruction, input, response) with required fields filled in.

  • Organize your dataset into a clean format (CSV, Excel, JSON, or JSONL); a JSONL example is sketched after this list.

  • Remove duplicate or low-quality examples, as they reduce training effectiveness.
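Below is a minimal Python sketch of what a prepared dataset can look like. The instruction, input, and response field names come from the schema above; the example records, the dataset.jsonl filename, and the assumption that all three fields should be non-empty are illustrative, so check the Minibase dataset documentation for the exact required fields.

```python
import json

# Illustrative raw examples; in practice these come from your own data source.
raw_examples = [
    {
        "instruction": "Summarize the support ticket in one sentence.",
        "input": "Customer reports the export button does nothing on Safari.",
        "response": "The export button is unresponsive for a customer using Safari.",
    },
    # ... more examples ...
]

# Assumed here: every field should be present and non-empty.
REQUIRED_FIELDS = ("instruction", "input", "response")

with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for example in raw_examples:
        # Drop records with a missing or empty required field.
        if any(not str(example.get(field, "")).strip() for field in REQUIRED_FIELDS):
            continue
        # JSONL format: one JSON object per line.
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```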

Step 3: Upload and train

  • Upload your dataset through the dataset management interface.

  • Select your training target and confirm the dataset meets the minimum threshold of 500 examples (a quick count check is sketched after this list).

  • Start training and monitor progress.
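If you want to confirm the threshold locally before uploading, a simple line count is enough. This sketch assumes the JSONL file from Step 2; the file name and messages are illustrative, and the 500 and 3,000 figures are the guidelines from this article, not limits enforced by the code.

```python
# Count examples in the prepared JSONL file and compare against the
# dataset-size guidelines from this article.
MIN_EXAMPLES = 500
RECOMMENDED_EXAMPLES = 3_000

with open("dataset.jsonl", encoding="utf-8") as f:
    count = sum(1 for line in f if line.strip())

if count < MIN_EXAMPLES:
    print(f"{count} examples: below the {MIN_EXAMPLES} minimum, expect underfitting.")
elif count < RECOMMENDED_EXAMPLES:
    print(f"{count} examples: enough to start, but aim for {RECOMMENDED_EXAMPLES}+ for production use.")
else:
    print(f"{count} examples: meets or exceeds the recommended target.")
```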


Tips & Best Practices

  • Quality matters as much as quantity: A clean, diverse dataset with fewer examples will often outperform a large but messy dataset.

  • Balance complexity: If your model will handle varied or nuanced tasks, you should lean toward the higher dataset sizes.

  • Iterate: Start at a smaller scale to validate your approach, then grow your dataset incrementally to improve performance.

  • Use consistent formatting: Make sure instructions, inputs, and responses follow the same style and conventions across all examples; a simple consistency check is sketched below.
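As a rough starting point for the quality and consistency checks above, the sketch below flags exact duplicates and a couple of common formatting inconsistencies. It assumes the instruction/input/response JSONL file from Step 2; what counts as "consistent" depends on your own conventions, so treat these checks as examples rather than a complete audit.

```python
import json
from collections import Counter

# Load the prepared JSONL dataset from Step 2.
with open("dataset.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

# Exact duplicates (same instruction and input) add little new signal
# and can bias the model toward those cases.
pair_counts = Counter((ex["instruction"], ex["input"]) for ex in examples)
duplicate_pairs = sum(1 for n in pair_counts.values() if n > 1)
print(f"{duplicate_pairs} duplicated instruction/input pairs out of {len(examples)} examples")

# Two example consistency checks: terminal punctuation on instructions
# and stray whitespace around responses.
unpunctuated = sum(1 for ex in examples if not ex["instruction"].rstrip().endswith((".", "?", "!", ":")))
untrimmed = sum(1 for ex in examples if ex["response"] != ex["response"].strip())
print(f"{unpunctuated} instructions without terminal punctuation")
print(f"{untrimmed} responses with leading/trailing whitespace")
```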


Troubleshooting

Q: Can I train with fewer than 500 examples?

A: Training is possible, but strongly discouraged. Models trained on fewer than 500 examples rarely produce usable results.

Q: What if I don't have 3,000 examples yet?

A: You can start training with fewer examples (500–1,000) to test, then add more examples over time and retrain.

Q: My model is underperforming even with 3,000 examples. Why?

A: Check dataset quality. If examples are inconsistent, mislabeled, or not representative of your intended use case, the model will struggle. More examples can help, but quality should be the first area to review.


Need More Help?

Join our Discord support server to chat with our team and get real-time assistance.
