The promise and perils of synthetic data

Is it possible for an AI to be trained just on data generated by another AI? It might sound like a harebrained idea. But the idea has been around for quite some time, and as new, real data becomes increasingly hard to come by, it has been gaining traction.

Anthropic used some synthetic data to train one of its flagship models, Claude 3.5 Sonnet. Meta fine-tuned its Llama 3.1 models using AI-generated data. And OpenAI is said to be sourcing synthetic training data from o1, its “reasoning” model, for the upcoming Orion.

But why does AI need data in the first place, and what kind of data does it need? And can this data really be replaced by synthetic data?

The importance of annotations

AI systems are statistical machines. Trained on a lot of examples,…
