I have been writing and talking about the myth of ‘data is the new oil’ for a while.

Data, especially data of individuals, was never exactly oil although it was portrayed as such.

Let me explain.

I believe that around 2014 many Big Tech companies were releasing AI code while not revealing exactly which data they used to train their models. That’s when everyone started to think: ‘hey, they release the models to everyone, but it is having the data that makes them so powerful. A model with no data to train it is useless.’

Fast forward to 2023: why is Big Tech not more dominant? In fact, even Google’s business model is jeopardised by OpenAI’s ChatGPT, a new player.

Soon ‘reduced’ versions of ChatGPT will have performance ‘similar’ to that of the larger models (I am talking about GPT-J, for example). Simply put, data, even your data, was never exactly oil. And sure, your individual data is somewhat valuable. If I am not mistaken, Azeem Azhar put that value at … $10 per person per year (from his book Exponential).

Why, then, aren’t super-sensitive datasets super valuable?

-Super-large datasets (of consumers, for example) quickly reach a point where the bare accumulation of data does not add value. The data is actually repetitive: same behaviours, different (anonymized) names. Real-time data is the exception, as it changes over time.

-Historic data is actually not that relevant: if you are trying to prevent online fraud, you need to ‘generate’ the ‘new’ attack a hacker would invent, more than just stop the existing ones.

-Even large real datasets can be ‘replicated’ as long as you know the underlying assumptions. Once you know, by extraction from real data (or by assumption), that your customers usually buy within a certain price range or at a certain time, for example, you do not need the specific sensitive information (which, if stored, you are liable to protect).

-Even when real data is available and could be used, it may be costly to preprocess, or there may be legal/compliance issues in actually employing it.
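
To make the ‘replication’ point concrete, here is a minimal sketch in Python of how purchase records could be generated from a couple of summary assumptions instead of from stored customer data. Everything specific in it is hypothetical and only for illustration: the function name `synthetic_purchases`, the price range of 20 to 80, the 7 pm buying peak, and the choice of a uniform and a normal distribution. A real project would fit these to its own data.

```python
import random
from statistics import mean


def synthetic_purchases(n, amount_range=(20.0, 80.0), peak_hour=19,
                        hour_spread=2.0, seed=0):
    """Generate n synthetic purchase records from assumed summary statistics.

    All defaults are illustrative assumptions, not values from real data.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    low, high = amount_range
    records = []
    for i in range(n):
        # Assumption: amounts are spread evenly inside the known range.
        amount = round(rng.uniform(low, high), 2)
        # Assumption: purchase hours cluster around a peak; wrap past midnight.
        hour = int(round(rng.gauss(peak_hour, hour_spread))) % 24
        records.append({
            "customer_id": f"synthetic-{i:05d}",  # no real identity involved
            "amount": amount,
            "hour": hour,
        })
    return records


purchases = synthetic_purchases(1000)
print(purchases[0])
print("average amount:", round(mean(p["amount"] for p in purchases), 2))
```

The records carry the behaviour (how much, and when) without any name, address or card number, so there is nothing sensitive to store or protect.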

These are just a few (but major) reasons why, instead of using real data, many applications can just use generated (i.e. synthetic) data to train AI models.

Other reasons: real data is simply not available (detecting new cybersecurity attacks, for example), or large models (LLMs) naturally need more training data than simple models (logistic regression, SVM, etc.).
In these circumstances, synthetic data is actually the best solution available, hence its adoption is rising. According to Gartner, by 2030 synthetic data will make up the vast majority (more than 70%) of the data used by AI models, and that share is growing.

Fast forward two years: where are we?

First, Large Language Models (LLMs) are indeed on the rise and they need more data than ever. Yet it is hard to give numbers, as companies do not release information about their training (OpenAI with ChatGPT, for example, or Google with Bard). Then there are simulations, of any sort. Unreal Engine 5, for example, can equally be used for videogames, building simulations and, to an extent, drone simulations.
Needless to say, drones are on the rise (aerial, marine and terrestrial). And with them comes the need for systems to train drones to perform tasks ranging from agriculture to defence, as well as for autonomous vehicle (AV) technologies.

Many companies and consultancies are emerging and gaining traction purely by providing synthetic data generation and modelling. And they may provide synthetic data of better quality than real data (a real image, for example, sometimes has bad pixels, or is simply blurry or wrongly labelled).

From what I wrote so far you can easily reach the conclusion that synthetic data is becoming dominant. Not so fast.

AI applications can be trained for the most part on synthetic data… until they reach the live stage.

After that, the re-training will be on real or mixed data, probably 50/50 real and synthetic (depending on the application).
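
As a rough illustration of what such a mixed re-training set could look like, here is a minimal Python sketch. The function `build_retraining_set` and its parameters are hypothetical names of mine, and the 50/50 default simply mirrors the guess above; the right ratio depends on the application.

```python
import random


def build_retraining_set(real, synthetic, size, real_fraction=0.5, seed=0):
    """Sample a re-training set with a chosen share of real records.

    Hypothetical helper for illustration; each record is tagged with its
    source so the mix can be checked afterwards.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be between 0 and 1")
    n_real = round(size * real_fraction)
    n_synth = size - n_real
    if n_real > len(real) or n_synth > len(synthetic):
        raise ValueError("not enough records to honour the requested mix")
    rng = random.Random(seed)  # seeded so the sample is reproducible
    # Sample without replacement from each pool, then shuffle them together.
    mixed = [("real", r) for r in rng.sample(real, n_real)]
    mixed += [("synthetic", s) for s in rng.sample(synthetic, n_synth)]
    rng.shuffle(mixed)
    return mixed


real_records = [f"real-{i}" for i in range(100)]
synthetic_records = [f"synth-{i}" for i in range(100)]
batch = build_retraining_set(real_records, synthetic_records, size=40)
n_real_in_batch = sum(1 for source, _ in batch if source == "real")
print(n_real_in_batch, "real records out of", len(batch))
```

Before go-live, `real_fraction` could sit close to 0; once the application collects its own data, it can be raised step by step.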

Yes, there are applications which may never have real data, like videogames or specific cybersecurity cases. But I would consider that an edge case, not the norm.

Here is really ‘why’ synthetic data are so valuable:

To unlock AI projects that would otherwise never start because of real data limitations or restrictions. In this ‘unlocking’ phase (before the live stage), synthetic data is going to be dominant. Once the AI solution is live, it will consume as much real data as synthetic (if not more). The value of real data is definitely not what we expected it to be five years ago. Or probably it never was. Conversely, synthetic data seems to be perceived as more valuable than it actually is today, especially now that self-labelling techniques are available and affordable. But that’s another story. We shall see.

