The end of 'data is oil' for BIG TECH: the ascent of synthetic data
About 6-7 years ago, major AI players (Google, FB, Microsoft, you name it) started to release their AI code and libraries. Sometimes even pretrained models (like word2vec) were made readily and easily available.
The common rationale at the time was: Big Tech can release the code knowing very well that only they have the data to train the models. Hence, their power was, and would always be, undisputed. A sophisticated model without the data is meaningless.
This, of course, generated various views on how data should be treated, who should own it, and what the consequences are.
In practice, 6-7 years ago it seemed obvious that Big Tech would always rule the vendor market and control the algorithms since, apart from public institutions, they were the only ones with petabytes of data (or more) from customers and aggregated sources. That was the undisputed source of their immense power, and the 'data is oil' kind of thinking was believed to be true.
Today is different. Very.
We now have reason to believe that the power of controlling 'real data' is actually not as strong as we thought.
As director of AI Technologies, I gave an entire talk at Hack 2021 (https://athack.com/), and will soon speak at the IDC CIO Summit (https://www.idc.com/mea/events/69074-idc-middle-east-cio-summit-2022#section_7_), about how the future actually relies more on 'synthetic data', and on the ways to generate it, than on real data (even for Big Tech).
Yes, there have been regulations and restrictions on how customer data can be used by private companies (GDPR in the EU), and that certainly affects the results. But this is not the main reason.
Here are a couple of examples of why real data is likely not enough in the majority of cases:
- For the GPT-3 model, the entire English Wikipedia contributes only about 1% of all training data... the other 99% has to come from somewhere else. A challenge for everyone, even a Big Player.
- Historic data is actually not that relevant: if you are trying to prevent online fraud, you need to 'generate' how a hacker would invent a 'new' attack, not just stop the existing ones.
- Even large real datasets can be 'replicated' as long as you know the underlying assumptions. Once you know, by extraction from real data (or by assumption), that, for example, your customers usually buy within a certain price range or at certain times, you do not need the specific sensitive information (which, if stored, you are liable to protect). See the sketch right after this list.
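To make that last point concrete, here is a minimal sketch (in Python, using NumPy and pandas) of how a dataset can be 'replicated' once the underlying assumptions are known. Every parameter below (the amount range, the log-normal shape, the two purchase-time peaks) is a hypothetical assumption chosen purely for illustration, not extracted from any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n = 10_000

# Assumed pattern (hypothetical): purchase amounts are roughly log-normal,
# clipped to a 5-500 EUR range.
amounts = np.clip(rng.lognormal(mean=3.5, sigma=0.7, size=n), 5, 500).round(2)

# Assumed pattern (hypothetical): purchases cluster around lunchtime and
# early evening (a simple two-peak mixture over the hour of day).
peak = rng.choice([12.5, 19.0], size=n, p=[0.45, 0.55])
hour_of_day = np.clip(rng.normal(loc=peak, scale=1.5), 0, 23.99).round(2)

# The resulting table mimics the assumed behaviour but contains no real
# customer record: the IDs are random and carry no mapping to real people.
synthetic = pd.DataFrame({
    "customer_id": rng.integers(1, 2_000, size=n),
    "amount_eur": amounts,
    "hour_of_day": hour_of_day,
})

print(synthetic.describe())
```

In a real project the parameters would be estimated once from the real data (or agreed with the client), and from then on only the generator, not the sensitive records, needs to be kept and protected.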
We are essentially reaching the point where the mere accumulation of data does not necessarily add value for the company.
Broadly speaking, nobody would have predicted that 6 years ago. The 'power' and the 'value' now lie more in how 'cleverly' the dataset has been created than in its mere quantity. Quality over quantity.
Case in point: Gartner has predicted that by 2030 the vast majority of the data used in artificial intelligence will be synthetic.
At AI Technologies, we have worked extensively with synthetic data over the past few years, and we never thought we had a special 'skillset': we simply needed to generate 'dummy' data because, naturally, many of our clients did not have access to data.
Necessity is truly the mother of invention, and now we can unlock 'many' projects in which data is scarce for various reasons (at a reduced cost).
The good news for all AI practitioners is that, even without real data, in many cases we can successfully help our clients deliver effective solutions.
#ai #artificialintelligence #bigdata #cyber #compliance #digitaltransformation