How much data do I really need to start my fancy AI start-up?
Markus Feigelbinder
CPO and Scaling Coach for B2B SaaS/DeepTech founders. Multiple CEO/CPO/CMO, Ex-Amazon, Ex-Westwing.
These days, and to the surprise of my mother, many people listen to my talks at digital, travel and tech conferences (and often even pay for it). Once I've left the stage, it usually takes only seconds until the first person approaches me with a question like this: "I love your product. Tell me, how much data do I really need to start my fancy AI start-up / project / idea?".
Certainly, data is the currency in artificial intelligence. Many AI experts will tell you that the amount and type of data you have determine whether your AI project becomes a success or a flop. But if you are considering investing in AI capabilities, you first need to understand that in the next five years or so, machines and applications will become more intelligent and less artificial. They will rely more on top-down reasoning, resembling the way humans approach tasks and problems, and less on bottom-up big data.
So how much data is required to get my AI project on its feet?
Many executives and innovators face this question when looking to invest in AI and machine learning, and, unfortunately, no single 'best' answer has been found yet. 'Real-life', business-relevant data is expensive and increasingly difficult to find, now that regulations such as GDPR place limits on the use of private data. With that in mind, the focus in AI has switched to developing systems that consume less data without decreasing the quality of the output.
Training data: how much is enough?
Most of the current AI systems are based on the bottom-up approach or deep neural models. This approach entails the use of simple solutions and subsystems, which are then interlinked to form complex systems. The main limitation of bottom-up systems is that you need to feed them with large volumes of training data to get them working effectively. For example, if you are developing an AI system that responds to different customer queries, you have to feed it with data on as many customer behaviour patterns as possible. The system will not be useful if it encounters a situation it's not trained on.
The amount of data needed to train a system sufficiently depends on different factors:
- The complexity of the system: The more parameters in the project, the more data required to train the machine. A system that recognizes one specific kind of object will require much less training data than one that accepts open-ended input and makes choices based on it.
- The training method used: Different training methods have different learning curves and data requirements. For example, systems trained using structured learning methods need less training data than those that rely on deep learning models.
- Diversity of input data: If many types of input are expected, you will need more training data to coach the system to respond to each input type effectively.
- Error tolerance: The purpose of your AI or machine learning project will dictate its tolerance for errors. For example, mistakes can be tolerated in customer service systems but not in patient support machines. Machines with low error tolerance require more data to train.
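To make these factors a bit more tangible, here is a minimal sketch of one common way to get a rough estimate: train on a few small pilot datasets, fit a learning curve, and extrapolate to the error you can tolerate. The power-law shape of the curve is an assumption (it often fits in practice, but not always), and the pilot numbers below are purely hypothetical. The helper names `fit_power_law` and `samples_needed` are made up for this illustration.

```python
import math


def fit_power_law(sizes, errors):
    """Fit error ~ a * n**(-b) by least squares on the log-log values."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # a, b


def samples_needed(a, b, target_error):
    """Invert error = a * n**(-b) to get the dataset size for a target error."""
    return math.ceil((a / target_error) ** (1.0 / b))


# Hypothetical pilot results: validation error measured at four dataset sizes.
pilot_sizes = [100, 200, 400, 800]
pilot_errors = [0.30, 0.24, 0.19, 0.15]

a, b = fit_power_law(pilot_sizes, pilot_errors)
for target in (0.10, 0.05):  # a tolerant system vs. a strict one
    n = samples_needed(a, b, target)
    print(f"target error {target:.2f}: roughly {n} training examples")
```

A stricter error tolerance pushes the required dataset size up quickly, and a flatter curve (a more complex system or more diverse input) makes it worse. Treat the result as an order-of-magnitude guess, not a promise.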
Human-like intelligence: The future of AI
As it becomes more difficult and expensive to get large volumes of data, businesses are shunning the traditional bottom-up systems for top-down approaches. Some of the big names that are using top-down AI systems are companies as diverse as Alphabet, Vicarious (developing artificial general intelligence for robots) or even engineering giant Siemens.
Top-down systems mimic human intelligence; they are more flexible, faster, and consume less data than deep neural networks. When evaluating the AI approach to adopt in your business, you should have the following development areas in mind:
- Need for efficient robot reasoning: Top-down systems outperform deep neural networks in terms of data efficiency. Vicarious recently built a recursive cortical network (RCN) system that solves CAPTCHAs; the system is 300-fold more data-efficient than bottom-up systems. RCNs imitate human cognitive processes since they use the data they process to learn something new.
- Giving AI systems common sense: There is a need to develop machine systems that do not rely solely on training data. Such systems can learn from experience, communicate, and deal with unprecedented situations effectively.
- Mimicking human expertise: Businesses need to come up with data-efficient systems that can be as good as human workers. This should include modelling how a human expert would solve the problem at hand when given scarce data.
- Making better decisions: Humans evaluate different possible action paths before choosing the best one. Machines can be modelled to follow such reasoning by applying probabilistic models such as Gaussian processes.
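For the last point, a minimal sketch of what "evaluating action paths with a Gaussian process" can look like, assuming each action can be described by a single number (say, a discount level) and you have tried only a handful of them. Everything here is illustrative: the data is hypothetical, the kernel and its length scale are assumptions, and the upper-confidence-bound rule in `best_action` is just one of several ways to trade off expected payoff against uncertainty.

```python
import math


def rbf(x1, x2, length_scale=1.0):
    """Squared-exponential kernel: nearby actions get similar outcomes."""
    return math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def gp_posterior(train_x, train_y, query_x, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at query_x."""
    n = len(train_x)
    K = [[rbf(train_x[i], train_x[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(query_x, x) for x in train_x]
    alpha = solve(K, train_y)   # K^-1 y
    v = solve(K, k_star)        # K^-1 k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(query_x, query_x) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)


def best_action(train_x, train_y, candidates, kappa=1.0):
    """Pick the candidate with the highest mean + kappa * std (UCB rule)."""
    def score(x):
        mean, var = gp_posterior(train_x, train_y, x)
        return mean + kappa * math.sqrt(var)
    return max(candidates, key=score)


# Hypothetical data: four settings tried so far and the payoff observed.
tried_settings = [0.0, 1.0, 2.0, 3.0]
observed_payoffs = [0.1, 0.6, 0.9, 0.4]
candidates = [i * 0.25 for i in range(17)]  # 0.0 to 4.0 in steps of 0.25

print("next setting to try:",
      best_action(tried_settings, observed_payoffs, candidates))
```

The point is that four observations are enough to get both a prediction and an honest estimate of how unsure the model is, which is exactly the kind of data efficiency this section is about. For real projects you would use an established library instead of hand-rolled linear algebra.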
The bottom line
Traditional bottom-up AI systems require lots of training data to work efficiently. The amount of training data needed depends on factors such as the complexity of the models and error tolerance.
Moving forward, data is going to be more expensive and difficult to find.
There's no need to have a million unnecessary data points if just 100 detailed and clean data points can serve the intended purpose. Whatever you are building, give it the best chance of success by starting from a solid foundation. So, if you are kicking off an AI startup or looking to invest in AI, you should consider focusing on top-down AI systems, because that's where AI is heading. Such systems aim to mimic human intelligence and are, by far, more data-efficient than traditional deep neural networks.
What is your experience with data-heavy products or business models?
Cheers, Markus