Questions every VC needs to ask about every AI startup’s tech stack
Interrogate the hype to find the winners
By Leonard Wossnig, CTO at LabGenius | Written for TechCrunch and originally published September 18th 2023.
From fraud detection to agricultural crop monitoring, a new wave of tech startups has emerged, all armed with the conviction that their use of AI will address the challenges presented by the modern world.
However, as the AI landscape matures, a growing concern comes to light: The heart of many AI companies, their models, are rapidly becoming commodities. A noticeable lack of substantial differentiation among these models is beginning to raise questions about the sustainability of their competitive advantage.
Instead, while AI models continue to be pivotal components of these companies, a paradigm shift is underway. The true value proposition of AI companies now lies not just within the models, but also predominantly in the underpinning datasets. It is the quality, breadth, and depth of these datasets that enable models to outshine their competitors.
However, in the rush to market, many AI-driven companies, including those venturing into the promising field of biotechnology, are launching without the strategic implementation of a purpose-built technology stack that generates the indispensable data required for robust machine learning. This oversight carries substantial implications for the longevity of their AI initiatives.
As seasoned venture capitalists (VCs) will be well aware, it’s not enough to scrutinize the surface-level appeal of an AI model. Instead, a comprehensive evaluation of the company’s tech stack is needed to gauge its fitness for purpose. The absence of a meticulously crafted infrastructure for data acquisition and processing could potentially signal the downfall of an otherwise promising venture right from the outset.
In this article, I offer practical frameworks derived from my hands-on experience as both CEO and CTO of machine learning–enabled startups. While by no means exhaustive, these principles aim to provide an additional resource for those with the difficult task of assessing companies’ data processes and the resulting data’s quality and, ultimately, determining whether they are set up for success.
From inconsistent datasets to noisy inputs, what could go wrong?
Before jumping into the frameworks, let’s first look at the basic factors that come into play when assessing data quality and, crucially, at what could go wrong if the data is not up to scratch.
Relevance
First, let’s consider datasets’ relevance. Data must intricately align with the problem that an AI model is trying to solve. For instance, an AI model developed to predict housing prices necessitates data encompassing economic indicators, interest rates, real income, and demographic shifts.
Similarly, in the context of drug discovery, it’s crucial that experimental data exhibits the highest possible predictiveness for the effects in patients, requiring expert thought about the most relevant assays, cell lines, model organisms, and more.
Accuracy
Second, the data must be accurate. Even a small amount of inaccurate data can have a significant impact on the performance of an AI model. This is especially pertinent in medical diagnoses, where a small error in the data could lead to a misdiagnosis and potentially affect lives.
Coverage
Third, coverage of data is also essential. If the data is missing important information, then the AI model will not be able to learn as effectively. For example, if an AI model is being used to translate a particular language, it is important that the data includes a variety of dialects.
For language models, this distinction is referred to as a “low resource” versus “high resource” language dataset. Good coverage also requires a complete understanding of the confounding factors that affect the outcome, which typically requires the collection of metadata.
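As a rough illustration of what a coverage check might look like in practice, the sketch below counts how many records fall into each group and flags groups that make up only a small share of the dataset. The function name, the dialect labels, and the 5% threshold are all hypothetical choices made for this example, not a standard.

```python
from collections import Counter


def coverage_report(groups, min_share=0.05):
    """Return each group's count and share, flagging groups below min_share.

    `groups` is one metadata label per record (e.g. the dialect of a sentence).
    `min_share` is an arbitrary illustrative threshold, not an industry standard.
    """
    counts = Counter(groups)
    total = sum(counts.values())
    report = {}
    for group, n in counts.items():
        share = n / total
        report[group] = {"count": n, "share": share, "low_resource": share < min_share}
    return report


# Hypothetical dialect labels attached as metadata to each training sentence.
dialects = ["standard"] * 90 + ["northern"] * 8 + ["coastal"] * 2
report = coverage_report(dialects)
low = sorted(group for group, row in report.items() if row["low_resource"])
print(low)  # ['coastal']
```

A check like this is only possible if the metadata was collected in the first place, which is why asking a startup what metadata it records is often a revealing question.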
Bias
Finally, data bias also warrants rigorous consideration. Data should be captured in a way that keeps human prejudice and sampling bias out of the model. For instance, image recognition data should minimize stereotypes. In drug discovery, datasets should encompass both successful and unsuccessful molecules to avoid skewed outcomes. Where these conditions are not met, the data would be considered biased, and a model trained on it is likely to lose its ability to make novel predictions.
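The drug discovery example can be made concrete with a very simple balance check. The sketch below is an assumption-laden illustration: the label names, the 50% reference point, and the tolerance are invented for this example, and many real datasets are legitimately imbalanced. The point is only that a dataset containing nothing but successes is easy to detect.

```python
def outcome_balance(labels, positive="success"):
    """Return the fraction of records labeled with the positive outcome."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for label in labels if label == positive) / len(labels)


def looks_skewed(labels, positive="success", tolerance=0.4):
    """Flag a dataset whose positive share is more than `tolerance` away from 50%.

    Both the 50% reference point and the tolerance are illustrative assumptions.
    """
    return abs(outcome_balance(labels, positive) - 0.5) > tolerance


# Hypothetical assay outcomes: in the first set, only molecules that worked were recorded.
only_successes = ["success"] * 50
mixed = ["success"] * 20 + ["failure"] * 30
print(looks_skewed(only_successes), looks_skewed(mixed))  # True False
```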
The repercussions of subpar data shouldn’t be underestimated. At best, they result in a model that underperforms, and at worst, they render the model entirely ineffective. This can lead to financial losses, missed opportunities, and even physical harm.
Similarly, if the data is biased, the models will produce biased results, which can foster discrimination and unjust practices. This has been a particular concern with large language models, which have come under recent scrutiny for perpetuating stereotypes.
Compromised data quality also has the potential to erode effective decision making, which can ultimately result in poor business performance.
Framework 1: Tech stack pyramid for data generation
To avoid investment in ineffectual AI startups, there is a need to first evaluate the processes behind the data. Picturing a company’s tech stack as a pyramid is a good place to start, where the foundational tiers tend to have the biggest impact on the predictive outcome. Without this solid base, even the best data analysis and machine learning models face significant constraints.
Here are some basic questions that a VC might initially ask to figure out if a startup’s data generation process can actually create usable results for AI:
Receiving robust answers to these questions can help determine a company’s grasp of the underpinning principles of their data pipelines. This understanding, in turn, will help gauge the quality of the model’s output.
Framework 2: The five V’s of data quality
Once a company’s tech stack has been deemed suitable for AI, there is also a need to carefully consider the quality of the resulting data being used to train its models. A common framework for classifying data quality is the five V’s of data quality. They represent five key dimensions of data quality that VCs should consider when evaluating AI startups:
Here are some introductory questions to help evaluate a company’s data for the five V’s:
By carefully considering the five V’s of data quality, VCs can make sure they are investing in AI startups that have the data they need to succeed. If the startup can answer the above questions convincingly and their data scores highly in the five dimensions, it is a good sign that they are serious about data quality and are properly equipped to apply their AI models.
Finally, VCs should assess the startup’s commitment to data security. This includes things like their data governance policies, their data quality assurance procedures, and their data breach response plans.
Interrogate the hype to find the winners
Amid the resounding buzz surrounding AI in recent months, the allure of substantial investments has attracted startup founders willing to exaggerate their infrastructure and inflate capabilities in the search for capital.
The successful VCs are asking the right questions to interrogate these companies thoroughly and filtering out the potential winners built on a solid foundation from those with a hollow shell that are ultimately destined to fail.
Dr. Leonard Wossnig is the CTO of LabGenius, a next-generation antibody discovery company, where he leads a team of Data Scientists, Software Engineers, and Automation Experts to further enhance the company’s data-driven platform capabilities.
Before joining LabGenius, Leonard was CEO and co-founder of quantum machine learning company, Rahko. Post its 2021 acquisition by Odyssey Therapeutics, he became VP of Machine Learning, where he spearheaded the development of a computational platform for generative drug design in the areas of cancer and inflammation.
Leonard is also an Honorary Research Fellow at the University College London’s Department of Computer Science.