Data Management for AI
AI is now ubiquitous in decision-making. The Autopilot feature that lets a Tesla change lanes without driver intervention, the flagging of potentially suspicious money-laundering transactions on a banking platform and FDA approvals of clinical trials for a vaccine all rely on the quality of the data used to train their machine learning models and, subsequently, on the real-world data those models use to make predictions.
Whether an organization buys data externally or sources it internally, good-quality data is never a given. Even today, organizations struggle to ensure that the right data is used for the right purpose with adequate quality.
Traditional data management
Previously, the data life-cycle catered mostly to manual decision-making via BI reports. A typical process to manage data operations for traditional analytical needs is shown below.
But today’s companies are increasingly held accountable by laws, rules and regulations, and by internal stakeholders, to ensure that AI insights that drive regulatory compliance and other business decisions are auditable for source data quality and overall efficacy.
Challenges
Given the increasing importance of data quality and availability, companies have to rethink the way data is managed across the organization. Traditionally, data management has catered to Business Intelligence (BI), which contrasts with today’s world of Big Data and machine learning at scale.
Some of the key challenges forcing companies to rethink data management are:
- Reliability and consistency of source data
- Hard-to-explain models (e.g. CNNs)
- Requirements to keep lineage information
- Population data and model validation
- Data Privacy
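Two of the challenges above, source data reliability and lineage, can be illustrated with a minimal Python sketch. It assumes records arrive as a list of plain dictionaries; the function names (fingerprint, quality_report, lineage_entry) and the shape of the lineage entry are hypothetical and do not come from any particular tool.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a SHA-256 hex digest of the rows in a canonical JSON form.

    Keys are sorted so that key order inside a row does not change the
    digest. Row order still matters in this simple sketch.
    """
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def quality_report(rows, required_fields):
    """Count, per required field, the rows where it is absent or None."""
    missing = {field: 0 for field in required_fields}
    for row in rows:
        for field in required_fields:
            if row.get(field) is None:
                missing[field] += 1
    return {"row_count": len(rows), "missing": missing}


def lineage_entry(source_name, rows, required_fields):
    """Bundle the source name, data fingerprint and quality report together
    (a hypothetical record layout) so they can be stored next to a model."""
    return {
        "source": source_name,
        "fingerprint": fingerprint(rows),
        "quality": quality_report(rows, required_fields),
    }
```

Storing an entry like this for every training run gives an auditor something concrete to compare against: if the fingerprint of the data behind a model differs from the one on record, the data has changed since the model was built.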
Companies are now faced not just with data governance but also with model governance. Model governance, as defined by the Open Risk Manual, is the name for the overall internal framework of a firm or organization that controls the processes for model development, validation and usage, assigns responsibilities and roles, etc.
Although model governance may seem on the surface to be a separate process, it also impacts the data life-cycle and should be thought of as one of the requirements going into operations. Why? Because models do not exist in isolation from the data they were produced from and the data they act upon. In fact, good data governance goes hand in hand with good model governance, with a lot of overlap between the two processes.
Towards better data management for AI
So how do we refactor our existing data life-cycle to include model governance? A good approach should consider various internal factors, but overall the key is to include the AI and model governance requirements upfront, so that in each agile cycle any change to the business requirements also triggers a model governance update. A simplified high-level version of the new process could look like the one shown below.
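The idea of tying each agile cycle to a model governance update can be sketched as a simple check run at the end of a cycle. This is an illustrative sketch only: the ModelRecord fields, the review_reasons function and the reason strings are hypothetical, and a real governance process would involve people and sign-offs, not just a function call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRecord:
    """Hypothetical governance record kept for each deployed model."""
    name: str
    requirements_version: str   # business requirements the model was built for
    dataset_fingerprint: str    # identifier of the data it was trained on
    validated: bool             # whether model validation was signed off


def review_reasons(record, current_requirements_version, current_dataset_fingerprint):
    """Return the reasons, if any, why the model needs a governance review."""
    reasons = []
    if record.requirements_version != current_requirements_version:
        reasons.append("business requirements changed")
    if record.dataset_fingerprint != current_dataset_fingerprint:
        reasons.append("training data changed")
    if not record.validated:
        reasons.append("model not validated")
    return reasons
```

An empty list means the model can pass through the cycle unchanged; any other result means the governance update has to happen before release.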
There are other aspects that need to be considered, such as personnel and the integration points between both processes, which I haven't delved into here for simplicity's sake. These can vary in scale and complexity depending on the type of organization, its maturity, etc.
Data management for AI is getting increasingly complex given the myriad technologies and applications in today’s world. It is important that companies manage data pipelines for AI in a formal, maintainable fashion to ensure long-term success.