Digital Data Strategy and Analytics - A Quality Perspective

An enterprise data strategy, including a data governance maturity roadmap, is a key component of digital transformation. Data is now seen as an asset that needs to be protected, curated, and mined for insights, not only to improve operational margins but also to become a source of competitive advantage. From small startups to global behemoths, ‘becoming data driven’ is a key strategic goal.

Industry studies show that implementing an effective data strategy as part of the digital transformation program has not been easy for many companies. According to this HBR article:

"Cross-industry studies show that on average, less than half of an organization’s structured data is actively used in making decisions—and less than 1% of its unstructured data is analyzed or used at all. More than 70% of employees have access to data they should not, and 80% of analysts’ time is spent simply discovering and preparing data. Data breaches are common, rogue data sets propagate in silos, and companies’ data technology often isn’t up to the demands put on it.”

In fact, cloud transformation is better understood through its linkage with data transformation. While a successful data transformation program can empower the organization at all levels and drive innovation and growth, the cloud strategy delivers the computing power, storage capacity and analytical capabilities that make that transformation possible.

Data strategy should therefore be seen as a critical input to, and driver of, a digital transformation. However, for multiple organizational (and preventable) reasons, one often observes situations where the data transformation program begins later than the cloud migration program. This is a risky scenario, because the specific needs unearthed by the data transformation/digitalization program may differ, at multiple levels, from those of a lift-and-shift migration.

To ensure the goals of the data transformation receive the right level of visibility and priority, program and risk managers should consider establishing a tube map for governance and management review of the transformation program trains, where data transformation is a specific tube line with dependencies on the cloud adoption milestones.

(Need some help creating a tube map? Try diagrams.net, a free tool. Link to a helpful blog.)

While the intention of this article is not to go into the details of a data strategy, it is important to briefly list its key elements and identify the aspects of relevance from a quality, risk and compliance perspective. In other words: what a Quality professional should look for, and where to provide consulting and oversight.

Firstly, what is a data strategy? The HBR article says it is “a coherent strategy for organizing, governing, analyzing, and deploying an organization’s information assets”. A data strategy is implemented via a standardized (reference) set of methods, architectures, usage patterns, services, tools and procedures for collecting, storing, securing, curating/organizing, managing, monitoring, analyzing, consuming, operationalizing and monetizing data.

Key components of a data strategy with relevance for Quality, Risk and Compliance professionals:

(Table: data strategy components and associated controls.)

There have been early successes with pilot projects, but Life Sciences and Health Care organizations expect it to take three to five years to start realizing ‘value’ from AI/ML-enabled analytics. Organizations are, however, acutely conscious that they cannot afford to wait for the technology to mature before venturing into its use. It is by no means a ‘winner takes all’ situation, but unless one starts investing in AI-driven innovation now, it may be too late to catch up and (re)gain competitive advantage.

Companies are therefore adopting a two-pronged approach. The first prong is to get control over the data lifecycle and build a data analytics platform as part of the enterprise data strategy. The second is to begin small-scale projects that build MVPs using data already structured and available for analytics in the cloud.

The players in this ecosystem are:

  a. The Cloud Service Providers (CSPs), with their data governance and analytics services;
  b. The global industry leaders (the Pharma majors, for example, in Life Sciences); and
  c. The specialist vendors of AI software platforms/services, along with the consulting and system integration firms offering their own services to help industry leaders implement a data transformation program.

Even small and medium-sized companies are striving to implement data digitalization programs, but the big investments needed and the long gestation period before commercialization are making them prioritize smaller-scale investments in niche application areas (AI in medical imaging, pharmacovigilance, manufacturing analytics, etc.).

Due to the nature of the technology (evolving rapidly and requiring highly skilled data scientists and neural network programmers), the high initial investment required (justified by the promise of huge returns) and the gestation period (there will be misses and wins along the journey), few companies want to attempt this on their own. The CSPs, with huge ‘skin in the game’, are the ideal primary partners in this journey, but the remaining pie is still big enough for all the IT consulting services companies to roll out digital transformation and AI/ML offerings.

Microsoft has just announced the launch of Azure Purview, a unified data governance service that helps manage and govern data across on-premises, hybrid/multi-cloud and SaaS workloads. At the core of Microsoft’s reimagined approach to data governance and analytics lies Azure Synapse Analytics, a limitless analytics service that unifies data integration, data warehousing, and big data analytics.

AWS, with its data lake and analytics services such as Kinesis, SageMaker and Amazon EMR, offers a purpose-built approach to data management and predictive analytics. For comparison, Google offers roughly similar services under Smart Analytics (BigQuery, Data Fusion, Data Catalog, etc.). Google also offers a native Data Loss Prevention (DLP) service (typically used in sensitive-data de-identification use cases) which one can customize, whereas Azure and AWS rely on third-party tools/services available on their marketplaces for this purpose.
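To make the de-identification use case concrete, here is a minimal sketch using Google’s documented Python client for the DLP service. It is an illustration rather than a production pattern: it assumes the google-cloud-dlp package is installed and credentials configured, and the project id and sample text are placeholders.

```python
# Minimal sketch: de-identify free text with Google Cloud DLP.
# Assumes google-cloud-dlp is installed and application default credentials
# are set up; "your-project-id" is a placeholder.
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/your-project-id"

response = client.deidentify_content(
    request={
        "parent": parent,
        # Detect person names and email addresses in the input text.
        "inspect_config": {
            "info_types": [{"name": "PERSON_NAME"}, {"name": "EMAIL_ADDRESS"}]
        },
        # Replace each finding with its infoType label, e.g. [PERSON_NAME].
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {"primitive_transformation": {"replace_with_info_type_config": {}}}
                ]
            }
        },
        "item": {"value": "Patient John Doe (john.doe@example.com) reported dizziness."},
    }
)
print(response.item.value)  # e.g. "Patient [PERSON_NAME] ([EMAIL_ADDRESS]) reported dizziness."
```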

The top CSPs are forming alliances with industry leaders to explore AI/ML use cases addressing the fundamental challenges in each industry. For example, Novartis, which holds some 2 million patient-years of data, has launched an AI Innovation Lab and Data42, a digital R&D platform.

In the words of Novartis’ Chief Digital Officer Bertrand Bodson:

“We are now able to bring to bear the predictive powers of AI against this massive reservoir of data. We can use those predictive powers to enhance the molecules we’re creating and work with because we can now more precisely probe biological systems. And that can lead to breakthroughs in vital areas such as smart dosing.”

While industry leaders are optimistic about the disruptive power of AI, a digital transformation with an effective data strategy is easier said than done. If you are a Quality professional and your company is considering or implementing a digital project (enterprise-wide or a pilot), you will want to establish an approach for maintaining the analytics solutions in a state of continuous validation.

The approaches to infrastructure qualification and software validation for traditional on-premises and cloud-based solutions are converging, thanks to tool-chain adoption. (We will deal with that topic in the next post.)

The question to be answered: how does one ‘validate’ an application or service that runs on a machine learning algorithm? To begin with, understanding the data is the key determinant of your testing strategy.

As usual, one would ensure a risk evaluation is performed based on the following indicative questions (a minimal triage sketch follows the list):

  1. What is the business use case, that is, which data is being analyzed?
  2. What is the nature of the expected analytical insight?
  3. Will the use of the insights in decision making impact patient safety?
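To show how answers to these questions might feed a repeatable assessment, here is a deliberately simple, hypothetical triage helper. The risk classes and rules are illustrative only, not a regulatory standard; a real assessment needs a documented rationale.

```python
# Illustrative sketch only: map risk-evaluation answers to a risk class.
# The categories and rules below are hypothetical examples.
def triage_risk(patient_safety_impact: bool, decision_is_automated: bool) -> str:
    """Classify an analytics use case as HIGH, MEDIUM, or LOW risk."""
    if patient_safety_impact and decision_is_automated:
        return "HIGH"    # e.g. AI-aided diagnosis with little human review
    if patient_safety_impact:
        return "MEDIUM"  # insights inform clinicians, but humans decide
    return "LOW"         # operational analytics, e.g. demand forecasting

print(triage_risk(patient_safety_impact=True, decision_is_automated=False))  # MEDIUM
```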

For example, if the purpose of the analytics is to deliver insights that lead to operational efficiencies at the manufacturing plant, better drug demand management or improved recruitment into clinical trials, the risk (impact of failure) is deemed lower. If the insights from AI-driven solutions are intended to directly influence the diagnosis or treatment of a patient (for example, review of medical images to aid diagnosis), the risk is deemed higher. A higher risk simply means one should expect a higher degree of accuracy from the algorithm before putting it into production use. Medico-legal considerations also come into play when AI-driven decision making seeks to replace or supplement human intervention.

How does one determine an acceptable level of accuracy? In scenarios such as medical image processing and review, the approach is to set the target accuracy at the level humans achieve when performing the same task. It is important for the validation team to establish a scientific basis for the target production accuracy based on the use case.
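As a sketch of what such a ‘scientific basis’ could look like, the snippet below checks whether a model’s measured accuracy credibly meets an assumed human baseline, using a Wilson score confidence interval. The baseline figure and the counts are invented for illustration.

```python
# Sketch: does the model's accuracy credibly meet a human baseline?
# Uses only the Python standard library; all numbers are illustrative.
import math

def wilson_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the 95% Wilson score interval for a proportion."""
    if total == 0:
        return 0.0
    p = correct / total
    denom = 1 + z**2 / total
    centre = p + z**2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (centre - margin) / denom

human_baseline = 0.92        # assumed: expert accuracy on the same task
correct, total = 1880, 2000  # assumed: model results on a held-out set

lb = wilson_lower_bound(correct, total)
print(f"model accuracy = {correct/total:.3f}, 95% lower bound = {lb:.3f}")
print("meets baseline" if lb >= human_baseline else "does not credibly meet baseline")
```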

Now comes the actual process of building and testing the ML model (supervised or unsupervised learning) to reach the desired accuracy level. The current practice is to use three types of data sets: a training data set, a dev (validation) data set and a (final) test data set. The overall available data is randomized and then divided into training (65%), dev (20%) and test (15%) sets.
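A minimal sketch of that randomize-and-split step, using numpy; the features and labels here are synthetic stand-ins for real data.

```python
# Randomize the data, then split 65/20/15 into train/dev/test sets.
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the split is reproducible

X = rng.normal(size=(1000, 10))       # stand-in features for the example
y = rng.integers(0, 2, size=1000)     # stand-in binary labels

idx = rng.permutation(len(X))         # shuffle indices before splitting
n_train = int(0.65 * len(X))
n_dev = int(0.20 * len(X))

train_idx = idx[:n_train]
dev_idx = idx[n_train:n_train + n_dev]
test_idx = idx[n_train + n_dev:]      # remaining ~15%

X_train, y_train = X[train_idx], y[train_idx]
X_dev, y_dev = X[dev_idx], y[dev_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(len(X_train), len(X_dev), len(X_test))  # 650 200 150
```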

The training data set is fed into the program to allow the ‘neural network model’ to analyze it and train itself. One can train different models using the same training data set. The models are then tested against the dev set and tweaked for better results based on the output accuracy or error rate. This exercise is iterated until the models arrive at their optimal level of accuracy for the given datasets.

Once all the models are evaluated, the best performing ones are exposed to the test data to confirm the dev-set results. This iterative testing cycle will need to continue in production too.
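Putting these steps together, here is an illustrative scikit-learn sketch of the cycle: train several candidates on the same training set, compare them on the dev set, and let only the winner see the test set, once. The models, hyperparameters and synthetic dataset are arbitrary examples, not recommendations.

```python
# Sketch of the train / dev-select / test-confirm cycle with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# 65/20/15 split, as described above.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.65, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, train_size=20 / 35, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf_100": RandomForestClassifier(n_estimators=100, random_state=0),
    "rf_300": RandomForestClassifier(n_estimators=300, random_state=0),
}

# Train every candidate on the same data, compare on the dev set.
dev_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    dev_scores[name] = accuracy_score(y_dev, model.predict(X_dev))
    print(f"{name}: dev accuracy = {dev_scores[name]:.3f}")

# Only the dev-set winner is confirmed against the untouched test set.
best = max(dev_scores, key=dev_scores.get)
print(f"best on dev: {best}, test accuracy = "
      f"{accuracy_score(y_test, candidates[best].predict(X_test)):.3f}")
```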

The challenge does not end there. It has been observed across multiple projects that models tuned to near perfection in the lab fail to deliver equivalent performance in the real world (production usage). Data shift and underspecification are considered the main reasons for these failures.

Data shift is the mismatch, or delta, between the data used to train and test the AI program and the data it receives in production. This risk can be mitigated only by creating a training data set that covers all the variables one is aware of. For example, if you want to train your program to read and analyze medical images, and your training data contains only properly processed, clean images, the program will struggle when a user uploads a low-resolution or blurred image taken with a mobile camera.
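One common way to detect such a shift in production is a population stability index (PSI) check, comparing a feature’s training distribution against incoming data. A minimal sketch follows; the usual 0.1/0.25 thresholds are rules of thumb, not a standard.

```python
# Sketch: population stability index (PSI) to flag data shift for one feature.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference (training) sample and a new (production) sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty bins at a tiny probability to avoid log(0); note that
    # production values outside the training range fall out of the bins,
    # so a production-grade check would handle the tails explicitly.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
training_feature = rng.normal(0.0, 1.0, 10_000)    # what the model was trained on
production_feature = rng.normal(0.4, 1.2, 10_000)  # shifted production traffic

print(f"PSI = {psi(training_feature, production_feature):.3f}")
# Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant shift.
```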

Underspecification is a term borrowed from statistics and linguistics, and per a Google study, it is one of the chief causes of failure in AI/ML models. The simplest way to grasp it is via Goodhart’s Law: “when a measure becomes a target, it ceases to be a good measure.”
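A toy illustration of the idea: models that look equally good on the dev set can diverge once the data shifts, with nothing but the random seed distinguishing them. The dataset and the simulated shift below are synthetic, so the exact numbers will vary; the pattern, not the values, is the point.

```python
# Sketch: underspecification. Same data, same architecture, different seeds;
# dev scores tend to cluster while shifted-data scores can spread apart.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=3000, n_features=20, n_informative=5, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.3, random_state=0)

# Simulate production drift by adding noise the training data never saw.
rng = np.random.default_rng(1)
X_shifted = X_dev + rng.normal(0, 1.5, X_dev.shape)

for seed in (0, 1, 2):
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
    model.fit(X_train, y_train)
    print(f"seed={seed}: dev accuracy = {model.score(X_dev, y_dev):.3f}, "
          f"shifted accuracy = {model.score(X_shifted, y_dev):.3f}")
```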

If you do not wish to download and read the very technical Google report, you can read the more lucid MIT Technology Review article on the same topic.

One could feel unnerved by the steep learning curve and the challenge of building collective competencies within the organization to collaboratively address the risks and realize value. On closer inspection, however, testing an AI/ML solution is not drastically different from what we have learned to do with transactional and analytical applications. The principles are the same; only the tools and the scale differ, and the process needs to drive a continuous loop of improvement.

To conclude, quality and assurance functions have both a challenge and a huge opportunity to contribute to the success of the digital transformation of data, especially in the evolving area of AI/ML systems validation. Standards and processes need to be established for the creation and management of training and test data sets, and a ‘data culture’ should be evangelized across the organization and its external partners. A typical project team could combine in-house domain and data experts with SMEs from the CSPs and third-party product or system integration partners. The quality approach and goals have to be internalized by all stakeholders. This is by no means impossible, and is a requirement for any large-scale transformation project anyway. However, the high hopes placed on AI-driven analytics to improve outcomes, and the rapidly evolving nature of the discipline itself, make this the right moment for individuals to fashion themselves into Quality Program Managers or Data Analytics Assurance Consultants on such projects.
