
CEO Guide to Production ML

Venkata Pingali

Scribble Data | AI for Finance | Knowledge Agents | Co-Founder

Published: May 4, 2020

Productionization, or operationalization, of machine learning is the process of making machine learning models run every day, reliably, and integrated into data products. Standalone, ad hoc development of models is no longer compelling (it doesn't make nearly as much business sense). "The majority (85%) of respondent organizations are evaluating AI or using it in production" according to "AI adoption in the enterprise 2020". This is the trend across industries, geographies, and scales.

Productionization has proven to be more difficult than people expected. Very few ML models reach the production stage. The main challenge is robustness. According to the same report, "Whether it's controlling for common risk factors - bias in model development, missing or poorly conditioned data, the tendency of models to degrade in production - or instantiating formal processes to promote data governance, adopters will have their work cut out for them as they work to establish reliable AI production lines."

ML Platforms

Companies that are serious about ML are building platforms to build models reliably and at scale. Uber has Michelangelo, Stripe has Railyard, Airbnb has Bighead, and Swiggy has DSP. Uber shared its motivation for Michelangelo: "there were no systems in place to build reliable, uniform, and reproducible pipelines for creating and managing training and prediction data at scale. Prior to Michelangelo, it was not possible to train models larger than what would fit on data scientists' desktop machines, and there was neither a standard place to store the results of training experiments nor an easy way to compare one experiment to another. Most importantly, there was no established path to deploying a model into production".

Google's engineers had already laid out what these platforms need to cover in "Hidden Technical Debt in Machine Learning Systems" (2015):

Figure: Distribution of effort in a production machine learning system

The structure of the problem remains the same whether the model is simple or complex, and whether development happens in a small company or a large one. It stems from the nature of the problems that arise when we go from ad hoc modeling to continuous operation of models. Platforms are used to standardize and manage the development, deployment, and operation of machine learning models in order to achieve robustness and grow usage.

There is no standard platform design. Each organization is learning by doing. Uber shared the lessons learnt after multiple years of operation of their platform. A couple of them are:

Models need to be monitored: "model monitoring and instrumentation is a key component of real world machine learning solutions"
Data is the hardest thing to get right: "data engineers spend a considerable percentage of their time running extraction and transformation routines over datasets"
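
The monitoring lesson can be made concrete with a small sketch. The Python below is an illustration only, not Uber's implementation: the class name, the window size, and the threshold are all assumptions chosen for the example. It keeps a rolling window of recent predictions whose true labels have since arrived, and flags the model once accuracy over that window drops below a chosen level.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent `window` labelled predictions.

    Hypothetical helper for illustration; window and threshold are assumptions.
    """

    def __init__(self, window: int = 100, threshold: float = 0.8):
        self.threshold = threshold
        # A deque with maxlen drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=window)

    def record(self, predicted, actual) -> None:
        """Store whether a prediction matched the label that arrived later."""
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> float:
        """Accuracy over the current window (1.0 when nothing is recorded yet)."""
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self) -> bool:
        """True once the window is full and accuracy is below the threshold."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.threshold


monitor = RollingAccuracyMonitor(window=4, threshold=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)

print(monitor.accuracy())         # 0.5
print(monitor.needs_attention())  # True
```

In a real deployment the alert would feed a notification channel, and the threshold would come from the business tolerance for errors rather than a fixed constant.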

MLOps (ML Operations)

As every organization tries to build or buy its own Michelangelo, the way ML models are developed is changing. MLOps (DevOps for ML) is the new framing and is growing in importance. The major sub-areas of MLOps are:

1. DevOps for Models - Develop and deploy models (e.g., Domino Data Lab)
2. DevOps for Data - Preparing and monitoring data (e.g., Tecton)

For #1 (DevOps for Models), some of Domino Data Lab's workbench capabilities include:

Model development including A/B testing
Exploration of large datasets
Automatic tracking for reproducibility, reusability, and collaboration
Scalable compute and deployment management
Reports, dashboards, and API for model output
Deep integration with major compute platforms such as Kubernetes and Spark
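
Of the capabilities above, automatic tracking for reproducibility is the easiest to picture in code. The sketch below is not Domino Data Lab's API. The function names, the record layout, and the sample data, parameters, and metric values are all made up for illustration. The idea is simply that every training run stores a fingerprint of its data and parameters next to its results, so two runs can be compared or repeated later.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Short, stable content hash used to identify an exact input."""
    return hashlib.sha256(payload).hexdigest()[:12]


def build_run_record(training_data: bytes, params: dict, metrics: dict) -> dict:
    """Collect what is needed to reproduce and compare one training run."""
    # sort_keys makes the hash independent of dictionary ordering.
    canonical_params = json.dumps(params, sort_keys=True).encode("utf-8")
    return {
        "data_hash": fingerprint(training_data),
        "params_hash": fingerprint(canonical_params),
        "params": params,
        "metrics": metrics,
    }


# Illustrative inputs only; none of these values come from a real experiment.
data = b"age,income,label\n34,52000,1\n51,87000,0\n"
run_a = build_run_record(data, {"depth": 3, "lr": 0.1}, {"auc": 0.81})
run_b = build_run_record(data, {"lr": 0.1, "depth": 3}, {"auc": 0.81})

# Same data and same parameters (in any order) give the same identity.
print(run_a["data_hash"] == run_b["data_hash"])      # True
print(run_a["params_hash"] == run_b["params_hash"])  # True
```

A production tracker would also record the code version and environment, but the principle is the same: no result without the identity of what produced it.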

And as for #2, DevOps for Data, Tecton.ai is the hottest new company in the space. Its platform's capabilities include:

Feature Pipelines for transforming your raw data into features or labels
A Feature Store for storing historical feature and label data
A Feature Server for serving the latest feature values in production
An SDK for retrieving training data and manipulating feature pipelines
A Web UI for managing and tracking features, labels, and data sets
A Monitoring Engine for detecting data quality or drift issues and alerting
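
To show how a feature pipeline, a feature store, and a feature server fit together, here is a toy in-memory sketch. It is not Tecton's SDK: the class, the method names, and the `order_count` feature are hypothetical. What it illustrates is that one store answers two different questions, "what is the latest value?" for serving and "what was the value at that time?" for building training data.

```python
from bisect import bisect_right
from collections import defaultdict


class InMemoryFeatureStore:
    """Toy feature store: keeps timestamped feature values per entity."""

    def __init__(self):
        # (entity_id, feature_name) -> list of (timestamp, value), sorted by time
        self._history = defaultdict(list)

    def write(self, entity_id, feature, timestamp, value):
        rows = self._history[(entity_id, feature)]
        rows.append((timestamp, value))
        rows.sort(key=lambda row: row[0])

    def latest(self, entity_id, feature):
        """Serving path: the most recent value, or None if unknown."""
        rows = self._history.get((entity_id, feature))
        return rows[-1][1] if rows else None

    def as_of(self, entity_id, feature, timestamp):
        """Training path: the value that was known at `timestamp`."""
        rows = self._history.get((entity_id, feature), [])
        times = [t for t, _ in rows]
        index = bisect_right(times, timestamp)
        return rows[index - 1][1] if index else None


def order_count_pipeline(raw_orders, store):
    """Feature pipeline: turn raw (customer, timestamp) events into a running count."""
    counts = defaultdict(int)
    for customer_id, timestamp in sorted(raw_orders, key=lambda o: o[1]):
        counts[customer_id] += 1
        store.write(customer_id, "order_count", timestamp, counts[customer_id])


store = InMemoryFeatureStore()
order_count_pipeline([("c1", 10), ("c1", 20), ("c2", 15), ("c1", 30)], store)

print(store.latest("c1", "order_count"))     # 3
print(store.as_of("c1", "order_count", 25))  # 2
print(store.as_of("c2", "order_count", 5))   # None
```

The `as_of` lookup is what keeps training data honest: a model trained on values that were not yet known at prediction time will look better in development than it performs in production.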

Because both Scribble Data and Tecton are informed by the design principles behind Uber's Michelangelo, the functionality offered by the two platforms overlaps substantially, though they operate at different scales.

ML Productionization Journey

Gartner, McKinsey, and others have articulated the challenges organizations face when they embark on the ML journey. Here are a few recommendations for extracting business value from ML, based on the industry consensus and our experience:

Owning the ML solution process and outcomes

Move to a new way of building systems. ML models and systems are probabilistic in design and operation. This is very different from traditional software development driven by fixed requirements. Internalizing the uncertainty in ML is critical for success.
Accept that ML is NOT magic. Making ML work takes effort, often upfront. Increasing performance and accuracy is an iterative process requiring tools, experimentation, and revised processes.
Recognize new risks and opportunities. ML algorithms and data usage bring organizations within the purview of new privacy and algorithmic accountability laws, directly and indirectly. They also enable companies to build new data products at a pace and with a differentiation that wasn't possible before.

Setting up for success

Pick the right problems and approaches. A lot of time is wasted pursuing problems that don't yield a good ROI for the organization or that cannot realistically be solved with existing data. Mature teams invest in good problem selection, evaluation metrics, development process, and integration into the product. Experience matters here.
Build end-to-end discipline. ML is ultimately linear algebra or some other math. Correct operation of ML requires discipline in all phases of the lifecycle, from planning and data collection to model operations. Organizations tend to focus narrowly on the model and ignore the rest. Even the modeling phase is often chaotic. Developing and enforcing discipline is a must.
Design for learning. All ML models degrade over time (in fact, the degradation starts the moment training is over), and we learn over time what matters: data quality, corner cases, etc. Continuous monitoring and improvement should be a core part of the design of any ML solution.
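
One common way to watch for the degradation described above is to compare the data a model sees in production against the data it was trained on. The sketch below uses the Population Stability Index (PSI), a technique swapped in here for illustration rather than one named in this article. The function name, the bin count, and the sample data are assumptions. A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, though the right cut-off depends on the use case.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample and a recent sample of one numeric feature."""
    low, high = min(expected), max(expected)
    width = (high - low) / bins or 1.0  # guard against a constant feature

    def bucket_shares(values):
        counts = [0] * bins
        for value in values:
            # Clamp so out-of-range production values land in the edge buckets.
            index = int((value - low) / width)
            counts[max(0, min(bins - 1, index))] += 1
        # A small floor avoids log(0) and division by zero for empty buckets.
        return [max(count / len(values), 1e-6) for count in counts]

    expected_shares = bucket_shares(expected)
    actual_shares = bucket_shares(actual)
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected_shares, actual_shares)
    )


# Illustrative data: a feature spread evenly between 0 and 1 at training time.
training = [i / 100 for i in range(100)]
unchanged = list(training)
shifted = [value + 0.3 for value in training]

psi_unchanged = population_stability_index(training, unchanged)
psi_shifted = population_stability_index(training, shifted)

print(psi_unchanged)       # 0.0
print(psi_shifted > 0.25)  # True
```

Run on a schedule for each important input feature, a check like this turns "models degrade" from a surprise into a routine alert.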

Providing the right infrastructure

Use tools for standardization and automation. ML development and operational processes are iterative, laborious and error-prone. Cutting time and effort at every phase through standardization, simplification, validation, and automation helps.
Provide checks and balances. The core value of ML is in the data and the algorithms. Risks to the organization include lost data, knowledge lost when staff leave, and decisions that can't be defended to clients and other stakeholders. Tools that provide checks and balances during all phases of ML are critical to protecting the value that ML creates for the organization.

The Journey

A sample journey could be as follows:

Phase 1 (1 use case): Select basic infrastructure, put it in place, and identify one use case. Design for continuous usage from the get-go, with data and process discipline. Achieve transparency (everyone knows what is happening), reproducibility (repeatable execution), predictability (standardized outputs, locations, servers, etc.), monitoring (notifications, etc.), and consumption interfaces (APIs).
Phase 2 (2-10 use cases): Generalize standards and processes by adding new use cases and evolving the compute and processes to scale. Also create reusable datasets, processes, and assets.
Phase 3 (10+ use cases): Separate out teams to focus on specific phases of the ML lifecycle. Design APIs, integration mechanisms, monitoring mechanisms, etc.

There is an active build-vs-buy debate across industries. For a long time there was a strong preference for build, especially on the infrastructure side. What organizations are learning over time is that:

The core value is in data ownership, good people, and end-to-end design. Organizations are therefore freely discussing their solution design with no fear of loss of competitive edge. They are using transparency to attract good talent.
Time is of the essence. Product development cycles are shrinking across the board. Organizations are stitching together complex solutions from available resources rather than waiting for the perfect product or approach.
Infrastructure is very important but also expensive and time consuming to build. It is the new database. Few organizations have the budgets of Uber and Google, so organizations are building less of it themselves over time.
Complex algorithms will not be easily built or bought. The algorithm that won the Netflix recommendation prize was not put into production due to ROI considerations. Simplicity and careful thinking are winning over complexity. New explainability requirements are also pushing organizations in this direction. Again, staff and modeling approach are critical to this.

Summary

Today, the best companies at every scale understand the need for the right people, processes, and mechanisms to reliably find ML use cases, build models, and use them in production every day.

A thought-through approach to the ML lifecycle (more time spent sharpening the axe than actually chopping wood) will allow organizations embarking on the ML journey to be that much more efficient, and to build serious value internally as well as for their end customers.

This is part of a larger document. We will share the rest in future articles.

Dr. Venkata Pingali is the Co-Founder & CEO of Scribble Data, an ML engineering product company. Scribble's flagship product, Enrich, implements MLOps for Data.
