Pitfalls In Enterprise ML Strategy
Every BI strategy presentation talks about machine learning and actionable insights. It looks magical and exciting on slides, but the ground reality is very different. For many organizations, ML is sliding from the peak of inflated expectations into disillusionment. In this article I will cover four main pitfalls that enterprises should avoid on the way to the target state: a data-driven organization.
Not Using ETL Tools
Roughly 60% of the effort in ML goes into data preparation. Python is an excellent language for data mining and exploration, and it should be used during the model-building phase. But should that same script be deployed to production? If your answer is "yes", be aware that this leads to a complex mesh of data pipelines that soon becomes unmanageable.
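To make this concrete, here is a minimal sketch of the kind of ad-hoc pandas preparation script that works well in a notebook but turns into unmanaged pipeline code once it is copied into production. The file names, column names, and cleanup rules below are purely hypothetical.

```python
# A minimal sketch of an ad-hoc exploration-phase prep script.
# File names, columns, and the business rules are hypothetical.
import pandas as pd

def prepare_training_data(orders_path: str, customers_path: str) -> pd.DataFrame:
    orders = pd.read_csv(orders_path, parse_dates=["order_date"])
    customers = pd.read_csv(customers_path)

    # Hard-coded cleanup rules: easy to write once, hard to govern at scale.
    orders = orders.dropna(subset=["customer_id", "amount"])
    orders["amount"] = orders["amount"].clip(lower=0)

    merged = orders.merge(customers, on="customer_id", how="left")
    merged["is_repeat"] = merged.groupby("customer_id")["order_id"].transform("count") > 1
    return merged

# In a notebook this is enough; in production the same logic also needs
# scheduling, lineage, data-quality checks, and monitoring -- exactly what
# governed ETL tooling provides.
```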
Without the right governance and ETL tooling, this creates technical debt. Organizations lose agility because they spend more and more time fixing data quality issues, which derails the focus on AI and ML.
Multiple Tools
I have seen many teams debating which tool to use for which use case, or which vendor product to buy. Vendor influence and the preferences of tech communities within the organization make things worse. There are countless libraries, languages, and vendor tools, all serving one goal: build models that help with predictions. Many times even experts fail to realize this. What is needed is the selection of one appropriate tool, or a framework that brings everything together, as sketched below.
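As an illustration of the second option, here is a minimal sketch, with purely hypothetical class names, of a thin framework layer that puts one interface in front of whatever library a team prefers:

```python
# A minimal sketch of a "bring it all together" framework: one thin interface
# so teams can swap libraries without rewriting downstream pipelines.
# Class and method names are illustrative, not a specific product.
from abc import ABC, abstractmethod
import numpy as np

class Model(ABC):
    @abstractmethod
    def fit(self, X: np.ndarray, y: np.ndarray) -> "Model": ...
    @abstractmethod
    def predict(self, X: np.ndarray) -> np.ndarray: ...

class SklearnModel(Model):
    """Wraps any scikit-learn style estimator behind the common interface."""
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)

# Any other library-specific model (XGBoost, a vendor SDK, etc.) gets the same
# kind of wrapper, so deployment and monitoring code sees one interface.
```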
Model Lifecycle Management
Building and training a model is the easy part; often it takes no more than 20 lines of code. The complex part is deploying the model, version-controlling it, and tracking its performance. Integrating models with REST APIs or deploying them as a scoring engine is not solved cleanly even by tech-savvy organizations. Too often, each new model goes through the same learning process, leading to a longer time to market. The previous point (multiple tools) makes this problem even more complex. The straightforward solution is to build a model deployment framework, which I will cover in my next article; the sketch below illustrates the gap.
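Here is a minimal sketch (the dataset, feature names, and endpoint path are hypothetical) showing how little code training takes compared with even a bare-bones REST scoring endpoint:

```python
# A minimal sketch of both halves of the point: training really can be a few
# lines, while serving the model over REST is where the real work starts.
# Dataset, column names, and the endpoint path are hypothetical.
import joblib
import pandas as pd
from flask import Flask, jsonify, request
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# --- the "easy" part: train and persist a model ---
df = pd.read_csv("training_data.csv")            # hypothetical prepared dataset
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
joblib.dump(model, "model_v1.joblib")            # naive "versioning" via file name

# --- the hard part begins here: a bare-bones scoring endpoint ---
app = Flask(__name__)
model = joblib.load("model_v1.joblib")

@app.route("/score", methods=["POST"])
def score():
    features = pd.DataFrame([request.get_json()])
    prediction = model.predict(features)[0]
    return jsonify({"prediction": str(prediction), "model_version": "v1"})

# Missing from this sketch: authentication, input validation, rollback,
# A/B routing, and drift monitoring -- the lifecycle concerns discussed above.
```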
Over Engineering
I have been part of many discussions where people debated how to train a model in distributed fashion, or whether to use an in-memory database to support the training process. More than 95% of ML use cases at most companies do not need this. If an organization is working with structured data, the final dataset for model training rarely grows beyond a few gigabytes. High-end problems like computer vision, image recognition, or audio tagging may need more resources, but otherwise the focus is diverted from solving the business requirement to an arbitrary technical one.
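As a rough illustration of why distributed training is usually unnecessary here, a hedged sketch (file and column names are hypothetical) of training on a few gigabytes of structured data on a single machine, with no cluster or in-memory database involved:

```python
# A minimal sketch: a few gigabytes of structured data fits in memory on one
# machine, and a standard library handles it without any distributed setup.
# File name and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier  # scikit-learn >= 1.0
from sklearn.model_selection import cross_val_score

df = pd.read_parquet("features.parquet")        # e.g. a ~2-3 GB prepared table
X, y = df.drop(columns=["target"]), df["target"]  # assumes numeric features

clf = HistGradientBoostingClassifier()
scores = cross_val_score(clf, X, y, cv=5)       # trains comfortably in memory
print(f"CV accuracy: {scores.mean():.3f}")
```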