ML Models: From Jupyter Notebook to a Database
(Image credit: kdnuggets.com)
A machine learning model sitting in a Jupyter notebook is not the end of a data science project. Taking the model to production means refactoring the code, versioning it, writing unit tests, logging hyperparameters and model metrics, packaging the model and its pipeline, setting up a CI/CD pipeline for automation, containerizing, registering the model, creating an endpoint for inference, monitoring the model for drift, and retraining it to keep it relevant. None of this is easy.
As the industry matures, new tools are emerging to ease deployment and inference; MLflow is one such tool. Recently, I came across a research paper that tries to ease model inference along with data governance. The idea is futuristic, strange, and radical: Project Raven by Microsoft.
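To make the tracking part concrete, here is a minimal MLflow sketch. The dataset, model, and hyperparameters are toy choices purely for illustration; it simply shows how hyperparameters, metrics, and the packaged model get logged in one run:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data purely for illustration
X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

    # Track hyperparameters and metrics so runs are reproducible and comparable
    mlflow.log_params(params)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))

    # Package the model artifact so it can be registered and served later
    mlflow.sklearn.log_model(model, "model")
```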
In the future, machine learning will be ubiquitous. As it is adopted by enterprises big and small, concerns over data privacy, governance, and security will become very stringent. A data scientist downloading sensitive patient data to a laptop to predict a treatment won't cut it. How will Project Raven help?
Let's talk about the RDBMS before jumping into Project Raven.
The RDBMS, with over 30 years of development under its belt, is very mature. It provides robust security, governance, auditing, and data provenance for the enterprise. Project Raven's radical idea is to store data preparation pipelines, along with machine learning models, inside the database.
Storing pipelines and machine learning models in the database helps with both inference and data governance, but combining database technology with machine learning inference is not trivial. Databases are built on relational algebra (set theory); machine learning is built on linear algebra. In the research paper, Microsoft analyzed more than 4 million open GitHub repositories and found that 83% of machine learning algorithms can be expressed using linear algebra. Raven therefore has to handle relational algebra, linear algebra, and user-defined functions (for the remaining algorithms that do not reduce to linear algebra) to serve machine learning inference from a database.
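To see what bridging the two algebras can look like, here is a hypothetical sketch (not Raven's actual API): a trained scikit-learn linear model is compiled into a plain SQL expression, so the database can score rows with its own set-oriented operators instead of calling out to Python. The table and column names are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data; the feature names are invented for this example
X = np.array([[35, 50_000], [22, 18_000], [51, 90_000], [40, 30_000]])
y = np.array([1, 0, 1, 0])
columns = ["age", "income"]

model = LogisticRegression().fit(X, y)

def linear_model_to_sql(model, columns, table):
    """Compile w.x + b into a SQL scoring expression: a simplified
    version of the linear-to-relational translation the paper describes."""
    weights = model.coef_[0]
    terms = " + ".join(f"{w:.6f} * {c}" for w, c in zip(weights, columns))
    score = f"{terms} + {model.intercept_[0]:.6f}"
    # Sigmoid of the linear score gives the approval probability
    return f"SELECT 1.0 / (1.0 + EXP(-({score}))) AS approval_prob FROM {table}"

print(linear_model_to_sql(model, columns, "loan_applications"))
```

Once the model is an ordinary SQL expression, the database's optimizer, access control, and auditing apply to inference for free, which is precisely the appeal of the approach.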
A bird's-eye view of how Raven works: it consists of a SQL provenance module, a catalogue, and a Python provenance module, with the catalogue acting as the bridge between the two provenance modules. Suppose a data analyst fires a SQL query to decide whether a loan should be given to a customer. The SQL provenance module parses the query to determine which tables and columns are required and passes that information to the catalogue. The Python provenance module parses the accompanying Python script, using its knowledge base, to work out which transformations are performed and how the model was trained. It then combines this with the information in the catalogue to produce the inference, i.e. whether the loan should be approved. The authors also observed improved inference latency when pipelines and models are stored in the database.
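As a rough analogy for what the Python provenance side extracts, the sketch below introspects a stored scikit-learn pipeline to enumerate its transformations and final model. The paper describes a static analysis over scripts; this runtime introspection is only an illustrative stand-in:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A stored pipeline: data preparation steps plus the model
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

def describe_pipeline(pipeline):
    """List transformations and the final estimator: roughly the facts a
    provenance module needs to match model inputs to catalogue columns."""
    *transforms, (model_name, model_step) = pipeline.steps
    for name, step in transforms:
        print(f"transformation: {name} -> {type(step).__name__}")
    print(f"model: {model_name} -> {type(model_step).__name__}")

describe_pipeline(pipeline)
```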
We already know that databases provide enterprise-level security, access control, auditing, and governance, and cloud-based databases add horizontal scaling. Storing transformation pipelines and models in the DBMS as well extends these benefits to the machine learning arena.
Microsoft aptly named the research paper "Cloudy with high chance of DBMS": cloudy because most machine learning training is going to happen in the cloud, and with high probability the trained model will be saved in a database management system. Raven is a baby step in the right direction. Here's hoping we get there fast. Godspeed!
Links:
Cloudy with high chance of DBMS: a 10-year prediction for enterprise-grade ML
https://cidrdb.org/cidr2020/papers/p24-karanasos-cidr20.pdf
Continuous Delivery for Machine Learning (CD4ML)
https://martinfowler.com/articles/cd4ml.html