ML Models: From Jupyter Notebook to a Database
(Image credit: kdnuggets.com)
A machine learning model sitting in a Jupyter notebook is not the end of a data science project. Taking the model to production means refactoring the code, versioning it, writing unit tests, logging hyperparameters and model metrics, packaging the model and its pipeline, setting up a CI/CD pipeline for automation, containerizing, registering the model, creating an endpoint for inference, monitoring the model for drift, and retraining it to keep it relevant. None of this is easy.
As the industry matures, new tools are emerging to ease deployment and inference; MLflow is one such tool. Recently, I came across a research paper that tries to ease model inference along with data governance. The idea is futuristic, strange, and radical: Project Raven by Microsoft.
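To make the tracking part concrete, here is a minimal MLflow sketch. The dataset, model, and hyperparameters are toy choices purely for illustration; it simply shows how hyperparameters, metrics, and the packaged model get logged in one run:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data purely for illustration
X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

    # Track hyperparameters and metrics so runs are reproducible and comparable
    mlflow.log_params(params)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))

    # Package the model artifact so it can be registered and served later
    mlflow.sklearn.log_model(model, "model")
```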
In the future, machine learning will be ubiquitous. As it is adopted by enterprises big and small, concerns over data privacy, governance, and security will become very stringent. A data scientist downloading sensitive patient data to a laptop to predict a treatment won't cut it. How will Project Raven help?
Let's talk about the RDBMS before jumping into Project Raven.
The RDBMS, with over 30 years of development under its belt, is very mature. It provides robust security, governance, auditing, and data provenance for the enterprise. Project Raven's radical idea is to store data preparation pipelines, along with machine learning models, inside the database.
Storing pipelines and machine learning models in the database helps with both inference and data governance, but combining database technology with machine learning inference is not trivial. Databases are built on relational algebra (set theory); machine learning is built on linear algebra. In the research paper, Microsoft analyzed more than 4 million open GitHub repositories and found that 83% of machine learning algorithms can be expressed using linear algebra. Raven therefore has to handle relational algebra, linear algebra, and user-defined functions (for the remaining algorithms that do not reduce to linear algebra) to serve machine learning inference from a database.
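To see what bridging the two algebras can look like, here is a hypothetical sketch (not Raven's actual API): a trained scikit-learn linear model is compiled into a plain SQL expression, so the database can score rows with its own set-oriented operators instead of calling out to Python. The table and column names are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data; the feature names are invented for this example
X = np.array([[35, 50_000], [22, 18_000], [51, 90_000], [40, 30_000]])
y = np.array([1, 0, 1, 0])
columns = ["age", "income"]

model = LogisticRegression().fit(X, y)

def linear_model_to_sql(model, columns, table):
    """Compile w.x + b into a SQL scoring expression: a simplified
    version of the linear-to-relational translation the paper describes."""
    weights = model.coef_[0]
    terms = " + ".join(f"{w:.6f} * {c}" for w, c in zip(weights, columns))
    score = f"{terms} + {model.intercept_[0]:.6f}"
    # Sigmoid of the linear score gives the approval probability
    return f"SELECT 1.0 / (1.0 + EXP(-({score}))) AS approval_prob FROM {table}"

print(linear_model_to_sql(model, columns, "loan_applications"))
```

Once the model is an ordinary SQL expression, the database's optimizer, access control, and auditing apply to inference for free, which is precisely the appeal of the approach.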
A bird's-eye view of how Raven works: it consists of a SQL provenance module, a catalogue, and a Python provenance module, with the catalogue acting as the bridge between the two provenance modules. Suppose a data analyst fires a SQL query to decide whether a loan should be given to a customer. The SQL provenance module parses the query to determine which tables and columns are required and passes that information to the catalogue. The Python provenance module parses the accompanying Python script, using its knowledge base, to work out which transformations are performed and how the model was trained. It then combines this with the information in the catalogue to produce the inference, i.e. whether the loan should be approved. The authors also observed improved inference latency when pipelines and models are stored in the database.
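As a rough analogy for what the Python provenance side extracts, the sketch below introspects a stored scikit-learn pipeline to enumerate its transformations and final model. The paper describes a static analysis over scripts; this runtime introspection is only an illustrative stand-in:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A stored pipeline: data preparation steps plus the model
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

def describe_pipeline(pipeline):
    """List transformations and the final estimator: roughly the facts a
    provenance module needs to match model inputs to catalogue columns."""
    *transforms, (model_name, model_step) = pipeline.steps
    for name, step in transforms:
        print(f"transformation: {name} -> {type(step).__name__}")
    print(f"model: {model_name} -> {type(model_step).__name__}")

describe_pipeline(pipeline)
```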
We already know that databases provide enterprise-level security, access control, auditing, and governance, and cloud-based databases add horizontal scaling. Storing transformation pipelines and models in the DBMS as well extends these benefits to the machine learning arena.
Microsoft aptly named the research paper "Cloudy with high chance of DBMS": cloudy because most machine learning training is going to happen in the cloud, and with high probability the trained model will be saved in a database management system. Raven is a baby step in the right direction. Here's hoping we get there fast. Godspeed!
Links:
Cloudy with high chance of DBMS: a 10-year prediction for enterprise-grade ML
https://cidrdb.org/cidr2020/papers/p24-karanasos-cidr20.pdf
Continuous Delivery for Machine Learning (CD4ML)
https://martinfowler.com/articles/cd4ml.html