Introducing Evidently 0.0.1 Release: Open-Source Tool To Analyze Data Drift

We are excited to announce our first release. You can now use the Evidently open-source Python package to estimate and explore data drift for machine learning models.

It helps you quickly understand: did my data change, and if so, where?

How does it work?

As an interactive report right in your Jupyter notebook.

You need to prepare two datasets. One is the reference: we will use it as the baseline for comparison. Pick something you consider a good example, a period where your model performed reliably. It can be your training data, or production data from some past period.

The second dataset is the most recent production data you want to evaluate.

Import your data as a Pandas DataFrame. You can have two data frames, or a single one where you explicitly select which rows belong to the reference data and which to the production data.
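For illustration, here is a minimal sketch of this step. The file names and the date column are hypothetical placeholders for your own data sources:

    import pandas as pd

    # Hypothetical file names, for illustration only.
    reference_data = pd.read_csv("reference.csv")    # e.g., training data or a stable past period
    production_data = pd.read_csv("production.csv")  # the recent data you want to evaluate

    # Or take a single DataFrame and split it explicitly, e.g., by date:
    # data = pd.read_csv("all_data.csv", parse_dates=["date"])
    # reference_data = data[data["date"] < "2020-11-01"]
    # production_data = data[data["date"] >= "2020-11-01"]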

Then, you can use Evidently to generate an interactive report like the one shown below.
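Here is a minimal sketch of the generation step. The Dashboard and DriftTab names and the show/save methods follow the project's early examples, so check the docs if your version differs:

    from evidently.dashboard import Dashboard
    from evidently.tabs import DriftTab

    # Build the drift report from the reference and production DataFrames.
    drift_report = Dashboard(reference_data, production_data, tabs=[DriftTab])

    # Render the interactive report inline in the notebook...
    drift_report.show()

    # ...or save it as a standalone HTML file to share around.
    drift_report.save("data_drift_report.html")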

Data drift table

We show the drifting features first, sorted by P-value. Using a statistical test, we make a drift/no-drift decision for each feature individually.

You might want to explore them all or look into your key drivers.

By clicking on each feature, you can explore the values mapped in a plot. The green area covers one standard deviation from the mean, as seen in the reference dataset.

Data drift plot

Or, you can zoom in on the distributions to understand what has changed:

Data distribution plot

Why is it important?

We wrote a whole blog post about Data and Concept Drift. In short, things change, and this can break your models. Detecting drift is key to maintaining good performance.

If there are data quality issues, our tool will also pick them up. When your data goes missing or features break, this usually shows up in the data distributions. We will soon add more fun reports to explore features and analyze data quality. But this one can already serve as a proxy.

What is cool about it?

We implemented the statistical tests for you, so you don't need to think them through. We know these are quite cumbersome to write, and it is easy to mess them up. Solved.

We use the two-sample Kolmogorov-Smirnov test for numerical features and the chi-squared test for categorical features, both at a 0.95 confidence level. We will add some levers later on, but this is a good enough default approach.
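For intuition, here is a rough sketch of that decision logic using SciPy. This is an illustration of the approach, not Evidently's actual implementation, and the categorical branch assumes every category appears in the reference data:

    import numpy as np
    from scipy import stats

    def feature_drift_detected(reference, production, alpha=0.05, categorical=False):
        # True if the test rejects "same distribution" at the 0.95
        # confidence level, i.e., if the p-value falls below alpha = 0.05.
        if categorical:
            # Chi-squared test: compare observed production category counts
            # with the counts expected under the reference distribution.
            categories = np.unique(np.concatenate([reference, production]))
            ref_freq = np.array([np.mean(reference == c) for c in categories])
            observed = np.array([np.sum(production == c) for c in categories])
            expected = ref_freq * len(production)
            _, p_value = stats.chisquare(observed, expected)
        else:
            # Two-sample Kolmogorov-Smirnov test for numerical features.
            _, p_value = stats.ks_2samp(reference, production)
        return p_value < alpha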

The visuals are helpful, and would otherwise take considerable time to code in Plotly or Matplotlib. Here, each feature gets an interactive plot you can explore to understand its behavior.

What's more, you can share the report around as a single .html file. If you have ever had a back-and-forth exchange of screenshots with another department, you will like this one:

An email to the marketing department with the report attached

Finally, it is dead simple to install and use. No new tool to learn, no service to maintain. Just open your notebook and try it out!

When should I use it?

Of course, when your model is in production. But also before.

Here are a few ideas on how you can use the data drift tool:

  • Support your model maintenance. Understand when it is time to retrain your model, or which features to drop when they are too volatile.
  • Before acting on model predictions. Validate that your input data is from the same distribution, and you are not feeding anything outrageously different into your model.
  • When debugging model decay. If the model quality dropped, use the tool to explore where the change comes from.
  • In A/B tests or trial use. Detect training-serving skew and get better context to interpret test results.
  • Before deployment. Understand drift in the offline environment. Explore past shifts in the data to define your future retraining needs and monitoring strategy.
  • To find useful features when building a model. Get creative: you can also use the tool to compare feature distributions in your positive and negative class. This will quickly surface the best discriminants, as shown in the sketch after this list.
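A minimal sketch of that last idea, assuming a DataFrame df with a binary target column (both names are hypothetical) and the same early Dashboard API as above:

    import pandas as pd
    from evidently.dashboard import Dashboard
    from evidently.tabs import DriftTab

    # Hypothetical training data with a binary "target" label column.
    df = pd.read_csv("training_data.csv")
    positive_class = df[df["target"] == 1].drop(columns=["target"])
    negative_class = df[df["target"] == 0].drop(columns=["target"])

    # Treat one class as "reference" and the other as "production":
    # features flagged as drifting differ most between the two classes.
    class_report = Dashboard(negative_class, positive_class, tabs=[DriftTab])
    class_report.show()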

How can I try it?

Go to GitHub, read the docs, and explore the tool in action using the sample notebooks. We have demos with the eternal Iris dataset, the Boston housing dataset, and the breast cancer dataset.

---

If you have any questions or thoughts, write to us at [email protected]. This is an early release, so send any bugs that way, or open an issue on GitHub.

Want to stay in the loop?

  • Sign up to receive our news and product updates.
  • Follow us on Twitter and LinkedIn for more content on production machine learning.
