
Cross-Industry Standard Process for Data Mining (CRISP-DM)

CRISP-DM is a proven process for carrying out data mining. According to Wikipedia, it originated in 1996 and became a European Union project under the ESPRIT funding initiative in 1997. I came across the topic while reading the book “Data Science for Business”, and it really caught my attention. Let’s look at the stages and their outcomes.


[Diagram: the CRISP-DM process cycle. Kenneth Jensen / CC BY-SA (https://creativecommons.org/licenses/by-sa/3.0)]

As the diagram depicts, it is a cyclic process, and it can take many rounds of iteration before the outcome is finalized. Going through the full cycle without solving the intended business problem is not a failure. Often, in each cycle, the team learns new facts about the data or the business and generates new ideas for the next iteration.

Business Understanding


Projects rarely come with a pre-defined, well-understood data mining problem. The diagram has cycles within the main cycle, where teams go back and forth between the different stages of the data mining process, because the initial understanding may not be the best or the most complete one.

In this stage, business analysts formulate one or more data science problems from the business requirement. This leads to dividing the main problem into subproblems that involve building models for classification, regression, probability estimation and so on. The team should try to answer questions like “What exactly do we want to do? How exactly would we do it? What parts of this use scenario constitute possible data mining models?”. While working in subsequent stages, the team may come back to this stage and modify the use-case scenario to better reflect the business problem.

Data Understanding


The data the team collects will rarely match the business need exactly. Historical data may have been collected with a different intention in mind, it may come in a different format or size and with a lot of variety, and it may not be accurate enough for the current business problem.

Collecting data comes at a cost. Parts of it may be freely available, some may need to be purchased, and some may require separate projects or teams to be set up to collect it.

An important aspect of the data understanding phase is weighing the costs and benefits of each data source and deciding whether the investment is justified.

Data Preparation


The data available for further processing or model building will rarely be in a format that ML algorithms or tools understand. It needs to be converted and formatted to yield better results. Typical data preparation steps include converting data to a tabular format, filling in missing values or removing data points that contain them, and converting the data into a different data format. In addition to cleaning and formatting, the data often also needs to be normalized or scaled.
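
As a rough illustration, here is a minimal sketch of these preparation steps using pandas and scikit-learn. The file name and the columns ("age", "income", "churned") are made up for the example:

```python
# A minimal data-preparation sketch (pandas + scikit-learn).
# The file name and the columns "age", "income", "churned" are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customers.csv")  # load the raw data into a tabular format

# Fill missing numeric values with the column median,
# and drop rows where the target label itself is missing.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["churned"])

# Scale the numeric features to zero mean and unit variance.
scaler = StandardScaler()
df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])
```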

One important consideration is “leaks”. A leak occurs when a variable in the historical data gives information about the target label but will not be available at the time of decision making, typically because it is recorded or generated only after the target event has taken place. It is paramount to make sure that all the variables the model uses will be available during the decision-making process in the production setting.
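
One simple safeguard, sketched below with hypothetical column names and continuing the example above, is to keep an explicit list of fields known to be generated after the target event and drop them from the feature set:

```python
# A sketch of guarding against leaks: explicitly list columns that are
# only generated AFTER the target event and exclude them from the
# features. The column names here are hypothetical.
LEAKY_COLUMNS = ["refund_issued", "account_closed_date"]  # recorded after the event

X = df.drop(columns=["churned"] + LEAKY_COLUMNS)  # only decision-time features
y = df["churned"]                                 # the target label
```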

Modelling


This is a relatively small and easy phase of the data mining process compared to the other stages. In this stage, data scientists build a predictive model with the data produced by the “data preparation” stage. There are a number of tools and frameworks that take in the data and labels and predict the outcome for unseen inputs in the case of supervised learning, or transform the data in the case of unsupervised learning. Modelling is a deep subject in its own right, involving a lot of statistics, probability and calculus. From a high-level business-process point of view there is not much to add; this stage relies mostly on data science expertise.
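
As a minimal sketch of what this stage can look like in scikit-learn, continuing the hypothetical churn example from the previous stages:

```python
# A minimal supervised-learning sketch with scikit-learn, continuing
# the hypothetical churn example from the data preparation stage.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                # learn patterns from labelled data
proba = model.predict_proba(X_test)[:, 1]  # probability estimates for unseen inputs
```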

Evaluation


This stage involves testing the result of the data mining model generated in the previous step. The process is rigorous and demands plenty of scrutiny. We need to have confidence that the patterns emerging from the model are truly regular patterns and not “once in a while” random occurrences.

We should also make sure that the generated model serves the original business goal. Often the model is evaluated in a controlled lab setting, with tests designed to simulate the real world as closely as possible. Even when the results pass scrutiny, external factors may still make the model impractical. In fraud detection, for example, a model might be 99% accurate in the lab but trigger a lot of false positives in real life, and those false positives could lead to huge economic losses and customer dissatisfaction.
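
Continuing the same hypothetical example, a minimal sketch of this kind of scrutiny might use cross-validation to check that performance is stable and a confusion matrix to count the false positives:

```python
# An evaluation sketch: cross-validation checks that performance is
# stable rather than a one-off fluke, and the confusion matrix counts
# the false positives that matter in a fraud- or churn-style setting.
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

y_pred = model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("False positives on the test set:", fp)  # each one may annoy a customer
```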

In an industry setting, a lot of assessments and evaluations are done at different levels of management. Before a model goes into production it needs to satisfy different stakeholders, who have to understand how the model works. If the model is too complicated or incomprehensible to stakeholders, the team might not get the “sign-off” from management to deploy it to production.

Deployment


Most of the time the “data mining system” itself is deployed rather than just the models it produces, and steps like collecting, processing and formatting the data are added to the deployment pipeline. As the author says, the reasons for “deploying the data mining system itself rather than the models produced by a data mining system are (i) the world may change faster than the data science team can adapt, as with fraud and intrusion detection, and (ii) a business has too many modelling tasks for their data science team to manually curate each model individually”.

Deploying the model in production includes re-coding it to meet production quality standards and to match the speed of production systems. Data scientists usually build models as working prototypes, which may not be in the best shape to be deployed in production. Handing the final model over to software engineers to rebuild at production quality is risky. It is helpful to remember: “Your model is not what the data scientists design, it’s what the engineers build”. It is therefore important to involve development team members early in the data science project. Initially, they can act as advisors to the data science team about the production systems; as the project progresses past deployment, these engineers take ownership of the product.
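
One common hand-off mechanism, sketched here under the same hypothetical example, is to persist the fitted model as an artifact that the production service loads:

```python
# A sketch of the hand-off: persist the fitted model so that engineers
# can load the exact same artifact inside the production service.
# The file name is hypothetical.
import joblib

joblib.dump(model, "churn_model_v1.joblib")    # data science side

loaded = joblib.load("churn_model_v1.joblib")  # production side
print(loaded.predict(X_test[:1]))              # same predictions as the prototype
```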

Final thoughts

Regardless of whether the deployment was successful, the process leads to a greater understanding of the business problem and of the engineering solutions. Subsequent iterations may lead to better outcomes. Just the act of thinking about the business need, the data and the performance goals can give rise to new business ideas.

This blog was inspired by the book “Data Science for Business”. It is a great read with lots of case scenarios, tips and examples.

Let me know your thoughts about the whole data mining development cycle; I look forward to hearing from you.
