
Cross-Industry Standard Process for Data Mining (CRISP-DM)

CRISP-DM is a proven process for carrying out data mining. According to Wikipedia, it originated in 1996 and became a European Union project under the ESPRIT funding initiative in 1997. I came across the topic while reading the book “Data Science for Business”, and it really caught my attention. Let’s look at the stages and their outcomes.


[Diagram: the CRISP-DM process cycle. Kenneth Jensen / CC BY-SA (https://creativecommons.org/licenses/by-sa/3.0)]

As the diagram depicts, it is a cyclic process, and it can take many rounds of iteration before the outcome is finalized. Going through the full cycle without solving the intended business problem is not a failure. Often, in each cycle, the team learns new facts about the data or the business and generates new ideas for the next iteration.

Business Understanding


Projects rarely come with a pre-defined, well-understood data mining problem. The diagram has cycles within the main cycle, where teams go back and forth between the different stages of the data mining process, because the initial understanding may not be the best or the most complete one.

In this stage, business analysts formulate one or more data science problems from the business requirement. This leads to dividing the main problem into subproblems that involve building models for classification, regression, probability estimation and so on. The team should try to answer questions like “What exactly do we want to do? How exactly would we do it? What parts of this use scenario constitute possible data mining models?”. While working in subsequent stages, the team may come back to this stage and modify the use-case scenario to better reflect the business problem.

Data Understanding


The data the team collects will rarely match the business need exactly. Historical data may have been collected with a different intention in mind, it may come in a different format or size and with a lot of variety, and it may not be accurate enough for the current business problem.

Collecting data comes at a cost. Parts of it may be freely available, some may need to be purchased, and some may require separate projects or teams to be set up to collect it.

An important aspect of the data understanding phase is weighing the costs and benefits of each data source and deciding whether the investment is justified.

Data Preparation


The data available for further processing or model building will rarely be in a format that ML algorithms or tools understand. It needs to be converted and formatted to yield better results. Typical data preparation steps include converting data to a tabular format, filling in missing values or removing data points that contain them, and converting the data into a different data format. In addition to cleaning and formatting, the data often also needs to be normalized or scaled.
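
As a rough illustration, here is a minimal sketch of these preparation steps using pandas and scikit-learn. The file name and the columns ("age", "income", "churned") are made up for the example:

```python
# A minimal data-preparation sketch (pandas + scikit-learn).
# The file name and the columns "age", "income", "churned" are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customers.csv")  # load the raw data into a tabular format

# Fill missing numeric values with the column median,
# and drop rows where the target label itself is missing.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["churned"])

# Scale the numeric features to zero mean and unit variance.
scaler = StandardScaler()
df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])
```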

One important consideration is “leaks”. A leak occurs when a variable in the historical data gives information about the target label but will not be available at the time of decision making, typically because it is recorded or generated only after the target event has taken place. It is paramount to make sure that all the variables the model uses will be available during the decision-making process in the production setting.
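
One simple safeguard, sketched below with hypothetical column names and continuing the example above, is to keep an explicit list of fields known to be generated after the target event and drop them from the feature set:

```python
# A sketch of guarding against leaks: explicitly list columns that are
# only generated AFTER the target event and exclude them from the
# features. The column names here are hypothetical.
LEAKY_COLUMNS = ["refund_issued", "account_closed_date"]  # recorded after the event

X = df.drop(columns=["churned"] + LEAKY_COLUMNS)  # only decision-time features
y = df["churned"]                                 # the target label
```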

Modelling


This is a relatively small and easy phase of the data mining process compared to the other stages. In this stage, data scientists build a predictive model with the data produced by the “data preparation” stage. There are a number of tools and frameworks that take in the data and labels and predict the outcome for unseen inputs in the case of supervised learning, or transform the data in the case of unsupervised learning. Modelling is a deep subject in its own right, involving a lot of statistics, probability and calculus. From a high-level business-process point of view there is not much to add; this stage relies mostly on data science expertise.
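
As a minimal sketch of what this stage can look like in scikit-learn, continuing the hypothetical churn example from the previous stages:

```python
# A minimal supervised-learning sketch with scikit-learn, continuing
# the hypothetical churn example from the data preparation stage.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                # learn patterns from labelled data
proba = model.predict_proba(X_test)[:, 1]  # probability estimates for unseen inputs
```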

Evaluation


This stage involves testing the result of the data mining model generated in the previous step. The process is rigorous and demands plenty of scrutiny. We need to have confidence that the patterns emerging from the model are truly regular patterns and not “once in a while” random occurrences.

We should also make sure that the generated model serves the original business goal. Often the model is evaluated in a controlled lab setting, with tests designed to simulate the real world as closely as possible. Even when the results pass scrutiny, external factors may still make the model impractical. In fraud detection, for example, a model might be 99% accurate in the lab but trigger a lot of false positives in real life, and those false positives could lead to huge economic losses and customer dissatisfaction.
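
Continuing the same hypothetical example, a minimal sketch of this kind of scrutiny might use cross-validation to check that performance is stable and a confusion matrix to count the false positives:

```python
# An evaluation sketch: cross-validation checks that performance is
# stable rather than a one-off fluke, and the confusion matrix counts
# the false positives that matter in a fraud- or churn-style setting.
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

y_pred = model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("False positives on the test set:", fp)  # each one may annoy a customer
```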

In an industry setting, a lot of assessments and evaluations are done at different levels of management. Before a model goes into production it needs to satisfy different stakeholders, who have to understand how the model works. If the model is too complicated or incomprehensible to stakeholders, the team might not get the “sign-off” from management to deploy it to production.

Deployment


Most of the time the “data mining system” itself is deployed rather than just the models it produces, and steps like collecting, processing and formatting the data are added to the deployment pipeline. As the author says, the reasons for “deploying the data mining system itself rather than the models produced by a data mining system are (i) the world may change faster than the data science team can adapt, as with fraud and intrusion detection, and (ii) a business has too many modelling tasks for their data science team to manually curate each model individually”.

Deploying the model in production includes re-coding it to meet production quality standards and to match the speed of production systems. Data scientists usually build models as working prototypes, which may not be in the best shape to be deployed in production. Handing the final model over to software engineers to rebuild at production quality is risky. It is helpful to remember: “Your model is not what the data scientists design, it’s what the engineers build”. It is therefore important to involve development team members early in the data science project. Initially, they can act as advisors to the data science team about the production systems; as the project progresses past deployment, these engineers take ownership of the product.
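
One common hand-off mechanism, sketched here under the same hypothetical example, is to persist the fitted model as an artifact that the production service loads:

```python
# A sketch of the hand-off: persist the fitted model so that engineers
# can load the exact same artifact inside the production service.
# The file name is hypothetical.
import joblib

joblib.dump(model, "churn_model_v1.joblib")    # data science side

loaded = joblib.load("churn_model_v1.joblib")  # production side
print(loaded.predict(X_test[:1]))              # same predictions as the prototype
```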

Final thoughts

Regardless of whether the deployment was successful, the process leads to a greater understanding of the business problem and of the engineering solutions. Subsequent iterations may lead to better outcomes. Just the act of thinking about the business need, the data and the performance goals can give rise to new business ideas.

This blog was inspired by the book “Data Science for Business”. It is a great read with lots of case scenarios, tips and examples.

Let me know your thoughts about the whole data mining development cycle; I look forward to hearing from you.
