CRISP-DM: A Comprehensive Framework for Data Science Projects
Hossein Habibinejad
Senior Business Financial Analyst | Data Analytics & BI | Business Modeling & Strategy | Process & Operations Management | Seeking an Internship in Data Analytics, IT & Finance
CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. It was developed in 1999 by a consortium of industry and academic partners and has since become the de facto standard process model for data mining, analytics, and data science projects.
CRISP-DM consists of six sequential phases that cover the entire data science life cycle:
1. Business Understanding: This phase involves understanding the business problem, the objectives, and the success criteria of the project. It also involves assessing the current situation, identifying the stakeholders, and defining the data mining goals.
2. Data Understanding: This phase involves collecting, describing, exploring, and verifying the quality of the data that is available or needed for the project. It also involves identifying any data issues or gaps that need to be addressed.
3. Data Preparation: This phase involves transforming, cleaning, integrating, and formatting the data for modeling. It also involves selecting, constructing, and reducing the features or variables that will be used in the analysis.
4. Modeling: This phase involves applying various modeling techniques to the prepared data, such as regression, classification, clustering, or association analysis. It also involves selecting the appropriate parameters, methods, and tools for each technique.
5. Evaluation: This phase involves assessing the performance and validity of the models against the data mining goals and the business objectives. It also involves comparing and selecting the best model or models for deployment.
6. Deployment: This phase involves deploying the selected model or models into the operational environment, where they can be used to generate insights or predictions for the business. It also involves monitoring and maintaining the models over time.
Each phase of CRISP-DM consists of several tasks that describe the specific activities and outputs of the process. The tasks are not fixed or prescriptive, but rather flexible and adaptable to different situations and needs. The CRISP-DM model also allows for iteration and feedback between phases, as new insights or challenges may arise during the project.
To illustrate how CRISP-DM can be applied in practice, let's look at some examples of data science projects that have used this process model.
## Example 1: Predicting Customer Churn for a Telecom Company
Customer churn is a common problem for many businesses, especially in competitive industries like telecommunications. Churn refers to the loss of customers who switch to another provider or stop using a service. Predicting customer churn can help businesses retain their customers, increase their revenue, and improve their customer satisfaction.
A telecom company wanted to use data science to predict customer churn and identify the factors that influence it. They followed the CRISP-DM process model to conduct their project:
Business Understanding: The company first defined its business problem as reducing customer churn and increasing customer loyalty. They then identified their stakeholders as the marketing and customer service departments, who would use the results of the analysis to design targeted campaigns and interventions. They defined their data mining goal as building a predictive model that classifies customers as churners or non-churners based on their demographic and behavioral data, and their success criteria as achieving high accuracy, precision, recall, and ROC-AUC scores.
Data Understanding: The company collected data from various sources, such as customer records, billing information, service usage, customer feedback, and market research, and described its data by examining its structure, size, format, and type. They explored their data by performing descriptive statistics, visualizations, and correlations. They verified their data quality by checking for missing values, outliers, duplicates, inconsistencies, and errors.
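As a rough illustration, these description, exploration, and quality checks map naturally onto a few lines of pandas. The file name and columns below are hypothetical, since the article does not share the company's actual dataset:

```python
import pandas as pd

# Hypothetical file; the article does not share the actual dataset
df = pd.read_csv("telecom_customers.csv")

# Describe the data: structure, size, and types
print(df.shape)
print(df.dtypes)
print(df.describe())

# Explore: correlations between numeric variables
print(df.select_dtypes("number").corr())

# Verify quality: missing values and duplicate rows
print(df.isna().sum())
print(df.duplicated().sum())
```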
Data Preparation: The company prepared their data for modeling by performing several steps (a minimal code sketch follows the list):
- They cleaned their data by imputing missing values, removing outliers and duplicates, correcting errors, and standardizing formats.
- They integrated their data by joining different tables on common keys, such as customer ID or phone number.
- They formatted their data by converting categorical variables into dummy variables, scaling numerical variables into a common range, and splitting the data into training and testing sets.
- They selected and constructed features by applying feature engineering techniques, such as creating new variables from existing ones, aggregating them into summary statistics, and applying domain knowledge.
- They reduced their features by applying feature selection techniques, such as filtering based on variance or correlation, or using wrapper or embedded methods.
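A minimal sketch of these preparation steps with pandas and scikit-learn might look like the following; the file name and the `churn` label column are assumptions, not details from the article:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("telecom_customers.csv")  # hypothetical file

# Clean: impute missing numeric values with the median, drop duplicates
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
df = df.drop_duplicates()

# Format: one-hot encode categorical variables ('churn' is a hypothetical label column)
X = pd.get_dummies(df.drop(columns=["churn"]), drop_first=True)
y = df["churn"]

# Split before scaling so no test information leaks into the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale the features into a common range, fitting on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```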
Modeling: The company applied various modeling techniques to their prepared data, such as logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. They selected the appropriate parameters, methods, and tools for each technique, such as regularization, pruning, bagging, boosting, kernel, and activation functions. They also used cross-validation and grid search to optimize their hyperparameters and avoid overfitting or underfitting.
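Continuing the sketch above, hyperparameter tuning via cross-validation and grid search could be done with scikit-learn's GridSearchCV; the logistic regression model and grid values here are illustrative choices, not the company's actual settings:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Illustrative grid: tune the regularization strength C with 5-fold cross-validation
param_grid = {"C": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="roc_auc",
)
grid.fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)
model = grid.best_estimator_
```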
Evaluation: The company evaluated the performance and validity of its models against its data mining goals and business objectives. They used various metrics, such as accuracy, precision, recall, F1-score, ROC-AUC, and confusion matrix, to compare and rank their models. They also used various methods, such as learning curves, residual plots, and feature importance, to assess the robustness and interpretability of their models. They selected the best model or models for deployment based on their evaluation results.
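Assuming the churn label is encoded as 0/1 and reusing `model`, `X_test`, and `y_test` from the sketches above, the metrics could be computed like this:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the churn class

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_prob))
print(confusion_matrix(y_test, y_pred))
```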
Deployment: The company deployed its selected model or models into the operational environment, where they could be used to predict customer churn and identify the factors that influence it. They developed a dashboard and a report to present their findings and recommendations to the stakeholders, and they monitored and maintained their models over time by updating them with new data and re-evaluating their performance.
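The article does not say how the model was operationalized; one minimal, common pattern is to persist the fitted artifacts with joblib and reload them in the scoring environment. `new_customers` below is a hypothetical DataFrame of fresh data prepared the same way as the training set:

```python
import joblib

# Persist the fitted preprocessing and model artifacts
joblib.dump(scaler, "churn_scaler.joblib")
joblib.dump(model, "churn_model.joblib")

# In the scoring environment: reload the artifacts and score new customers.
# `new_customers` is a hypothetical DataFrame prepared like the training data.
scaler = joblib.load("churn_scaler.joblib")
model = joblib.load("churn_model.joblib")
churn_probability = model.predict_proba(scaler.transform(new_customers))[:, 1]
```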
## Example 2: Analyzing Sentiment of Movie Reviews
Sentiment analysis is a popular application of natural language processing (NLP) that involves extracting the emotional tone or attitude of a text. Sentiment analysis can help businesses understand their customers' opinions, preferences, and feedback. It can also help researchers study the social and psychological aspects of human communication.
A movie review website wanted to use data science to analyze the sentiment of movie reviews posted by their users. They followed the CRISP-DM process model to conduct their project:
Business Understanding: The website defined its business problem as enhancing its user experience and increasing its user engagement. They identified their stakeholders as the website owners, developers, and users, who would benefit from the results of the analysis. They defined their data mining goals as building a sentiment analysis model that can classify movie reviews into positive or negative based on their text. They also defined their success criteria as achieving a high accuracy, precision, recall, and F1 score for their model.
Data Understanding: The website collected data from their own database, which contained thousands of movie reviews posted by their users. They described their data by examining its structure, size, format, and type. They explored their data by performing descriptive statistics, visualizations, and word clouds. They verified their data quality by checking for missing values, duplicates, spam, and offensive language.
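A rough pandas sketch of these text-specific checks, assuming a hypothetical `movie_reviews.csv` with `text` and `label` columns:

```python
import pandas as pd
from collections import Counter

# Hypothetical file with 'text' and 'label' columns
reviews = pd.read_csv("movie_reviews.csv")

# Describe: size and types
print(reviews.shape)
print(reviews.dtypes)

# Explore: review length distribution and most frequent words
reviews["n_words"] = reviews["text"].str.split().str.len()
print(reviews["n_words"].describe())
word_counts = Counter(" ".join(reviews["text"].dropna().str.lower()).split())
print(word_counts.most_common(20))

# Verify quality: missing values and duplicate reviews
print(reviews["text"].isna().sum())
print(reviews.duplicated(subset="text").sum())
```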
Data Preparation: The website prepared their data for modeling by performing several steps (a vectorization sketch follows the list):
- They cleaned their data by removing missing values, duplicates, spam, and offensive language.
- They formatted their data by converting text into numerical vectors using techniques such as bag-of-words, term frequency-inverse document frequency (TF-IDF), or word embeddings.
- They split their data into training and testing sets.
- They selected their features by applying feature engineering techniques, such as creating new variables based on sentiment lexicons, n-grams, or part-of-speech tags.
- They reduced their features by applying feature selection techniques, such as filtering based on chi-square or mutual information, or using wrapper or embedded methods.
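As a hedged example, the TF-IDF step (with a simple frequency-based feature reduction via `min_df`) could be written with scikit-learn as follows, reusing the hypothetical `reviews` frame from above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Split first so the vectorizer is fit on training text only
train_text, test_text, y_train, y_test = train_test_split(
    reviews["text"], reviews["label"], test_size=0.2, random_state=42
)

# Format: TF-IDF vectors over unigrams and bigrams; min_df filters rare terms,
# a simple frequency-based form of feature reduction
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5, stop_words="english")
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)
```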
Modeling: The website applied various modeling techniques to their prepared data, such as naive Bayes, k-nearest neighbors, logistic regression, support vector machines, decision trees, random forests, and neural networks.
- They selected the appropriate parameters, methods, and tools for each technique, such as smoothing, distance metrics, regularization, kernels, pruning, bagging, boosting, and activation functions.
- They used cross-validation and grid search to optimize their hyperparameters and avoid overfitting or underfitting, as sketched below.
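For instance, tuning naive Bayes's smoothing parameter `alpha` by cross-validated grid search might look like this, assuming the sentiment labels are encoded as 0 (negative) and 1 (positive):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Tune the smoothing parameter alpha with 5-fold cross-validation
grid = GridSearchCV(
    MultinomialNB(),
    {"alpha": [0.1, 0.5, 1.0, 2.0]},
    cv=5,
    scoring="f1",  # assumes labels encoded as 0 (negative) / 1 (positive)
)
grid.fit(X_train, y_train)
sentiment_model = grid.best_estimator_
```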
Evaluation: The website evaluated the performance and validity of their models against their data mining goals and business objectives.
- They used various metrics, such as accuracy, precision, recall, F1-score, ROC-AUC, and confusion matrix, to compare and rank their models (see the sketch after this list).
- They also used various methods, such as learning curves, residual plots, and feature importance, to assess the robustness and interpretability of their models.
- They selected the best model or models for deployment based on their evaluation results.
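A compact way to produce these per-class metrics is scikit-learn's classification_report; the class names below assume the 0/1 label encoding from the modeling sketch:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = sentiment_model.predict(X_test)

# Per-class precision, recall, and F1, plus overall accuracy
print(classification_report(y_test, y_pred, target_names=["negative", "positive"]))
print(confusion_matrix(y_test, y_pred))
```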
Deployment: The website deployed their selected model or models into the operational environment, where they could be used to analyze the sentiment of movie reviews posted by their users.
- They also developed a dashboard and a report to present their findings and recommendations to the stakeholders.
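To make the deployed model usable from the website's code, a small scoring helper could wrap the fitted vectorizer and model from the sketches above; the function name and example review are illustrative:

```python
def score_review(text: str) -> float:
    """Return the model's probability that a review is positive."""
    features = vectorizer.transform([text])
    return float(sentiment_model.predict_proba(features)[0, 1])

# Usage
print(score_review("A wonderful film with a moving story."))
```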