Standard Machine Learning (ML) Procedure

Machine Learning (ML) is one of the most in-demand and fastest-growing techniques in the industry, yet it still lacks a concrete standard procedure and set of practices for building efficient ML models and applications. The result is often a weak, inefficient system prone to failure in the long run.

At the same time, a capitalist, fast-growing economy requires reliable and efficient systems to analyze customer behavior and extract actionable insights, so that businesses can make profitable decisions in a competitive and fast-changing market.

Machine Learning uses mathematical models to analyze new data, and these models are improved by retraining on newly generated data. Training a model involves separating the dataset into a training set and a testing set to validate the model's accuracy. Beyond the ML requirements themselves, a few more aspects need attention, including improving the IT infrastructure and environment, to ensure the model is implemented successfully before deployment. To work efficiently on an ML project, teams adopt well-defined development lifecycle models that drive the project forward and complete specific tasks; some of the most popular and widely accepted process models are discussed below.

CRISP-DM — CRoss Industry Standard Process for Data Mining

CRISP-DM is an industry-accepted procedure that serves as the base for many data science projects. It was proposed in 1999 to standardize the data mining process across the industry. One important element missing from CRISP-DM is the definition of roles and responsibilities for individual project members, which complicates project management and clouds the team's view of the workflow. It is technology-agnostic, i.e., a generalized procedure that is interoperable across various systems.

How CRISP-DM is implemented decides whether the methodology in use is Agile or Waterfall. Some consider it a rigid Waterfall process because of its heavy reporting requirements; if you commit the project to detailed upfront planning and choose not to adopt an iterative workflow, you are following Waterfall. At the same time, CRISP-DM also exhibits agile principles: the sequence of the workflow is not rigid, and moving back and forth between phases amounts to an agile approach.

In CRISP-DM, the machine learning process involves six phases:

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

[Figure: CRISP-DM: Machine Learning Procedure Loop]

Example: Anti-Money Laundering System…

Money laundering is the process of passing illegally obtained ("black") money through legitimate-looking channels so that it appears to be legal ("white") money.

  • From the point of view of CRISP-DM, the first phase is Business Understanding. In these systems, the primary task is to build models that can recognize illegal transactions that try to conceal their origin, ownership, or transaction details.
  • The next step is Data Understanding, which takes stock of the data available for mining. The biggest challenge data scientists face in this phase is collecting data from different sources and merging it into a consistent format. Cost and time are other factors to consider here. This phase is important in CRISP-DM because it prevents inappropriate preprocessing later in the Data Preparation phase: any ML project carries a high risk of failure if it is built on poor-quality data. A statistical description of the data is a must for successfully building any ML model, and data quality verification is an extremely crucial step; discarding low-quality data before it enters the pipeline reduces the risk of poor performance.
  • Here comes the most tedious and time-consuming part of CRISP-DM, the Data Preparation phase. Feature selection is one of its important steps: the more features are selected, the more complex the model becomes and the more samples are needed, a problem known as the curse of dimensionality. At the same time, selecting too few features is risky, since important features must not be ruled out and dropping features the target depends on can hurt model performance. Common feature selection approaches are the filter method (the ML model is not consulted when selecting features), the wrapper method (a learning model is used to score the significance of features), and the embedded method (feature selection is combined with classifier construction); the filter method is illustrated in the first sketch after this list. Discarded features should be well documented, with the decision based on a clear business understanding. Skewness and outlier handling must be done carefully to improve model results. Sampling techniques should be chosen to retain the statistical properties of the data, and oversampling or undersampling any class while discarding data must be avoided. Noise reduction filters unwanted signals from the dataset, but it carries the risk that erroneous filtering removes important parts of the data. Data imputation fills in missing values using measures such as the mean, mode, or median, or via model prediction, matrix factorization, or optimization. New features can also be constructed to improve model performance; methods like clustering, PCA, and auto-encoders are quite helpful here. Underutilized features should be dropped, as they increase the complexity and storage requirements of the model. In the case of neural networks, heavy manual feature engineering is usually less necessary, since the network can learn useful representations itself.
  • Once the data is clean and suits the requirements of the business, the Modeling phase begins. The choice of ML model depends on the business requirements and the data obtained. Different model specializations, architectures, and training and learning methods (such as ensemble, supervised, and unsupervised learning) are used to build the final model. Reproducibility has always been an issue with ML models; to tackle it, once the models have been trained and tested and produce the desired results, the results and models are saved along with the hyperparameters, algorithms, run-time environment description, and statistical dataset measurements. To check generalization, cross-validation is performed; it also helps with feature selection and hyperparameter optimization. Regularization is used to avoid over-fitting. Similarly, ensemble techniques can train multiple models and make decisions based on the aggregate of the individual models' outputs.
  • The Evaluation phase determines which model best meets the business requirements. To validate a trained model's accuracy, it is trained on a training set (a subset of the entire sampling space) and then tested on a testing set to measure accuracy on unseen data. There is a risk of data leaking from the test set, so it is best practice to hold the test set back, disjoint from the rest of the sampling space (the second sketch after this list shows this hold-out plus cross-validation workflow). Another important consideration is maintaining the statistical properties of the validation and testing sets and avoiding over-representation of any class in a particular split. Data scientists must also ensure the model neither overfits nor underfits the data. Sometimes a model performs very well on the frequent classes of the dataset, inflating its overall accuracy; such models must be rejected, because real-world data is noisy and generalization is being ignored in such cases. Finally, ML experts decide which model can be deployed. Performance metrics like accuracy, precision, and recall must be documented and compared against the business requirements and success criteria, and if the criteria are not met, backtracking to the modeling and data understanding phases becomes necessary.
  • At the very end, once the models are trained and the success criteria are met, the final phase is Deployment. Here the ML models are integrated into the existing environment and their efficiency is observed. Inference hardware and software should be chosen based on the business requirements and available computational resources, to make the ML application scalable and robust.
  • Last but not least, as new data keeps coming in, models must be retrained and hyperparameters re-tuned to reduce "model staleness". Even after deploying the best model, it is extremely important to ensure user acceptance and usability, and a fall-back plan must be prepared for outages and unforeseen error cases. Timely hardware and software updates are essential to avoid data and revenue loss.
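
To make the feature selection and imputation discussion above concrete, here is a minimal sketch of the filter method and median imputation using scikit-learn. The transaction columns and the tiny synthetic dataset are hypothetical placeholders, not data from a real AML system.

```python
# Minimal Data Preparation sketch: imputation + filter-method feature
# selection. Column names and data are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical transaction data with missing values.
df = pd.DataFrame({
    "amount": [120.0, np.nan, 890.5, 45.0, 5000.0],
    "n_transfers": [1, 4, np.nan, 2, 9],
    "account_age_days": [300, 15, 1200, np.nan, 7],
})
y = np.array([0, 1, 0, 0, 1])  # 1 = flagged as suspicious

# Data imputation: fill missing values with the column median.
X = SimpleImputer(strategy="median").fit_transform(df)

# Filter method: rank features by a statistic computed independently
# of any learning model (here, the ANOVA F-score) and keep the top k.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("Kept features:", df.columns[selector.get_support()].tolist())
```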
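
And here is a minimal sketch of the hold-out and cross-validation workflow described in the Modeling and Evaluation phases, assuming scikit-learn and a synthetic dataset; the model choice, hyperparameters, and run-log format are illustrative assumptions only.

```python
# Minimal Modeling/Evaluation sketch: hold back a disjoint test set,
# cross-validate on the training data only, and record hyperparameters
# and results for reproducibility.
import json
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold the test set back from the start to avoid test-set leakage;
# stratify to preserve the class distribution in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

params = {"n_estimators": 100, "max_depth": 5, "random_state": 42}
model = RandomForestClassifier(**params)

# Cross-validation on the training set only, to check generalization.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
model.fit(X_train, y_train)

# Save hyperparameters and results together for reproducibility.
run_log = {"params": params, "cv_mean": cv_scores.mean(),
           "test_accuracy": model.score(X_test, y_test)}
print(json.dumps(run_log, indent=2))
```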

Which process is considered a better approach? (Agile OR Waterfall)

On a lighter note, Agile methodology with vertical slicing (delivering a thin, end-to-end slice of functionality in each iteration) is considered a better fit for the ML model development problem, because:

  1. It provides user feedback at an early stage.
  2. It allows early evaluation of model performance.
  3. Necessary changes can be made as feedback and requirements evolve.

Benefits:

  1. Generalization — All data analysis projects begin with a business problem statement to be solved, and CRISP-DM provides a clear workflow for whatever data science activities take place.
  2. Implementation — CRISP-DM can be adopted by anyone without much training, and project management becomes easier with a CRISP-like methodology because of its iterations and clear phase identification.
  3. Flexibility — When implemented in an agile way, it is easy to iterate over the product multiple times and make changes as the business requirements evolve.

Weaknesses:

  1. Rigidity — As soon as a Waterfall methodology is adopted, the model becomes difficult to implement; the heavy documentation requirement is the first challenge data scientists face in that case.
  2. Management — CRISP-DM is not considered a true project management approach, as it is oriented toward a single person or small group and does not address the team coordination necessary for big projects.

Alternatives:

  1. SEMMA — SEMMA (Sample, Explore, Modify, Model, and Assess) was developed by SAS and was designed specifically to guide users through the tools in SAS Enterprise Miner for data mining problems. It has an extremely narrow focus on the technical steps of data mining: it skips the Business Understanding phase entirely and begins with the data preparation processes, and it does not cover ML deployment aspects. Still, it is a potentially useful process for following the data mining steps.
  2. KDD & KDDS — Back in 1989, Knowledge Discovery in Databases (KDD) was a popular practice for discovering knowledge through data mining, i.e., extracting patterns and information from large datasets using statistics and ML techniques. Like SEMMA, KDD gives little attention to the Business Understanding phase and the ML deployment aspects, and begins with the data preparation processes. In 2016, Knowledge Discovery in Data Science (KDDS) was published as an end-to-end process for delivering valuable solutions. KDDS expands upon KDD to address big-data problems; it defines four phases (assess, architect, build, and improve) and five processes (plan, collect, curate, analyze, and act).


CRISP-ML(Q) — CRoss Industry Standard Process for Machine Learning with Quality Assurance

CRISP-ML(Q) is an advanced version of the pre-existing CRISP-DM process model, which, though widely accepted for data mining, fails to address the ML-specific tasks required to ensure high-quality ML products.

CRISP-ML(Q) consists of six different phases:

  1. Business and Data Understanding — Identifying the scope of the project, the success metrics, and the feasibility of the project is the primary goal of this phase. A clear understanding of the economic success criteria (such as KPIs) and the statistical metrics is critical for the success of the ML application; the ML canvas framework provides a structured way to capture them. As soon as the business problems are clearly defined, the challenging and tedious process of data collection and verification begins. Another critical requirement is statistical documentation of the data alongside the data requirements, since this step is the foundation for data quality assurance during the operational phases.
  2. Data Preparation — The second phase focuses on preparing the data for the modeling phase. Feature engineering, data cleaning, imputation, normalization, and standardization are a few crucial steps performed here. The data must be cleaned, and any unnecessary or uninformative features that do not satisfy the data requirement metrics discarded. Dealing with imbalanced classes (over-sampling or under-sampling) and with noisy and redundant data is part of this phase; noise reduction can be supported by unit tests on the data, and processes like scaling, normalization, and outlier detection help mitigate faulty data values. Depending on the model, feature engineering and data augmentation methods such as clustering and one-hot encoding are chosen. These processes avoid the risk of erroneous data flowing into the later phases of deployment. To ensure reproducibility, ML transformation pipelines are developed for data pre-processing and modeling (a minimal pipeline sketch follows this list).
  3. Model Engineering — In this phase, multiple ML models are trained on the collected data, shaped by the constraints and requirements from the Business and Data Understanding phase. Along with model selection, different model specializations, architectures, and training and learning methods (such as ensemble, supervised, and unsupervised learning) are used to build the final model. Reproducibility has always been an issue with ML models; to tackle it, once the models have been trained and tested and produce the desired results, the results and models are saved along with the hyperparameters, algorithms, run-time environment description, and statistical dataset measurements. Finally, the ML workflow is packaged into a pipeline to create a repeatable and reusable training process.
  4. Model Evaluation — The evaluation step follows the model engineering phase and is also known as offline testing. The best-performing model is chosen based on accuracy, robustness, scalability, degree of complexity, fairness, and other statistical measurements, and is evaluated on held-out data. Model robustness is checked by testing it on wrong and noisy inputs. A model validated this way earns trust in its efficiency; it can then be tested in the regulatory environment and help humans through assisted decision-making.
  5. Model Deployment — In this phase, the trained model is deployed into the existing system. Deploying an ML model means exposing it to real-time data and relying on its predictive behavior, whether as an interactive dashboard, a component plugged into a software kernel architecture, or a distributed web service. Defining the inference hardware, evaluating the model in a real-time environment, securing user acceptance, preparing a fall-back plan for outages, and choosing an accurate deployment strategy are all necessary to roll the model out.
  6. Monitoring and Maintenance — As soon as the ML model is rolled out into production, the main challenge engineers face is "model staleness": a drop in performance is usually observed once the model starts operating on unseen data, and hardware performance and the existing software stack also affect it. To overcome this, retraining the model on new data, updating hardware and software resources, and adjusting the model definition to the updated business use case are important steps of the maintenance phase. Constant monitoring and retraining help build robust and efficient ML applications (a minimal staleness check is sketched after this list).
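
The transformation pipeline mentioned in the Data Preparation and Model Engineering phases can be sketched with scikit-learn's Pipeline; the preprocessing steps and parameters below are assumptions for illustration, not a prescribed setup.

```python
# Minimal transformation-pipeline sketch: chaining preprocessing and
# the model so the whole training process is repeatable and reusable.
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # standardize features
    ("model", LogisticRegression(max_iter=1000)),  # final estimator
])

# Fitting the pipeline fits every preprocessing step on the training
# data only, and the same transformations are reapplied at inference.
pipeline.fit(X_train, y_train)
print("Held-out accuracy:", pipeline.score(X_test, y_test))
```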
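
Finally, here is a minimal sketch of a staleness check for the Monitoring and Maintenance phase; the tolerance value and the way the two accuracies are obtained are hypothetical choices, not a fixed standard.

```python
# Minimal model-staleness check: compare live accuracy against the
# accuracy recorded at deployment time and flag degradation.
def is_stale(baseline_accuracy: float,
             recent_accuracy: float,
             tolerance: float = 0.05) -> bool:
    """Return True when recent performance drops too far below baseline."""
    return (baseline_accuracy - recent_accuracy) > tolerance

baseline = 0.93  # accuracy recorded at deployment time (illustrative)
recent = 0.85    # accuracy on the latest labeled production batch
if is_stale(baseline, recent):
    print("Model staleness detected: schedule retraining on new data.")
```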

[Figure: Updated CRISP-DM: CRISP-ML(Q) Approach for Quality Assurance]

Even though an ordered framework is put forth, successful implementation of an ML deployment project requires constant iteration and updating of the outputs of later phases to build a better solution. Each phase of the framework comes with a quality assurance methodology.

CRISP-ML(Q) vs. CRISP-DM:

Before CRISP-ML(Q) was developed, the strengths and weaknesses of CRISP-DM were analyzed in order to build an improved procedure for ML model development. The result is an enhanced version with an extended model management phase, added so that pre-trained models can be retrained to improve their accuracy. More interconnections were built into the procedure to reinforce the iterative nature of ML development. Additional tasks, such as the Assess IT Infrastructure task derived from CRISP-DM's Assess Situation task, were added to CRISP-ML(Q); these were not part of CRISP-DM but are necessary for successful ML projects. Note that CRISP-ML(Q) is a process model, not an ML algorithm: it applies to ML projects in general, including regression and classification tasks that predict continuous and categorical outcomes respectively.

[Figure: Generic Tasks Performed by the Improved Process Model]

References:

  1. CRISP-ML(Q): MLOps Official Documentation
  2. Article: CRISP-DM
  3. Article: CRISP-ML(Q)
