
The Data Science Journey: From Problem Definition to Model Deployment - An Overview

In the world of data science and machine learning, a structured approach is paramount to success. This article will guide you through the intricacies of the data science pipeline, from defining the problem to deploying a machine learning model. It breaks each step down into understandable terms while demystifying the highly technical aspects that drive these processes.


Problem Definition

Defining the problem is a critical phase in data science that sets the direction for the entire project. It involves gaining domain knowledge, formulating a clear problem statement, defining the expected outcome, defining the scope and constraints, framing the problem for data-driven analysis and modeling, understanding data requirements and availability, conducting a feasibility assessment, and aligning with stakeholders. A well-defined problem provides a solid foundation for subsequent steps in the data science pipeline, from data collection and preprocessing to model development and deployment.


Data Collection

Identify data sources and collection methods, which may include databases and data warehouses, APIs, web scraping, sensor data, social media, and text and documents. Data can be either:

  • Structured data: Data that is organized and presented in a highly organized, tabular format where each data point or observation is neatly structured into rows and columns. E.g., Relational Databases, spreadsheets, and CSV files.
  • Unstructured data: Data that lacks a predefined structure or format, making it more challenging to analyze and process using traditional techniques. E.g., text, images, audio, or videos.

Most traditional ML models require unstructured data to be converted into structured data during preprocessing. Converting unstructured data into structured data involves extracting meaningful information and organizing it into a structured format.


Exploratory Data Analysis (EDA) and Data Preprocessing

The EDA involves understanding, visualizing, and analyzing data to gain insights into its characteristics and relationships. Data preprocessing involves cleaning, transforming, and organizing raw data into a format that is suitable for the next phases. The specific methods and techniques to be used will depend on the nature of the data, the machine-learning (ML) algorithms that will be employed, and the objectives of the analysis. Proper data analysis and preprocessing can significantly impact the quality and effectiveness of the ML models.

This is an iterative process that ensures the data is accurate, complete, and in the right format for the algorithms that will be used in the latter phases of the data science pipeline. This plays a crucial role in preparing the data for subsequent modeling and decision-making processes, helping data scientists make informed choices and identify potential issues early in the analysis pipeline. Once the data is loaded for analysis, various steps are involved.

Variable Identification

As the first step, we need to understand the types of variables in the dataset. Variables can be broadly categorized as continuous or categorical. Identifying variable types helps to choose appropriate techniques for analyzing and visualizing data.

  • Continuous data: These are numerical variables that can take on an infinite number of values within a range. E.g., age, income, or temperature.
  • Categorical data: These represent categories or labels and can take on a limited number of distinct values. E.g., gender, color, or product type.
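
As a quick illustration, here is a minimal pandas sketch, assuming a hypothetical file customers.csv, that separates columns into continuous and categorical candidates based on their data types:

```python
import pandas as pd

# Hypothetical dataset; replace the file and columns with your own.
df = pd.read_csv("customers.csv")

# Numeric dtypes are treated as continuous candidates; object/category
# dtypes as categorical candidates.
continuous_cols = df.select_dtypes(include=["number"]).columns.tolist()
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

print("Continuous:", continuous_cols)
print("Categorical:", categorical_cols)

# Caveat: low-cardinality integer columns (e.g., a 0/1 flag) may really be
# categorical, so review the lists manually (df.nunique() helps here).
```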

Understanding Data

Once the variables are identified, we need to gain a deeper understanding of the data. This involves summarizing the data, examining its distribution, and removing duplicates.

  • Data Summary: For continuous data - basic statistics like mean, median, standard deviation, and quartiles; For categorical data - frequency distribution of the categories.
  • Data Distribution: Visualizing the distribution of data through histograms (for continuous data) or bar charts (for categorical data). This helps to identify patterns and skewness.
  • Removing Duplicates: Detecting and eliminating duplicate records from further analysis to ensure data quality and consistency, data accuracy, and unbiased analysis.
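
A minimal sketch of these steps with pandas, again assuming a hypothetical customers.csv and a product_type column:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical file

# Summary statistics: continuous columns get mean/std/quartiles,
# categorical (object) columns get counts and most frequent values.
print(df.describe())
print(df.describe(include=["object"]))

# Frequency distribution of an assumed categorical column.
print(df["product_type"].value_counts())

# Detect and remove exact duplicate rows.
print("Duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()
```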

Handling Missing Values

Missing data is a common issue in real-world datasets. Missing values can be problematic for many ML algorithms, as they may not handle them well. Missing values should be treated according to the context, using techniques such as the following.

  • Removing rows (data entries) with missing values: This is appropriate when the missing data is minimal and won't significantly impact the analysis. However, careful attention is required as removing rows with missing values related to certain variables or patterns in the data may introduce bias and distort the analysis.
  • Removing columns (entire features) with missing values: If many data points are missing in a particular feature, it may be better to remove the entire feature from further analysis. However, feature importance analysis and/or domain expertise should inform such decisions.
  • Imputation: Imputing missing values with estimated or predicted values. Common methods include mean, median, mode imputation, or more sophisticated techniques like regression imputation.
  • Use appropriate ML techniques: Some ML algorithms can effectively handle missing values (e.g., Decision tree-based algorithms). Therefore, it is needed to consider the intended ML algorithms to be used before removing/imputing missing values at this phase.
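
The sketch below illustrates the first three options with pandas and scikit-learn; the file and column names (customers.csv, income, optional_notes) are assumptions for illustration:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("customers.csv")  # hypothetical file

# Inspect how much is missing in each column.
print(df.isna().sum())

# Option 1: drop rows that are missing a critical field (assumed column).
df = df.dropna(subset=["income"])

# Option 2: drop a column that is mostly empty (assumed column).
df = df.drop(columns=["optional_notes"])

# Option 3: impute remaining numeric gaps with the column median.
num_cols = df.select_dtypes(include=["number"]).columns
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
```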

Visualization and Analysis

Visualizing and analyzing data uncovers patterns and relationships. This is done using appropriate visualizations and statistical methods for univariate, bivariate, and multivariate analysis:

  • Univariate Analysis: Examining individual variables in isolation. Tools such as histograms, box plots, bar charts, and summary statistics are used to understand each variable's distribution and characteristics.
  • Bivariate Analysis: Analyzing the relationships between pairs of variables. Scatter plots, correlation matrices, and stacked bar charts can reveal how two variables interact or correlate with each other.
  • Multivariate Analysis: Exploring interactions among three or more variables. Techniques like heatmaps, 3D plots, and dimensionality reduction methods (e.g., PCA) help visualize complex relationships.
  • Statistical tests, such as t-tests or chi-squared tests, may be applied to assess the significance of observed relationships or differences in the data.
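
A small pandas/matplotlib sketch of univariate, bivariate, and multivariate views, assuming hypothetical age and income columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")  # hypothetical file and columns

# Univariate: distribution of a single continuous variable.
df["age"].plot(kind="hist", bins=30, title="Age distribution")
plt.show()

# Bivariate: relationship between two continuous variables.
df.plot(kind="scatter", x="age", y="income", title="Age vs. income")
plt.show()

# Multivariate: correlation matrix across all numeric features.
print(df.select_dtypes(include=["number"]).corr())
```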

Dealing with Outliers

Outliers are data points that deviate substantially from the majority of the data and can skew analysis results. Detecting and handling outliers is an essential part of the data science pipeline.

Common methods for outlier detection include Z-score analysis, the Interquartile Range (IQR) method, and visual inspection through box plots or scatter plots. Once the outliers are identified, further investigations are needed to understand the nature of the outliers and whether they are genuine extreme values or data entry errors. Domain expertise is utilized where applicable to gain insights into the data and determine the appropriate course of action.

Outliers should be treated according to the context; they can be removed, transformed, or otherwise handled based on the nature of the data and the specific analysis goals. Some examples follow (a worked IQR example is sketched after this list):

  • Removing outliers: Where outliers are identified as data entry errors, measurement errors, or anomalies that are not representative of the underlying population, they can be removed from the dataset. However, this removal can cause a loss of information.
  • Transforming data: To reduce the impact of the outliers but still retain the information that they contain, the transformation of the features can be done. This technique is effective when outliers have a skewed effect on the distribution of data, making it more symmetric. Appropriate transformation is to be used, e.g., logarithmic, square root, or reciprocal transformations.
  • Imputing outliers: We can use data imputation when we want to retain the data points but reduce their impact on analysis. Imputing replaces outlier values with estimated or imputed values based on the distribution of the data.
  • Data capping: We can set a threshold on the data and replace all values above/below the threshold with the threshold value itself. Although this method preserves some information about the outliers, it can bias the dataset.
  • Categorizing outliers: In contrast to all the above methods, we can preserve the data as it is and introduce a new categorical variable to flag the outliers. This approach is effective when outliers represent a distinct group or have unique characteristics that are relevant to the analysis.
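
Here is the worked IQR sketch referenced above, showing detection, capping, and flagging on an assumed income column:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical file

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (df["income"] < lower) | (df["income"] > upper)
print(f"{is_outlier.sum()} potential outliers found")

# Capping: clip extreme values to the fence values.
df["income_capped"] = df["income"].clip(lower=lower, upper=upper)

# Categorizing: keep the values but flag them with a new binary feature.
df["income_is_outlier"] = is_outlier.astype(int)
```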

Scaling or Normalizing Numerical Variables

Different ML algorithms can be sensitive to the scale of numerical features. Scaling or normalizing ensures that all features have the same scale, preventing some features from dominating others during model training. Common scaling methods include Min-Max scaling (scaling features to a specific range) and z-score standardization (scaling to have a mean of 0 and a standard deviation of 1).
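
A brief scikit-learn sketch of both scaling approaches, assuming hypothetical age and income columns:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv("customers.csv")   # hypothetical file
num_cols = ["age", "income"]        # assumed numeric features

# Min-Max scaling: each feature squeezed into the [0, 1] range.
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

# Or z-score standardization: mean 0, standard deviation 1.
# df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# In practice, fit the scaler on the training split only and reuse it
# to transform the validation and test splits, to avoid data leakage.
```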

Encoding Categorical Variables

Many ML algorithms require numerical input, so categorical variables need to be converted into a numerical format. Common encoding techniques for categorical variables include:

  • One-hot encoding: creating binary columns for each category.
  • Label encoding: assigning a unique number to each category.
  • Ordinal encoding: assigning numerical values to categories based on their predefined ordinal relationship.
  • Binary encoding: convert each category into binary code (0s and 1s) and create separate binary columns for each digit in the binary representation.
  • Frequency encoding: replace categories with their corresponding frequencies or counts in the dataset.
  • Target/mean encoding: replace categories with the mean of the target variable for each category. Often used in classification problems.

The choice of encoding method depends on the nature of the categorical data, the relationship between categories, and the requirements of the ML algorithm.
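
To make this concrete, here is a small sketch of three of these encodings, with assumed column names (color, size, city) and an assumed ordering for the ordinal feature:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.read_csv("customers.csv")  # hypothetical file and columns

# One-hot encoding: one binary column per category of 'color'.
df = pd.get_dummies(df, columns=["color"])

# Ordinal encoding: map 'size' to integers using an assumed ordering.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()

# Frequency encoding: replace each 'city' with its relative frequency.
city_freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(city_freq)
```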

Addressing Imbalanced Data

Imbalanced datasets occur when one class or category dominates the data. When we refer to "imbalanced data," we are usually talking about an imbalance in the target variable, that is, the variable being predicted or classified. This imbalance leads to biased model training. Techniques for addressing imbalanced data include:

  • Handle at data level: If possible, collecting more data for the minority class can balance the dataset naturally.
  • Resampling: Oversampling the minority class (increasing the number of minority-class instances by duplicating them or generating synthetic samples) or undersampling the majority class (reducing the number of majority-class instances to balance the class distribution).
  • Use different evaluation metrics: NB: This is to be used in the "Model Evaluation" phase. We can use the dataset as it is and later use appropriate evaluation metrics instead of relying solely on accuracy as the evaluation metric. Common evaluation metrics include Precision, Recall, F1-Score, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC).
  • Employ advanced algorithms designed for imbalanced data: NB: This is to be used in the "Model Selection" phase. Some machine learning algorithms are designed to handle imbalanced datasets more effectively. These algorithms give more weight to the minority class or adapt their decision boundaries to address class imbalance. Examples include Random Forest with Class Weights, Gradient Boosting with Cost-Sensitive Learning, and Support Vector Machines (SVM) with Custom Kernels. We can further improve by employing ensemble methods, such as Bagging and Boosting ( where the base learners address class imbalance).
  • Hybrid approach: Incorporating multiple techniques, such as oversampling, undersampling, and employing advanced algorithms, can often lead to better results than using any single technique.
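
A minimal resampling sketch using scikit-learn's resample utility, assuming a hypothetical transactions dataset with a binary is_fraud target:

```python
import pandas as pd
from sklearn.utils import resample

df = pd.read_csv("transactions.csv")  # hypothetical file; 'is_fraud' is the target

majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Oversample the minority class (with replacement) up to the majority size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["is_fraud"].value_counts())

# Alternatively, many estimators accept class weights instead of resampling,
# e.g., RandomForestClassifier(class_weight="balanced").
```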

Handling Skewed Data

In data science, "skewed data" refers to a situation where the distribution of data points within a dataset exhibits a significant skewness or deviation from a normal distribution. Skewed data distributions can affect the performance of some algorithms, especially those sensitive to the distribution of data (e.g., Linear Regression, Logistic Regression, Principal Component Analysis (PCA)).

Popular strategies to handle skewed data include:

  • Data transformation: Logarithmic, power, or inverse transformations to make the distribution more symmetric.
  • Employ Robust Algorithms: NB: This is to be used in the "Model Selection" phase. Rather than performing explicit data transformation, we can use ML algorithms that are less sensitive to skewed data, e.g., tree-based algorithms, neural networks, support vector machines (SVMs), and clustering algorithms.
  • Work on features: Removing highly skewed features (which should be followed by a proper feature selection process) and introducing new features derived from the original skewed features can reduce skewness.
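
A short sketch of the transformation approach, assuming a non-negative, right-skewed income column:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical file

# Skewness far from 0 indicates an asymmetric distribution.
print("Before:", df["income"].skew())

# Log transform (log1p handles zeros safely; assumes non-negative values)
# pulls in a long right tail and makes the distribution more symmetric.
df["income_log"] = np.log1p(df["income"])
print("After :", df["income_log"].skew())
```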

Feature Selection or Dimensionality Reduction

Feature selection is the process of choosing a subset of the most relevant features from a larger set of available features in the dataset. The objective is to retain the essential information while discarding irrelevant or redundant features. Some methods used in feature selection are:

  • Correlation Analysis: Measures the statistical relationship between pairs of features and the target variable. Features with low or near-zero correlation with the target variable may be candidates for removal.
  • Univariate Feature Selection: Evaluate each feature's relationship with the target variable independently. Techniques like chi-squared tests, ANOVA, or mutual information can be used to rank and select features based on their statistical significance.
  • Recursive Feature Elimination: An iterative method that starts with all features and systematically eliminates the least important features based on a chosen model's performance.
  • Feature Importance from Tree-Based Models: Tree-based models like Random Forest or Gradient Boosting provide feature importance scores. You can use these scores to identify and select the most important features.
  • L1 Regularization (Lasso): encourages sparsity by penalizing the absolute values of feature coefficients in linear models like Linear Regression or Logistic Regression. This results in some features having coefficient values of exactly zero, effectively selecting a subset of features.

Dimensionality reduction aims to reduce the number of features (dimensions) in the dataset while preserving as much of the original data's variance or information as possible. This is particularly valuable when dealing with high-dimensional data or when simplifying complex datasets. Two common techniques for dimensionality reduction are:

  • Principal Component Analysis (PCA): Linear dimensionality reduction technique that transforms the original features into a new set of orthogonal (uncorrelated) features called principal components.
  • t-Distributed Stochastic Neighbor Embedding (t-SNE): Non-linear dimensionality reduction technique that is particularly useful for visualizing high-dimensional data in a lower-dimensional space.

The choice between these techniques depends on the nature of the data, the goals of the analysis, and the requirements of the ML model.
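
The sketch below shows one feature selection method (univariate selection with SelectKBest) and one dimensionality reduction method (PCA) in scikit-learn, assuming an all-numeric feature matrix and a hypothetical churned target; note that PCA generally works best on standardized features:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customers.csv")   # hypothetical file
X = df.drop(columns=["churned"])    # assumed all-numeric feature matrix
y = df["churned"]                   # assumed binary target

# Univariate feature selection: keep the 10 features most related to y.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print("Kept features:", X.columns[selector.get_support()].tolist())

# PCA on standardized features: keep components explaining 95% of variance.
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)
print("Reduced shape:", X_reduced.shape)
```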


Data Split

Data splitting is a fundamental step in machine learning and model development (particularly in supervised learning), and it involves dividing the dataset into distinct subsets, typically including a training set, a validation set, and a test set. These subsets serve different purposes and are essential for assessing and improving the model's performance.

  • Training Set: The training set is the largest portion of the dataset, typically comprising 60-80% of the data. It is used to train the machine learning model.
  • Validation Set: The validation set, which typically accounts for 10-20% of the data, is used for fine-tuning your model's hyperparameters.
  • Test Set: The test set, usually accounting for the remaining 10-20% of the data, serves as an independent dataset for evaluating the final performance of your trained model.

Proper data splitting is required to assess the model's generalization performance accurately, making it a critical step in the machine learning pipeline. It reduces the risk of overfitting and provides a reliable assessment of the model's real-world capabilities. Hence, there are key considerations for data splitting:

  • Randomness: It's common to randomly shuffle the dataset before splitting it into training, validation, and test sets to ensure that the subsets are representative and avoid any inherent ordering effects. In contrast, time-series data should not be shuffled in this split.
  • Stratified Splitting: In classification tasks with imbalanced classes, it may be beneficial to use stratified sampling to maintain the class distribution across the training, validation, and test sets.
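
A typical 60/20/20 split can be produced with two calls to scikit-learn's train_test_split; the sketch below uses synthetic placeholder data in place of the real preprocessed features and target:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic placeholder data; in practice use the preprocessed features
# and target produced by the earlier steps.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First carve out a held-out test set (20%), then split the remainder
# into training (60% of the total) and validation (20% of the total).
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

# stratify=y keeps the class distribution consistent across splits;
# omit it (and shuffling) for time-series data, where order matters.
print(len(X_train), len(X_val), len(X_test))  # 600, 200, 200
```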


Model Selection

Model selection is a fundamental aspect of the data science workflow that involves choosing the most appropriate algorithm or model for a specific task. There is a vast array of ML algorithms and models to choose from. Selecting the right ML model is crucial because different models have varying strengths, weaknesses, and suitability for different types of tasks and datasets. This decision significantly impacts the performance and success of the overall project. Making the wrong choice can lead to suboptimal results, longer development cycles, and increased computational costs. The following are the key considerations in the model selection process:

  • Machine learning task: The choice of model depends primarily on the type of machine learning task. Common tasks include: classification (assigning data points to predefined categories or classes), regression (predicting continuous numerical values), clustering (grouping similar data points based on their characteristics), anomaly detection (identifying rare and abnormal data points), and time series forecasting (predicting future values based on historical data).
  • Dataset size: The size of the dataset plays a role in model selection. Some models perform better with large datasets, while others are more suitable for smaller datasets. For example, deep learning models often require substantial data to generalize effectively, while simpler models like decision trees can work well with smaller datasets.
  • Data complexity: The nature and complexity of the data, including features and their relationships, influence model selection. For instance, linear models are suitable for data with linear relationships, while complex, non-linear models like neural networks may be needed for intricate patterns.
  • Model complexity vs. interpretability: Consider the trade-off between model complexity and interpretability. Complex models like deep neural networks can capture intricate patterns but may be challenging to interpret, while simpler models like linear regression and decision trees are more interpretable but may have limited capacity.
  • Computational resources: Consider the available computational resources, including processing power and memory. Deep learning models often demand significant computational resources, while simpler models are more resource-efficient.
  • Scalability: If the dataset is expected to grow over time, consider models that can scale efficiently. Algorithms such as neural networks, SVMs, ensemble models, and certain clustering algorithms handle large datasets well and support parallel processing in distributed computing environments.
  • Domain knowledge and constraints: Domain-specific knowledge and constraints, such as interpretability requirements or regulatory compliance, can guide model selection. Also, some domains have established best practices for certain types of problems.

The model selection process is typically guided by the information about the data identified during EDA. Start with a simple baseline model to establish a performance benchmark, then progressively explore more complex models. Perform a systematic evaluation of the candidate models and choose the one that achieves the best balance between performance, interpretability, and resource constraints.
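
As a sketch of such a systematic comparison, the snippet below scores a few candidate classifiers with 5-fold cross-validation on synthetic placeholder data; the candidate list and scoring metric are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder training data; substitute the real training split.
X_train, y_train = make_classification(n_samples=800, n_features=20, random_state=42)

# Compare a few candidate classifiers with 5-fold cross-validation.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```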


Model Training

Model training is the phase in which the selected machine learning model learns from the training data to capture underlying patterns and relationships.

The machine learning model uses the 'Training set' data to learn patterns and relationships that map features to target values. The model's parameters (weights and biases) are initialized before training begins. This choice of initialization can impact training convergence and the quality of the final model. The loss function (a.k.a., cost function or objective function) quantifies the error between the model's predictions and the actual target values. The goal of training is to minimize this error. Common loss functions include mean squared error (MSE) for regression and cross-entropy for classification.

Optimization algorithms, such as gradient descent or its variants (e.g., stochastic gradient descent), are used to update the model's parameters iteratively. These algorithms aim to find the parameter values that minimize the loss function. The training process continues until one or more convergence criteria are met, such as a specified number of epochs, achieving a satisfactory level of performance, or a predefined tolerance for the loss function. The 'Validation set' can also be used in this process to monitor how well the model generalizes to unseen data. In such cases, training may be stopped early if the model's performance on the 'Validation set' starts to degrade, preventing overfitting.

Once training is complete, the trained model is evaluated using the 'Test set' to assess its generalization performance on unseen data.
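
The following sketch trains a gradient boosting classifier on synthetic placeholder data, using an internal validation fraction for early stopping, which mirrors the convergence and early-stopping ideas described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder training data; substitute the real training split.
X_train, y_train = make_classification(n_samples=800, n_features=20, random_state=42)

# Gradient boosting minimizes a loss function iteratively; here an internal
# validation fraction is used to stop early once the score stops improving.
model = GradientBoostingClassifier(
    n_estimators=500,
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=42,
)
model.fit(X_train, y_train)
print("Boosting rounds actually used:", model.n_estimators_)
```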


Hyperparameter Tuning with Cross-Validation

Hyperparameters: Hyperparameters are settings or configurations that are not learned from the data but are set prior to training a machine learning model, e.g., learning rate, regularization strength, depth of a decision tree, number of hidden layers in a neural network. Hyperparameters significantly impact a model's performance and generalization. Therefore, finding the best hyperparameters is essential.

Cross-Validation: Cross-validation is a resampling technique that divides the dataset into multiple subsets, typically called folds. The most common form is k-fold cross-validation, where the dataset is divided into k equally sized parts (folds). The training process is repeated k times, each time using k-1 folds for training and one fold for validation. The model's performance (e.g., accuracy, mean squared error) is calculated for each validation fold.

Hyperparameter tuning with cross-validation helps find the best combination of hyperparameters for a model while avoiding overfitting. The following points explain how it works.

  • Hyperparameter grid: Define a range of hyperparameters to explore. E.g., specify a range of learning rates or a list of different values for the number of trees in a random forest. Techniques such as Grid Search and Random Search can be used.
  • Cross-validation: For each combination of hyperparameters in the defined range, perform k-fold cross-validation. Calculate the evaluation metric (e.g., accuracy) for each fold and get the average performance metric across all folds. Repeat this process for all hyperparameter combinations.
  • The best set of hyperparameters: Choose the combination of hyperparameters that yields the best average performance metric across all cross-validation runs. This set of hyperparameters is typically referred to as the "optimal" or "best" hyperparameters.
  • Final model: After selecting the best hyperparameters using cross-validation, train the final model using the entire training dataset with these optimal hyperparameters.

Hyperparameters can also be tuned using a dedicated 'Training Set' and 'Validation Set'; however, cross-validation ensures that hyperparameters are tuned in a way that reflects the model's performance on a variety of data subsets.
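
A grid search sketch with 5-fold cross-validation in scikit-learn, using synthetic placeholder data and an illustrative hyperparameter grid for a random forest:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder training data; substitute the real training split.
X_train, y_train = make_classification(n_samples=800, n_features=20, random_state=42)

# Hypothetical hyperparameter grid for a random forest.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,            # 5-fold cross-validation for each combination
    scoring="f1",
    n_jobs=-1,
)
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
best_model = search.best_estimator_  # refit on the full training set by default
```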


Model Evaluation

Once the model is trained, it is evaluated against the 'Test set'. This is done by making predictions on the 'Test set' using your trained model and calculating the evaluation metrics based on the predicted values and the ground truth (actual) values. This process assesses how well the trained model performs on new/unseen data and whether the model's performance meets the desired criteria and is suitable for the intended application. The choice of evaluation metrics depends on the specific ML task.

  • Classification: Accuracy, Precision, Recall, F1-Score, ROC Curve and AUC.
  • Regression: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared (R2) Score.
  • Clustering: Silhouette Score, Within-Cluster Sum of Squares (WCSS).
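
For a binary classification example, the metrics above can be computed as in the sketch below, which trains a placeholder model on synthetic data purely to have predictions to score:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)
from sklearn.model_selection import train_test_split

# Placeholder data and model; substitute the tuned model and real test split.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Predictions on the held-out test set.
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # scores needed for ROC-AUC

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_prob))
```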

Model evaluation is not a one-time process. Based on the evaluation, we need to decide whether to deploy the model, make improvements, or explore alternative approaches. E.g., If the model performance is below expectations, we may need to revisit data preprocessing, feature engineering, or hyperparameter tuning; If there are concerns about overfitting, consider collecting more data, applying regularization, or trying different algorithms.


Model Deployment & Maintenance

Once the above steps are successfully completed, the model is made available for practical use by integrating it into real-world applications or systems. This integration can take several forms such as API Integration, User Interface Integration, and Batch Processing.

Model deployment involves much more than simply using a trained model in a real-world context. It encompasses various technical, operational, and security considerations to ensure that the model functions effectively and reliably in production environments. Hence, this process requires attention to aspects such as: choosing the infrastructure, implementing load-balancing strategies to distribute incoming requests, version control, security and access control, compliance and regulations, maintaining proper documentation, continuous monitoring, gathering feedback, and updating the model with new data. Fine-tuning and retraining may be necessary to maintain accuracy.
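
As one illustration of API integration, the sketch below serves a saved model behind a FastAPI endpoint. The file names, input schema, and endpoint path are assumptions, and the model is assumed to have been saved earlier with joblib.dump(best_model, "model.joblib"):

```python
# serve.py -- hypothetical file names, schema, and endpoint
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # assumes the trained model was saved with joblib.dump

class CustomerFeatures(BaseModel):
    age: float
    income: float

@app.post("/predict")
def predict(features: CustomerFeatures):
    # The feature order must match what the model was trained on.
    prediction = model.predict([[features.age, features.income]])
    return {"prediction": int(prediction[0])}

# Run locally with: uvicorn serve:app --reload
```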

Data science is an ongoing journey of improvement. Therefore, we may need to stay adaptive by incorporating new techniques, data, and features to enhance the model's performance.


The data science journey, from problem definition to model deployment, is a meticulous and iterative process. It's where technology meets problem-solving, and with each step, we uncover valuable insights that can drive innovation and informed decision-making. Whether you're a seasoned data scientist or just starting out, this structured approach ensures you make the most of your data-driven endeavors.
