Classification Models in Machine Learning
Ravi Kiran Reddy Chinta
Data Scientist | Machine Learning Engineer | Passionate about Data-Driven Solutions and Advanced Analytics
Classification plays a pivotal role in data analysis, enabling data organization into predefined categories. This facilitates the interpretation and analysis of data, making it more manageable and insightful.
In a business context, classification can be instrumental in identifying customer segments, predicting customer behavior, and detecting fraudulent transactions. Accurate data classification enhances decision-making processes and improves operational efficiency, ultimately contributing to more effective business strategies and outcomes.
Classification models can assign data to two or more classes:
Binary Classification: The model outputs one of two possible classes. Examples include spam vs. not spam in email filtering and disease vs. no disease in medical diagnosis.
Multiclass Classification: The model outputs one of three or more possible classes. Examples include categorizing a news article into topics like politics, sports, or entertainment, or classifying an image such as a cat, dog, or bird.
Classification has a wide range of applications in our day-to-day life, such as sentiment analysis, healthcare, finance, marketing, and decision-making processes (for example, routing customer queries to the appropriate department).
Types of Classification Models
Logistic Regression
Logistic Regression is a linear model used for binary classification tasks. It estimates the probability of a class label using a logistic function.
Parameters:
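As an illustration, here is a minimal sketch using scikit-learn's LogisticRegression (an assumed library choice); the settings shown are hypothetical examples of commonly tuned parameters such as the regularization strength C and the penalty type.

```python
# Minimal sketch: logistic regression with scikit-learn (assumed library choice).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# C is the inverse of the regularization strength; smaller values mean stronger regularization.
clf = LogisticRegression(C=1.0, penalty="l2", solver="lbfgs", max_iter=1000)
clf.fit(X_train, y_train)

# predict_proba returns the class probabilities estimated by the logistic function.
print(clf.predict_proba(X_test[:3]))
print("accuracy:", clf.score(X_test, y_test))
```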
Support Vector Machines (SVM)
SVMs find the hyperplane that best separates the classes in the feature space. They work well for both linear and non-linear classification tasks using kernel functions.
Parameters:
In the case of multiclass classification, a separate SVM classifier can be trained for each class (one-vs-rest), treating that class as positive and all other classes as negative.
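A minimal sketch, assuming scikit-learn's SVC with an RBF kernel, could look like this; the C and gamma values are illustrative, not tuned.

```python
# Minimal sketch: SVM classification with an RBF kernel (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, n_classes=3,
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# C controls the margin/misclassification trade-off; gamma controls the RBF kernel width.
# Note: SVC handles multiclass internally with a one-vs-one scheme;
# OneVsRestClassifier(SVC(...)) is the one-vs-rest variant described above.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```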
Decision Trees
Decision Trees split the data into subsets based on the value of input features, leading to a tree-like model of decisions and their possible consequences.
Pruning is used in decision tree models to prevent overfitting and improve generalization. After the tree is fully grown (and has likely captured noise), branches are cut back: a subtree is removed and replaced with a leaf node (the average value for regression, the mode for classification) whenever doing so reduces the cost function.
Parameters:
Min samples split and min samples leaf: higher values can prevent the model from learning overly specific patterns (reducing overfitting), but values that are too high may cause underfitting.
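A minimal sketch, assuming scikit-learn's DecisionTreeClassifier, shows how these parameters and cost-complexity pruning might be set (the values are illustrative).

```python
# Minimal sketch: a decision tree with pruning-related parameters (scikit-learn assumed).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# min_samples_split / min_samples_leaf limit how finely the tree can split;
# ccp_alpha applies cost-complexity (post-)pruning: larger values remove more subtrees.
tree = DecisionTreeClassifier(max_depth=5, min_samples_split=10,
                              min_samples_leaf=5, ccp_alpha=0.01, random_state=42)
tree.fit(X_train, y_train)
print("depth:", tree.get_depth(), "accuracy:", tree.score(X_test, y_test))
```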
Random Forest
Basic Concept: Random Forest is an ensemble method that builds multiple decision trees and merges them to improve classification accuracy and control overfitting.
Parameters:
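A minimal sketch, assuming scikit-learn's RandomForestClassifier, with hypothetical parameter values:

```python
# Minimal sketch: a random forest ensemble (scikit-learn assumed).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators is the number of trees; max_features limits how many features
# each split considers, which is what de-correlates the individual trees.
forest = RandomForestClassifier(n_estimators=200, max_depth=None,
                                max_features="sqrt", random_state=42)
forest.fit(X_train, y_train)
print("accuracy:", forest.score(X_test, y_test))
```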
Gradient Boosting Machines (GBM)
It begins by initializing the ensemble with a single model, typically a shallow decision tree. It then iteratively adds new models that predict the residual errors of the current ensemble, continuing until a stopping criterion is met. By combining many weak models, prediction accuracy improves because each new model corrects the errors made by the ones before it.
Key Parameters:
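A minimal sketch, assuming scikit-learn's GradientBoostingClassifier, with illustrative values for the usual knobs:

```python
# Minimal sketch: gradient boosting with scikit-learn's GradientBoostingClassifier (assumed).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# learning_rate shrinks each tree's contribution; n_estimators is the number of
# sequential trees; a small max_depth keeps each individual learner weak.
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 max_depth=3, random_state=42)
gbm.fit(X_train, y_train)
print("accuracy:", gbm.score(X_test, y_test))
```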
XGBoost
XGBoost uses a more regularized model formulation to control overfitting, handles missing values natively, and achieves significant computational speedups. It can offer advantages on smaller datasets where regularization and feature-importance insights matter most.
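A minimal sketch, assuming the xgboost package is installed, showing the explicit regularization terms (illustrative values):

```python
# Minimal sketch: XGBoost with explicit regularization terms (xgboost package assumed installed).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# reg_lambda (L2) and reg_alpha (L1) are the regularization terms that help control
# overfitting; missing values in X are handled natively during training.
xgb = XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=4,
                    reg_lambda=1.0, reg_alpha=0.0, eval_metric="logloss")
xgb.fit(X_train, y_train)
print("accuracy:", xgb.score(X_test, y_test))
```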
LightGBM
LightGBM uses Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) to filter out data instances and features that contribute little to improving the model. It excels on large datasets where training speed is critical and techniques like GOSS shine.
Both algorithms handle missing values by assigning them to whichever side of each split reduces the loss the most.
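A minimal sketch, assuming the lightgbm package is installed, with illustrative parameter values:

```python
# Minimal sketch: LightGBM classifier (lightgbm package assumed installed).
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# num_leaves controls tree complexity (leaf-wise growth); like XGBoost, missing
# values are routed to whichever side of a split reduces the loss the most.
lgbm = LGBMClassifier(n_estimators=300, learning_rate=0.05, num_leaves=31)
lgbm.fit(X_train, y_train)
print("accuracy:", lgbm.score(X_test, y_test))
```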
KNN
As the name suggests, this model predicts the target using the values of the k nearest neighbors (the average for regression; the mode, i.e., majority vote, for classification).
Parameters:
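A minimal sketch, assuming scikit-learn's KNeighborsClassifier; the value of k and the weighting scheme are illustrative.

```python
# Minimal sketch: k-nearest neighbors classification (scikit-learn assumed).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_neighbors is k; weights="distance" gives closer neighbors a larger vote.
# Scaling matters because KNN relies on raw distances between points.
knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=5, weights="distance"))
knn.fit(X_train, y_train)
print("accuracy:", knn.score(X_test, y_test))
```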
Naive Bayes
Naive Bayes is based on Bayes' theorem with the "naive" assumption that features are independent given the class label. Despite this strong assumption, it performs surprisingly well in practice, especially for text classification tasks.
Parameters:
Alpha (Laplace smoothing): smooths the probability estimates and avoids zero probabilities, which is especially useful when some feature values never occur with a given class in the training data.
This model is generally used for text classification and sentiment analysis, and it performs well with high-dimensional data (e.g., images, text).
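A minimal sketch, assuming scikit-learn's MultinomialNB on a tiny hypothetical spam corpus, to show where the alpha smoothing term fits:

```python
# Minimal sketch: Multinomial Naive Bayes for text classification (scikit-learn assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A tiny hypothetical corpus purely for illustration.
texts = ["free prize, claim now", "meeting at noon tomorrow",
         "win cash instantly", "project update attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# alpha is the Laplace smoothing term that avoids zero probabilities
# for words never seen with a given class during training.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)
print(model.predict(["claim your free cash prize"]))
```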
Model Evaluation Metrics
Confusion Matrix
A table showing the number of true positives, true negatives, false positives, and false negatives.
From the confusion matrix, we can derive the following metrics:
1. Accuracy
The proportion of correctly predicted instances out of the total instances.
Accuracy = (True Positives + True Negatives) / Total
It is useful when classes are balanced but can be misleading with imbalanced datasets.
2. Precision
The proportion of true positive predictions out of all positive predictions made by the model.
Precision = True Positives / (True Positives + False Positives)
It is important when the cost of false positives is high. For example, high precision ensures that legitimate emails are not misclassified as spam in spam detection.
3. Recall (Sensitivity or True Positive Rate)
The proportion of true positive predictions out of all actual positives.
Recall = True Positives / (True Positives + False Negatives)
It is important when the cost of false negatives is high. For example, high recall ensures that most patients with a condition are identified in medical diagnoses.
4. Specificity (True Negative Rate)
The proportion of true negative predictions out of all actual negatives.
Specificity = True Negatives / (True Negatives + False Positives)
It is important when the cost of false positives is significant.
5. F1 Score
The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both.
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Useful when balancing precision and recall, especially with imbalanced datasets.
AUC-ROC (Area Under the ROC Curve)
The ROC curve plots the true positive rate against the false positive rate across classification thresholds, and the area under it (AUC) summarizes how well the model separates the classes: 0.5 is no better than chance, 1.0 is perfect separation. Accuracy is often the first metric reported, but we should always look for the metrics best suited to the given problem.
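As a minimal sketch (assuming scikit-learn and a small set of hypothetical labels and scores), the metrics above can be computed directly:

```python
# Minimal sketch: computing the metrics above with scikit-learn (assumed library choice).
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                   # hypothetical labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # hypothetical hard predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]   # hypothetical scores for AUC

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("specificity:", tn / (tn + fp))   # derived from the confusion matrix counts
print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))
print("roc auc:", roc_auc_score(y_true, y_prob))
```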
Challenges in Classification
Overfitting occurs when a model learns the noise in the training data rather than the underlying pattern, resulting in poor performance on test data.
Underfitting happens when a model is too simple to capture the data's underlying trend, leading to poor training and test data performance.
Techniques like regularization, cross-validation, pruning, and early stopping can be used to mitigate overfitting.
Addressing underfitting may involve increasing model complexity, feature engineering, decreasing regularization, or extending training time.
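For example, a quick cross-validation check (a minimal sketch assuming scikit-learn) makes the gap between training and validation performance visible:

```python
# Minimal sketch: cross-validation to spot over/underfitting (scikit-learn assumed).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# A large gap between training accuracy and cross-validated accuracy suggests
# overfitting; consistently low scores on both suggest underfitting.
model = LogisticRegression(max_iter=5000)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("cv accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```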
Imbalanced datasets, where one class significantly outweighs another, can skew model performance by causing bias toward the majority class. Solutions include resampling techniques, adjusting class weights, generating synthetic data, and anomaly-detection framings. One option is SMOTE, which creates synthetic minority samples: it takes the difference between a minority point and one of its nearest minority neighbors, multiplies it by a random number in (0, 1), and adds the result to the original point.
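A minimal sketch of SMOTE, assuming the imbalanced-learn package is installed and using a hypothetical 9:1 imbalanced dataset:

```python
# Minimal sketch: rebalancing with SMOTE (imbalanced-learn package assumed installed).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A hypothetical 9:1 imbalanced dataset.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# SMOTE interpolates between a minority sample and one of its nearest minority
# neighbors (difference scaled by a random factor in (0, 1)) to create new points.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))
```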
Feature selection improves model performance by focusing on relevant features and reducing overfitting, using methods like filter, wrapper, and embedded techniques.
Feature engineering further enhances model accuracy by creating new features, handling missing values, and normalizing data.
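As one illustration of a filter method, here is a minimal sketch assuming scikit-learn's SelectKBest (the choice of k is illustrative):

```python
# Minimal sketch: a simple filter-method feature selection step (scikit-learn assumed).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# SelectKBest scores each feature independently (here with the ANOVA F-test) and
# keeps the k highest-scoring ones -- a filter technique; wrapper and embedded
# methods instead use the model itself to judge feature usefulness.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print("kept features:", X_selected.shape[1], "of", X.shape[1])
```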
Conclusion
In summary, we discussed the basic intuition behind each model and the parameters that affect its performance.
Classification models are crucial for organizing data into predefined categories, enhancing decision-making and operational efficiency across various domains.
From binary and multiclass classification tasks to specialized algorithms like logistic regression, SVM, decision trees, random forests, and gradient boosting machines, each method offers unique strengths and is suited to different types of problems and datasets.
Techniques like XGBoost and LightGBM further refine model performance, with XGBoost excelling where regularization and feature-importance insights matter, and LightGBM offering impressive speed and efficiency on large datasets.
Key evaluation metrics such as accuracy, precision, recall, and the ROC curve are essential for assessing model performance and ensuring the models meet specific needs.
Addressing challenges like overfitting, underfitting, and imbalanced datasets through appropriate strategies can significantly enhance model effectiveness.
Ultimately, the choice of classification model and evaluation metrics should align with the problem and data characteristics, ensuring that the resulting insights drive more informed and effective business strategies.
Questions to think about:
What is the difference between classification and regression?
How do I choose the suitable classification model for my data?
Can classification models be used for unsupervised learning tasks?