Classification Models in Machine Learning


Classification plays a pivotal role in data analysis, enabling data organization into predefined categories. This facilitates the interpretation and analysis of data, making it more manageable and insightful.

In a business context, classification can be instrumental in identifying customer segments, predicting customer behavior, and detecting fraudulent transactions. Accurate data classification enhances decision-making processes and improves operational efficiency, ultimately contributing to more effective business strategies and outcomes.

Classification tasks fall into two broad types based on the number of output classes:

Binary Classification: The model outputs one of two possible classes. Examples include spam vs. not spam in email filtering and disease vs. no disease in medical diagnosis.

Multiclass Classification: The model outputs one of three or more possible classes. Examples include categorizing a news article into topics like politics, sports, or entertainment, or classifying an image such as a cat, dog, or bird.

Classification has a wide range of applications in day-to-day life, including sentiment analysis, healthcare, finance, marketing, and decision-making processes such as routing queries to the appropriate department.



Types of Classification Models


Logistic Regression

Logistic Regression is a linear model used for binary classification tasks. It estimates the probability of a class label using a logistic function.


Parameters:

  1. Regularization (C): Controls the strength of the regularization term. A smaller C value increases regularization, which can prevent overfitting but might underfit. A larger C value decreases regularization, potentially improving fit but risking overfitting.
  2. Solver: The algorithm used to optimize the model. Common options include 'liblinear', 'newton-cg', and 'lbfgs'. Different solvers have different convergence behavior, so the choice of solver can affect both the speed and the quality of the fit.
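
Below is a minimal scikit-learn sketch showing how C and the solver are set; the synthetic dataset and the particular values are purely illustrative.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (placeholder for a real dataset)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# C is the inverse regularization strength; solver selects the optimization algorithm
clf = LogisticRegression(C=1.0, solver="lbfgs", max_iter=1000)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print("Class probabilities for the first test sample:", clf.predict_proba(X_test[:1]))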




Support Vector Machines (SVM)

SVMs find the hyperplane that best separates the classes in the feature space. They work well for both linear and non-linear classification tasks using kernel functions.


Parameters:

  1. C (Regularization): Similar to logistic regression, it controls the trade-off between maximizing the margin and minimizing classification error. Higher C values lead to a more complex model, potentially overfitting, while lower C values lead to a simpler model, potentially underfitting.
  2. Kernel: Type of kernel function ('linear', 'poly', 'rbf', etc.) that implicitly maps the data into a higher-dimensional space. The choice of kernel determines how well the model can handle non-linear decision boundaries.



For multiclass problems, a common approach is one-vs-rest: a separate SVM classifier is trained for each class, treating that class as positive and all other classes as negative (a sketch follows below).
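
As a hedged sketch of that one-vs-rest setup with scikit-learn (note that SVC on its own also handles multiclass data, internally using a one-vs-one scheme; the explicit wrapper below simply mirrors the description above, and the parameter values are illustrative):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # three-class toy dataset

# C trades margin width against training error; the RBF kernel handles non-linear boundaries
base_svm = SVC(C=1.0, kernel="rbf", gamma="scale")

# Train one binary SVM per class (that class vs. the rest)
ovr_clf = OneVsRestClassifier(base_svm)
ovr_clf.fit(X, y)

print("Number of per-class classifiers:", len(ovr_clf.estimators_))
print("Predictions for the first five samples:", ovr_clf.predict(X[:5]))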



Decision Trees

Decision Trees split the data into subsets based on the value of input features, leading to a tree-like model of decisions and their possible consequences.


Pruning is used in decision tree models to prevent overfitting and improve generalization. After a tree is fully grown (and has likely captured noise), branches are cut back: each candidate subtree is removed and replaced with a leaf node (the average value for regression, the mode for classification) whenever doing so reduces the cost function.

Parameters:

  1. Max Depth: Limits the number of levels in the tree. It helps control overfitting. Deeper trees can model more complex patterns but risk overfitting. Shallower trees might underfit.
  2. Min Samples Split: Minimum number of samples required to split an internal node.
  3. Min Samples Leaf: Minimum number of samples required to be at a leaf node.

Min samples split and leaf: Higher values can prevent the model from learning overly specific patterns (reducing overfitting), but values that are too high might cause underfitting. A short sketch of these parameters follows.
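
A minimal sketch of these parameters in scikit-learn; ccp_alpha additionally enables cost-complexity pruning of the grown tree. The dataset and parameter values are illustrative, not tuned.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# max_depth, min_samples_split and min_samples_leaf limit tree growth;
# ccp_alpha > 0 prunes subtrees that do not sufficiently reduce the cost function
tree = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    ccp_alpha=0.001,
    random_state=0,
)
tree.fit(X_train, y_train)

print("Tree depth after pruning:", tree.get_depth())
print("Test accuracy:", tree.score(X_test, y_test))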




Random Forest

Random Forest is an ensemble method that builds multiple decision trees and merges them to improve classification accuracy and control overfitting.


Parameters:

  1. Number of Trees (n_estimators): Number of trees in the forest. More trees generally improve the model’s performance but increase computational cost.
  2. Max Features: Number of features to consider for the best split. Using fewer features at each split can reduce overfitting but may increase bias.
  3. Max Depth: Limits the number of levels in the tree. It helps control overfitting. Deeper trees can model more complex patterns but risk overfitting. Shallower trees might underfit.
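
A short scikit-learn sketch with the three parameters above (the hyperparameter values are illustrative only):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# n_estimators = number of trees; max_features limits the features considered per split;
# max_depth caps the size of each tree
forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",
    max_depth=8,
    random_state=0,
)
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))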




Gradient Boosting Machines (GBM)

It begins by initializing the ensemble with a single model, typically a decision tree. Then, it iteratively adds new models that predict the residual errors made by the previous model. Models are added sequentially until a stopping criterion is met. By combining several weak models, prediction accuracy improves because each subsequent model corrects the errors of the models before it.

Key Parameters:

  1. Learning Rate: Determines the step size at each iteration while moving towards a minimum of the loss function. Lower learning rates require more boosting stages but can lead to better model performance and generalization.
  2. Number of Estimators: Number of boosting stages to be run. More stages can improve accuracy but may also lead to overfitting. Fewer stages might result in underfitting.
  3. Max Depth: Maximum depth of the individual trees in the ensemble. Controls the complexity of each tree. Shallower trees reduce overfitting but might underfit.
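
A hedged sketch using scikit-learn's GradientBoostingClassifier, with the three key parameters set explicitly (values are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# learning_rate shrinks each tree's contribution; n_estimators is the number of boosting stages;
# max_depth controls the complexity of each individual tree
gbm = GradientBoostingClassifier(
    learning_rate=0.05,
    n_estimators=300,
    max_depth=3,
    random_state=0,
)
gbm.fit(X_train, y_train)

print("Test accuracy:", gbm.score(X_test, y_test))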






XGBoost

XGBoost uses a more regularized model formulation to control overfitting, handles missing values natively, and achieves computational speedups. XGBoost can offer advantages on smaller datasets where regularization and feature importance insights matter more.


XGBoost has become an extremely popular gradient-boosting library known for its speed and performance. Some key advantages of XGBoost include:

  • High predictive accuracy due to regularization that reduces overfitting
  • Native support for parallel and distributed computing for fast model training
  • Good handling of sparse data
  • Model interpretability tools such as feature importance
  • Better scaling to extremely large datasets with hundreds of features or millions of examples
  • Although comparatively slower than LightGBM on GPU, XGBoost is faster on CPU
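
A minimal sketch using XGBoost's scikit-learn-style wrapper, assuming the xgboost package is installed; reg_lambda and reg_alpha are XGBoost's L2 and L1 regularization parameters, and all values are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# reg_lambda / reg_alpha add L2 / L1 penalties on leaf weights, part of XGBoost's
# more regularized formulation
xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=4,
    reg_lambda=1.0,
    reg_alpha=0.0,
)
xgb.fit(X_train, y_train)

print("Test accuracy:", xgb.score(X_test, y_test))
print("Feature importances (first five):", xgb.feature_importances_[:5])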


LightGBM

LightGBM uses Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) to filter out data instances and features that are less useful for improving the model. LightGBM excels on large datasets where training speed is critical and where the accuracy gains from techniques like GOSS shine through.

LightGBM offers faster training speed and lower memory usage than XGBoost while achieving competitive accuracy. Its key characteristics include:

  • Faster training on large, high-dimensional datasets, though prediction can be slower
  • Lower memory usage thanks to its unique leaf-wise tree growth algorithm
  • Support for both parallel learning and GPU learning
  • A good default choice for high-speed requirements and very large data scenarios (hundreds of millions of samples or features)
  • Lower interpretability, as the resulting model is complex
  • A leaf-wise splitting approach that can sometimes lead to overfitting, especially with small to medium-sized datasets
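
A corresponding sketch with LightGBM's scikit-learn wrapper, assuming the lightgbm package is installed; num_leaves is the main knob for its leaf-wise growth, and the values shown are illustrative.

from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# num_leaves controls the leaf-wise tree growth; keeping it modest helps limit the
# overfitting that leaf-wise splitting can cause on smaller datasets
lgbm = LGBMClassifier(
    n_estimators=300,
    learning_rate=0.05,
    num_leaves=31,
)
lgbm.fit(X_train, y_train)

print("Test accuracy:", lgbm.score(X_test, y_test))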


Both algorithms treat missing values by assigning them to the side that reduces loss the most in each split.




KNN

As the name suggests, this model calculates the dependent variable using values of the nearest neighbors (average for regression; mode for classification).


Parameters:

  • Number of Neighbors (k): The number of neighbors to use for classification. Using the elbow method, we can plot accuracy against the number of neighbors and pick the best-suited value.


  • Distance Metric: How distance between data points is calculated (e.g., Euclidean, Manhattan).
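
The sketch below runs a simple elbow-style search over k with cross-validation; the range of k values and the dataset are arbitrary choices for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Evaluate cross-validated accuracy for several k values and pick the best-suited one
for k in (1, 3, 5, 7, 9, 15):
    knn = make_pipeline(
        StandardScaler(),  # distance metrics are sensitive to feature scale
        KNeighborsClassifier(n_neighbors=k, metric="euclidean"),
    )
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"k={k}: cross-validated accuracy = {score:.3f}")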




Naive Bayes

Naive Bayes is based on Bayes' theorem and assumes that features are conditionally independent given the class label. Despite this strong independence assumption, it performs surprisingly well in practice, especially for text classification tasks.


Parameters:

Alpha (Laplace Smoothing): Smooths the probability estimates and handles zero probabilities for categorical data, which is especially useful when some feature values never occur with a given class in the training data.

This model is generally used for text classification and sentiment analysis.

  • Simple and easy to implement.
  • Requires a small amount of training data.
  • Efficient in terms of both time and space.
  • Performs well with high-dimensional data (e.g., images, text).
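
A minimal text-classification sketch with multinomial Naive Bayes; the tiny spam/not-spam corpus below is made up purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = spam, 0 = not spam
texts = [
    "win a free prize now", "limited offer click here",
    "meeting at noon tomorrow", "please review the attached report",
]
labels = [1, 1, 0, 0]

# alpha is the Laplace smoothing term that avoids zero probabilities
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

print(model.predict(["free prize tomorrow"]))  # predicted class label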




Model Evaluation Metrics


Confusion Matrix

A table showing the number of true positives, true negatives, false positives, and false negatives.



From the confusion matrix data, we can learn the following:

1. Accuracy

The proportion of correctly predicted instances out of the total instances.

Accuracy = (True Positives + True Negatives) / Total Predictions

It is useful when classes are balanced but can be misleading with imbalanced datasets.


2. Precision

The proportion of true positive predictions out of all positive predictions made by the model.

Precision = True Positives / (True Positives + False Positives)

It is important when the cost of false positives is high. For example, high precision ensures that legitimate emails are not misclassified as spam in spam detection.


3. Recall (Sensitivity or True Positive Rate)

The proportion of true positive predictions out of all actual positives.

Recall = True Positives / (True Positives + False Negatives)

It is important when the cost of false negatives is high. For example, high recall ensures that most patients with a condition are identified in medical diagnoses.


4. Specificity (True Negative Rate)

The proportion of true negative predictions out of all actual negatives.

Specificity = True Negatives / (True Negatives + False Positives)

It is important when the cost of false positives is significant.


5. F1 Score

The harmonic mean of precision and recall provides a single metric that balances both.

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Useful when balancing precision and recall, especially with imbalanced datasets.


AUC-ROC (Area Under the ROC Curve)

  • The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) across different cut-off thresholds. AUC-ROC captures the trade-off between sensitivity and specificity and remains informative for imbalanced datasets.
  • The area under the ROC curve represents the model’s ability to distinguish between classes.
  • Range: 0 to 1, where 1 indicates a perfect model and 0.5 indicates no discrimination ability.
  • Provides a summary of the model's performance across all thresholds.
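
The sketch below computes these metrics with scikit-learn on a held-out test set; the model and the synthetic, mildly imbalanced dataset are placeholders.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # probabilities for the positive class

print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
print("AUC-ROC:", roc_auc_score(y_test, y_prob))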


Accuracy is often the first metric we reach for, but we should also consider the other metrics best suited to the given problem.



Challenges in Classification

Overfitting occurs when a model learns the noise in the training data rather than the underlying pattern, resulting in poor performance on test data.

Underfitting happens when a model is too simple to capture the data's underlying trend, leading to poor training and test data performance.

Techniques like regularization, cross-validation, pruning, and early stopping can be used to mitigate overfitting.

Addressing underfitting may involve increasing model complexity, feature engineering, decreasing regularization, or extending training time.

Imbalanced datasets, where one class significantly outweighs another, can skew model performance by causing bias toward the majority class. Solutions include resampling techniques, adjusting class weights, generating synthetic data, and anomaly detection. One popular resampling approach is SMOTE: for a minority-class point, take the difference between it and one of its minority-class neighbors, multiply that difference by a random number between 0 and 1, and add the result to the original point to create a synthetic sample (a sketch follows below).
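
A hedged sketch of SMOTE using the imbalanced-learn package (assuming it is installed; the class ratio below is made up):

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Imbalanced toy data: roughly 95% majority class, 5% minority class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=7)
print("Before SMOTE:", Counter(y))

# SMOTE interpolates between minority samples and their minority-class neighbors
smote = SMOTE(random_state=7)
X_resampled, y_resampled = smote.fit_resample(X, y)
print("After SMOTE:", Counter(y_resampled))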


Feature selection improves model performance by focusing on relevant features and reducing overfitting, using methods like filter, wrapper, and embedded techniques.
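
As one example of the filter approach, here is a minimal sketch with univariate scoring in scikit-learn; the choice of k=10 is arbitrary.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Filter method: keep the 10 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print("Original feature count:", X.shape[1])
print("Selected feature count:", X_selected.shape[1])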


Feature engineering further enhances model accuracy by creating new features, handling missing values, and normalizing data.


Conclusion

In summary, we discussed the basic intuition behind each model and the parameters that affect the model's performance.

Classification models are crucial for organizing data into predefined categories, enhancing decision-making and operational efficiency across various domains.

From binary and multiclass classification tasks to specialized algorithms like logistic regression, SVM, decision trees, random forests, and gradient boosting machines, each method offers unique strengths and is suited to different types of problems and datasets.

Techniques like XGBoost and LightGBM further refine model performance, with XGBoost excelling in speed and accuracy and LightGBM offering impressive speed and efficiency for large datasets.

Key evaluation metrics such as accuracy, precision, recall, and the ROC curve are essential for assessing model performance and ensuring the models meet specific needs.

Addressing challenges like overfitting, underfitting, and imbalanced datasets through appropriate strategies can significantly enhance model effectiveness.

Ultimately, the choice of classification model and evaluation metrics should align with the problem and data characteristics, ensuring that the resulting insights drive more informed and effective business strategies.


Questions to think about:

What is the difference between classification and regression?

How do I choose the suitable classification model for my data?

Can classification models be used for unsupervised learning tasks?
