SAS Viya for Learners Challenge 2024: A Deep Dive into My Approach

I'm excited to share that I emerged as one of the winners of the SAS Viya for Learners Challenge 2024! You can find my notebook file here on GitHub. This competition was a fantastic learning experience, and in this article, I'll walk you through my journey, the approach I took, and the steps that led to my success.

Challenge Overview

In June 2024, I participated in the SAS Viya for Learners Challenge, a five-day competition designed to explore the AI lifecycle using SAS, Python, or R. Hosted through Kaggle InClass, this challenge was a hands-on opportunity for students to work through real-world machine learning problems.

The task was to build a machine learning model on the Home Equity Loans dataset (hmeq) to predict which individuals might default on a loan. The dataset included features such as loan amount, existing mortgage amount, and delinquency status. The challenge involved walking through the full data science process, from data preparation and exploration to model building and deployment.

Step 1: Data Exploration and Preprocessing

The first step was to explore the data and understand its structure. The hmeq_train.csv file contained the training dataset, while hmeq_test.csv served as the test dataset. Here's an overview of the steps I took during data exploration and preprocessing:

import pandas as pd

# Load the training data
train_df = pd.read_csv('hmeq_train.csv')

# Display the first few rows of the dataset
train_df.head()        

Reasoning:

  • Exploration: I started by loading the dataset into a pandas DataFrame to inspect its structure and contents. This initial look helped identify the feature types, the number of missing values, and potential outliers (a quick-look sketch follows below).
  • Outcome: This step was crucial for forming an understanding of the data and devising a preprocessing strategy.
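The quick look mentioned above boils down to a few standard pandas calls; a minimal sketch (generic checks, not an excerpt from my notebook):

# Column dtypes and non-null counts at a glance
train_df.info()

# Missing values per column, most affected first
print(train_df.isna().sum().sort_values(ascending=False))

# Summary statistics help surface suspicious ranges and potential outliers
print(train_df.describe())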

Step 2: Data Cleaning and Feature Engineering

Upon inspecting the dataset, I noticed missing values and the presence of categorical variables. Cleaning the data and engineering new features were necessary steps to improve model performance.

# Compute the medians first so they can be reused on the test set later
train_medians = train_df.median(numeric_only=True)

# Fill missing values in numerical columns with the median
train_df = train_df.fillna(train_medians)

# Convert categorical variables to numeric using one-hot encoding
train_df = pd.get_dummies(train_df, drop_first=True)

Reasoning:

  • Missing Values: I used the median to fill missing values in numerical columns, since the median is less sensitive to outliers than the mean (categorical gaps need separate handling; see the sketch below).
  • Categorical Encoding: One-hot encoding converted categorical variables into numeric form, making them usable by machine learning models.
  • Outcome: This preprocessing ensured the dataset was clean and ready for model training, reducing noise and improving feature quality.
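One subtlety worth flagging: the median fill only touches numeric columns, so missing values in categorical columns (hmeq has a few, such as REASON and JOB) survive into the encoding step. A minimal sketch of one way to close that gap, slotted between the median fill and get_dummies above (an illustrative pattern, not necessarily the exact code from my notebook):

# Fill categorical gaps with each column's most frequent value (mode)
cat_cols = train_df.select_dtypes(include='object').columns
for col in cat_cols:
    train_df[col] = train_df[col].fillna(train_df[col].mode().iloc[0])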

Step 3: Building the Machine Learning Model

With the data cleaned and preprocessed, I experimented with various machine learning models, including logistic regression, random forests, and gradient boosting.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the data into training and validation sets
X = train_df.drop('default', axis=1)
y = train_df['default']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions on the validation set and measure accuracy
y_pred = rf_model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print(f"Validation accuracy: {accuracy:.4f}")

Reasoning:

  • Model Selection: I selected the Random Forest classifier for its ability to handle heterogeneous features, its relative robustness to overfitting, and the feature-importance insights it provides.
  • Validation: An 80-20 train-validation split let me evaluate the model before touching the test dataset.
  • Outcome: The Random Forest performed well, achieving a high accuracy score on the validation set, though accuracy is only part of the picture with imbalanced classes; see the evaluation sketch below.
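Because defaults are the minority class, I'd also check ROC AUC and per-class precision and recall; a minimal sketch using the variables defined above:

from sklearn.metrics import roc_auc_score, classification_report

# Predicted probability of the positive (default) class, needed for ROC AUC
val_proba = rf_model.predict_proba(X_val)[:, 1]
print("ROC AUC:", roc_auc_score(y_val, val_proba))

# Per-class precision and recall are more informative than accuracy here
print(classification_report(y_val, y_pred))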

Step 4: Model Evaluation and Hyperparameter Tuning

To further enhance model performance, I performed hyperparameter tuning using grid search to find the optimal combination of parameters.

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_        

Reasoning:

  • Hyperparameter Tuning: Grid search systematically explored parameter combinations to find the best-performing model; the sketch below shows how I sanity-checked the winner.
  • Outcome: Tuning improved the model's predictive accuracy on validation data and, ultimately, on the test set.
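To sanity-check the tuned model, it helps to inspect the winning configuration and compare it against the untuned baseline; a short sketch using the objects defined above:

# Inspect the winning hyperparameter combination and its cross-validated score
print("Best parameters:", best_params)
print("Best CV score:", grid_search.best_score_)

# Compare tuned vs. baseline accuracy on the held-out validation set
tuned_accuracy = accuracy_score(y_val, best_model.predict(X_val))
print(f"Baseline: {accuracy:.4f}  Tuned: {tuned_accuracy:.4f}")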

Step 5: Making Predictions and Submission

Finally, I used the tuned model to make predictions on the test dataset and prepared the submission file for the Kaggle competition.

# Load the test data
test_df = pd.read_csv('hmeq_test.csv')

# Apply the same preprocessing, reusing the medians computed on the training data
test_df = test_df.fillna(train_medians)
test_df = pd.get_dummies(test_df, drop_first=True)

# Align test columns with the training feature matrix
X_test = test_df.reindex(columns=X.columns, fill_value=0)

# Make predictions on the test data
test_predictions = best_model.predict(X_test)

# Create the submission file
submission = pd.DataFrame({'id': test_df['id'], 'default': test_predictions})
submission.to_csv('submission.csv', index=False)

Reasoning:

  • Consistency: I applied the same preprocessing to the test dataset, reusing the training medians and aligning columns, to keep train and test consistent (see the refactor sketch below).
  • Submission: Predictions were formatted to the competition requirements and submitted to the leaderboard.
  • Outcome: My model achieved a high accuracy score, securing me a spot among the winners of the competition!
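A pattern worth adopting, shown here as a hypothetical refactor rather than the exact code from my notebook: wrapping the preprocessing in a single function guarantees that train and test go through identical transformations:

# Hypothetical helper: one function applied to both splits guards
# against train/test preprocessing drift.
def preprocess(df, medians, columns=None):
    df = df.fillna(medians)                   # impute with *training* medians
    df = pd.get_dummies(df, drop_first=True)  # identical encoding scheme
    if columns is not None:                   # align to the training layout
        df = df.reindex(columns=columns, fill_value=0)
    return df

With this helper, the test set would be produced by preprocess(test_df, train_medians, columns=X.columns), where test_df is the freshly loaded CSV.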

Final Results and Model Evaluation

  • Final Accuracy on Test Set: 0.95931
  • Stacking Classifier Accuracy: 0.9519230769230769

Confusion Matrix:

[[1242    8]
 [  67  243]]        

Classification Report:

              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1250
           1       0.97      0.78      0.87       310

    accuracy                           0.95      1560
   macro avg       0.96      0.89      0.92      1560
weighted avg       0.95      0.95      0.95      1560

  • Cross-Validation Scores: [0.95192308, 0.97019231, 0.95865385, 0.95668912, 0.96246391] (a sketch of how such scores are computed follows below)
  • Mean CV Score: 0.959984452506108
  • Standard Deviation of CV Score: 0.006131288340315996
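Scores like these come from k-fold cross-validation; a minimal sketch of how they can be computed, assuming the tuned model and the full training matrix from earlier (this may differ from my exact notebook code):

from sklearn.model_selection import cross_val_score

# Five-fold cross-validation of the tuned model on the full training data
cv_scores = cross_val_score(best_model, X, y, cv=5)
print("Scores:", cv_scores)
print(f"Mean: {cv_scores.mean():.4f}  Std: {cv_scores.std():.4f}")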

Outcome: These results show the model was accurate during both validation and testing. Precision is high for both classes and recall is near-perfect for non-defaults; default recall of 0.78 marks the remaining difficulty, with some defaulters still missed. The tight spread of the cross-validation scores confirms the robustness of the approach.
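The stacking accuracy reported above comes from an ensemble the walkthrough doesn't show. For readers curious how one is assembled, here is a minimal scikit-learn sketch; the base learners are illustrative assumptions, not necessarily the ones in my notebook:

from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Illustrative stacking ensemble: out-of-fold predictions from the base
# learners feed a logistic-regression meta-learner (cv=5 sets the folds)
stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
print("Stacking validation accuracy:", stack.score(X_val, y_val))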

Conclusion

Participating in the SAS Viya for Learners Challenge was a fantastic learning experience. It allowed me to apply my machine learning skills to a real-world problem, from data exploration and preprocessing to model building and evaluation. The key takeaway was the importance of systematic data preparation, model selection, and hyperparameter tuning in achieving optimal results.

For those interested in a deeper dive into my code and approach, you can find my notebook here on GitHub. Feel free to reach out if you have any questions or would like to discuss the challenge further!
