SAS Viya for Learners Challenge 2024: A Deep Dive into My Approach
Tayalarajan Ramanujadurai
I'm excited to share that I emerged as one of the winners of the SAS Viya for Learners Challenge 2024! You can find my notebook file here on GitHub. This competition was a fantastic learning experience, and in this article, I'll walk you through my journey, the approach I took, and the steps that led to my success.
Challenge Overview
In June 2024, I participated in the SAS Viya for Learners Challenge, a five-day competition designed to explore the AI lifecycle using SAS, Python, or R. Hosted through Kaggle InClass, this challenge was a hands-on opportunity for students to work through real-world machine learning problems.
The task was to build a machine learning model using the Home Equity Loans dataset (hmeq) to predict which individuals might default on a loan. The dataset's features include loan amounts, existing mortgage amounts, and delinquency status. The challenge covered the full data science process, from data preparation and exploration to model building and deployment.
Step 1: Data Exploration and Preprocessing
The first step was to explore the data and understand its structure. The hmeq_train.csv file contained the training dataset, while hmeq_test.csv served as the test dataset. Here's an overview of the steps I took during data exploration and preprocessing:
import pandas as pd
# Load the training data
train_df = pd.read_csv('hmeq_train.csv')
# Display the first few rows of the dataset
train_df.head()
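Beyond looking at the first rows, exploration usually means checking for missing values, column types, and class balance. The following is a minimal sketch of those checks, not the code from the notebook. It runs on a small made-up DataFrame (`toy_df`) whose column names and values are assumptions for illustration only, apart from `default`, which is the target name used later in this article.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hmeq_train.csv; columns and values are illustrative only
toy_df = pd.DataFrame({
    "loan": [1100, 1300, 1500, 1700, 2000],
    "mortdue": [25860.0, np.nan, 13500.0, 97800.0, np.nan],
    "job": ["Other", "Office", None, "Mgr", "Other"],
    "default": [1, 0, 0, 1, 0],
})

# Count missing values per column, most affected columns first
missing_counts = toy_df.isna().sum().sort_values(ascending=False)

# Separate numeric and categorical columns, since they need different treatment
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

# Share of each class in the target, to check for imbalance
class_share = toy_df["default"].value_counts(normalize=True)

print(missing_counts)
print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
print(class_share)
```

Running the same three checks on the real training file would show which columns need imputation and how rare the default class is.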
Step 2: Data Cleaning and Feature Engineering
Upon inspecting the dataset, I noticed missing values and the presence of categorical variables. Cleaning the data and engineering new features were necessary steps to improve model performance.
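# Fill missing values in numeric columns with each column's median
# (numeric_only=True is required in pandas 2.x when string columns are present)
train_df.fillna(train_df.median(numeric_only=True), inplace=True)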
# Convert categorical variables to numeric using one-hot encoding
train_df = pd.get_dummies(train_df, drop_first=True)
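One way to keep this cleaning step reusable is to compute the medians once and pass them into a small function. The sketch below shows that idea on toy data. The helper names `fit_medians` and `clean`, and the toy columns, are hypothetical and do not come from the notebook.

```python
import numpy as np
import pandas as pd


def fit_medians(df):
    """Return the median of each numeric column in df."""
    return df.median(numeric_only=True)


def clean(df, medians):
    """Fill numeric gaps with the given medians, then one-hot encode text columns."""
    filled = df.fillna(medians)
    return pd.get_dummies(filled, drop_first=True)


# Toy data with one missing number and one missing category (illustrative only)
toy_train = pd.DataFrame({
    "loan": [1000.0, 2000.0, np.nan, 4000.0],
    "job": ["Mgr", "Office", "Other", None],
})

medians = fit_medians(toy_train)
cleaned = clean(toy_train, medians)
print(cleaned)
```

Note that a missing category is not imputed here: `get_dummies` simply leaves every indicator column at zero for that row. Whether that is acceptable depends on the data, so treat it as a design choice rather than a default.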
Step 3: Building the Machine Learning Model
With the data cleaned and preprocessed, I experimented with various machine learning models, including logistic regression, random forests, and gradient boosting.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Split the data into training and validation sets
X = train_df.drop('default', axis=1)
y = train_df['default']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Make predictions on the validation set and report accuracy
y_pred = rf_model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print(f"Validation accuracy: {accuracy:.4f}")
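The comparison between logistic regression, random forests, and gradient boosting mentioned above could look roughly like the sketch below. It is an illustration under assumptions, not the notebook's code: it uses synthetic imbalanced data from `make_classification` instead of the hmeq dataset, and it scores with F1 rather than accuracy, which is a substitution made here because accuracy can look good on imbalanced data even when few positives are caught.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in data (roughly 80/20), not the hmeq dataset
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=42
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Mean 5-fold cross-validated F1 score for each candidate model
mean_f1 = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}

# Print the candidates from best to worst
for name, score in sorted(mean_f1.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Cross-validation gives each model several train/validation splits instead of one, so the ranking is less sensitive to a lucky split.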
Step 4: Model Evaluation and Hyperparameter Tuning
To further enhance model performance, I performed hyperparameter tuning using grid search to find the optimal combination of parameters.
from sklearn.model_selection import GridSearchCV
# Define the parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10]
}
# Perform grid search
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
# Best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
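After a grid search it helps to look at every combination, not only the winner. The sketch below shows one way to do that through `cv_results_`. It assumes synthetic data and a deliberately small grid so it runs quickly, and it sets `scoring="f1"` as an illustrative choice that the original search does not specify.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data, not the hmeq dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Deliberately small grid (2 x 2 = 4 combinations) to keep the example fast
small_grid = {"n_estimators": [20, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=3,
    scoring="f1",
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results)
print("Best parameters:", search.best_params_)
print("Best mean CV score:", round(search.best_score_, 3))
```

If the top few rows have nearly identical mean scores and overlapping standard deviations, the simpler or faster configuration is often a reasonable pick.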
Step 5: Making Predictions and Submission
Finally, I used the tuned model to make predictions on the test dataset and prepared the submission file for the Kaggle competition.
# Load the test data
test_df = pd.read_csv('hmeq_test.csv')
# Apply the same preprocessing to the test data, reusing the training medians
test_df.fillna(train_df.median(numeric_only=True), inplace=True)
test_df = pd.get_dummies(test_df, drop_first=True)
# Align test data with training data
X_test = test_df.reindex(columns=X.columns, fill_value=0)
# Make predictions on the test data
test_predictions = best_model.predict(X_test)
# Create the submission file
submission = pd.DataFrame({'id': test_df['id'], 'default': test_predictions})
submission.to_csv('submission.csv', index=False)
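The `reindex` alignment step deserves a closer look, because one-hot encoding the test set separately can produce different columns than the training set. The toy example below illustrates one subtlety worth checking: with `drop_first=True`, the category that gets dropped depends on which categories are present, so a test file with a different set of categories may lose a column it needs. The data and the `job` categories are made up for illustration.

```python
import pandas as pd

# Toy frames: the test set lacks "Mgr" and "Other" and has an unseen "Sales"
train_raw = pd.DataFrame({"loan": [1000, 2000, 3000], "job": ["Mgr", "Office", "Other"]})
test_raw = pd.DataFrame({"loan": [1500, 2500], "job": ["Office", "Sales"]})

train_encoded = pd.get_dummies(train_raw, drop_first=True)

# Naive approach: drop_first on the test set drops "Office" here,
# so the job_Office indicator is lost and refilled with zeros
naive_test = pd.get_dummies(test_raw, drop_first=True).reindex(
    columns=train_encoded.columns, fill_value=0
)

# Safer variant: keep every test category, then align to the training columns
aligned_test = pd.get_dummies(test_raw).reindex(
    columns=train_encoded.columns, fill_value=0
)

print(naive_test)
print(aligned_test)
```

In both variants the unseen `Sales` category is discarded, because the model never saw a column for it. When the test file contains the same categories as the training file, the two variants give the same result.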
Final Results and Model Evaluation
Confusion Matrix:
[[1242    8]
 [  67  243]]
Classification Report:
              precision    recall  f1-score   support
           0       0.95      0.99      0.97      1250
           1       0.97      0.78      0.87       310
    accuracy                           0.95      1560
   macro avg       0.96      0.89      0.92      1560
weighted avg       0.95      0.95      0.95      1560
Outcome: These results indicate that the model was accurate both during validation and on the test set, reaching 95% accuracy on the 1,560 rows reported above. Performance was strongest on the non-default class (precision 0.95, recall 0.99). For the default class, precision was 0.97 but recall was 0.78: the model missed 67 of the 310 actual defaults, so it is better at confirming a default than at catching every one. The consistency across cross-validation scores further supports the robustness of the approach.
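The figures in the classification report can be reproduced directly from the confusion matrix. The short calculation below does that for the default class (class 1), assuming the usual scikit-learn layout where rows are actual classes and columns are predicted classes. That assumption agrees with the support counts, since the rows sum to 1250 and 310.

```python
import numpy as np

# Confusion matrix as reported above: rows = actual class, columns = predicted class
cm = np.array([[1242, 8],
               [67, 243]])

# For a 2x2 matrix, ravel() yields: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_default = tp / (tp + fp)   # of predicted defaults, how many were real
recall_default = tp / (tp + fn)      # of real defaults, how many were caught
f1_default = 2 * precision_default * recall_default / (precision_default + recall_default)

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision_default:.2f}")
print(f"recall:    {recall_default:.2f}")
print(f"f1-score:  {f1_default:.2f}")
print(f"missed defaults: {fn} of {tp + fn}")
```

In a lending context, the 67 missed defaults are typically the costlier error, so lowering the decision threshold to trade some precision for recall may be worth exploring.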
Conclusion
Participating in the SAS Viya for Learners Challenge was a fantastic learning experience. It allowed me to apply my machine learning skills to a real-world problem, from data exploration and preprocessing to model building and evaluation. The key takeaway was the importance of systematic data preparation, model selection, and hyperparameter tuning in achieving optimal results.
For those interested in a deeper dive into my code and approach, you can find my notebook here on GitHub. Feel free to reach out if you have any questions or would like to discuss the challenge further!