SAS Viya for Learners Challenge 2024: A Deep Dive into My Approach

I'm excited to share that I emerged as one of the winners of the SAS Viya for Learners Challenge 2024! You can find my notebook file here on GitHub. This competition was a fantastic learning experience, and in this article, I'll walk you through my journey, the approach I took, and the steps that led to my success.

Challenge Overview

In June 2024, I participated in the SAS Viya for Learners Challenge, a five-day competition designed to explore the AI lifecycle using SAS, Python, or R. Hosted through Kaggle InClass, this challenge was a hands-on opportunity for students to work through real-world machine learning problems.

The task was to build a machine learning model on the Home Equity Loans dataset (hmeq) to predict which individuals might default on a loan. The dataset included features such as loan amount, existing mortgage amount, and delinquency status. The challenge involved walking through the full data science process, from data preparation and exploration to model building and deployment.

Step 1: Data Exploration and Preprocessing

The first step was to explore the data and understand its structure. The hmeq_train.csv file contained the training dataset, while hmeq_test.csv served as the test dataset. Here's an overview of the steps I took during data exploration and preprocessing:

import pandas as pd

# Load the training data
train_df = pd.read_csv('hmeq_train.csv')

# Display the first few rows of the dataset
train_df.head()        

Reasoning:

  • Exploration: I started by loading the dataset into a pandas DataFrame to inspect its structure and contents. This initial look helped identify the feature types, the number of missing values, and potential outliers (a quick-look sketch follows below).
  • Outcome: This step was crucial for forming an understanding of the data and devising a preprocessing strategy.
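The quick look mentioned above boils down to a few standard pandas calls; a minimal sketch (generic checks, not an excerpt from my notebook):

# Column dtypes and non-null counts at a glance
train_df.info()

# Missing values per column, most affected first
print(train_df.isna().sum().sort_values(ascending=False))

# Summary statistics help surface suspicious ranges and potential outliers
print(train_df.describe())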

Step 2: Data Cleaning and Feature Engineering

Upon inspecting the dataset, I noticed missing values and the presence of categorical variables. Cleaning the data and engineering new features were necessary steps to improve model performance.

# Compute the medians first so they can be reused on the test set later
train_medians = train_df.median(numeric_only=True)

# Fill missing values in numerical columns with the median
train_df = train_df.fillna(train_medians)

# Convert categorical variables to numeric using one-hot encoding
train_df = pd.get_dummies(train_df, drop_first=True)

Reasoning:

  • Missing Values: I used the median to fill missing values in numerical columns, since the median is less sensitive to outliers than the mean (categorical gaps need separate handling; see the sketch below).
  • Categorical Encoding: One-hot encoding converted categorical variables into numeric form, making them usable by machine learning models.
  • Outcome: This preprocessing ensured the dataset was clean and ready for model training, reducing noise and improving feature quality.
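One subtlety worth flagging: the median fill only touches numeric columns, so missing values in categorical columns (hmeq has a few, such as REASON and JOB) survive into the encoding step. A minimal sketch of one way to close that gap, slotted between the median fill and get_dummies above (an illustrative pattern, not necessarily the exact code from my notebook):

# Fill categorical gaps with each column's most frequent value (mode)
cat_cols = train_df.select_dtypes(include='object').columns
for col in cat_cols:
    train_df[col] = train_df[col].fillna(train_df[col].mode().iloc[0])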

Step 3: Building the Machine Learning Model

With the data cleaned and preprocessed, I experimented with various machine learning models, including logistic regression, random forests, and gradient boosting.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the data into training and validation sets
X = train_df.drop('default', axis=1)
y = train_df['default']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions on the validation set and measure accuracy
y_pred = rf_model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print(f"Validation accuracy: {accuracy:.4f}")

Reasoning:

  • Model Selection: I selected the Random Forest classifier for its ability to handle heterogeneous features, its relative robustness to overfitting, and the feature-importance insights it provides.
  • Validation: An 80-20 train-validation split let me evaluate the model before touching the test dataset.
  • Outcome: The Random Forest performed well, achieving a high accuracy score on the validation set, though accuracy is only part of the picture with imbalanced classes; see the evaluation sketch below.
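Because defaults are the minority class, I'd also check ROC AUC and per-class precision and recall; a minimal sketch using the variables defined above:

from sklearn.metrics import roc_auc_score, classification_report

# Predicted probability of the positive (default) class, needed for ROC AUC
val_proba = rf_model.predict_proba(X_val)[:, 1]
print("ROC AUC:", roc_auc_score(y_val, val_proba))

# Per-class precision and recall are more informative than accuracy here
print(classification_report(y_val, y_pred))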

Step 4: Model Evaluation and Hyperparameter Tuning

To further enhance model performance, I performed hyperparameter tuning using grid search to find the optimal combination of parameters.

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_        

Reasoning:

  • Hyperparameter Tuning: Grid search systematically explored parameter combinations to find the best-performing model; the sketch below shows how I sanity-checked the winner.
  • Outcome: Tuning improved the model's predictive accuracy on validation data and, ultimately, on the test set.
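To sanity-check the tuned model, it helps to inspect the winning configuration and compare it against the untuned baseline; a short sketch using the objects defined above:

# Inspect the winning hyperparameter combination and its cross-validated score
print("Best parameters:", best_params)
print("Best CV score:", grid_search.best_score_)

# Compare tuned vs. baseline accuracy on the held-out validation set
tuned_accuracy = accuracy_score(y_val, best_model.predict(X_val))
print(f"Baseline: {accuracy:.4f}  Tuned: {tuned_accuracy:.4f}")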

Step 5: Making Predictions and Submission

Finally, I used the tuned model to make predictions on the test dataset and prepared the submission file for the Kaggle competition.

# Load the test data
test_df = pd.read_csv('hmeq_test.csv')

# Apply the same preprocessing, reusing the medians computed on the training data
test_df = test_df.fillna(train_medians)
test_df = pd.get_dummies(test_df, drop_first=True)

# Align test columns with the training feature matrix
X_test = test_df.reindex(columns=X.columns, fill_value=0)

# Make predictions on the test data
test_predictions = best_model.predict(X_test)

# Create the submission file
submission = pd.DataFrame({'id': test_df['id'], 'default': test_predictions})
submission.to_csv('submission.csv', index=False)

Reasoning:

  • Consistency: I applied the same preprocessing to the test dataset, reusing the training medians and aligning columns, to keep train and test consistent (see the refactor sketch below).
  • Submission: Predictions were formatted to the competition requirements and submitted to the leaderboard.
  • Outcome: My model achieved a high accuracy score, securing me a spot among the winners of the competition!
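A pattern worth adopting, shown here as a hypothetical refactor rather than the exact code from my notebook: wrapping the preprocessing in a single function guarantees that train and test go through identical transformations:

# Hypothetical helper: one function applied to both splits guards
# against train/test preprocessing drift.
def preprocess(df, medians, columns=None):
    df = df.fillna(medians)                   # impute with *training* medians
    df = pd.get_dummies(df, drop_first=True)  # identical encoding scheme
    if columns is not None:                   # align to the training layout
        df = df.reindex(columns=columns, fill_value=0)
    return df

With this helper, the test set would be produced by preprocess(test_df, train_medians, columns=X.columns), where test_df is the freshly loaded CSV.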

Final Results and Model Evaluation

  • Final Accuracy on Test Set: 0.95931
  • Stacking Classifier Accuracy: 0.9519230769230769

Confusion Matrix:

[[1242    8]
 [  67  243]]        

Classification Report:

              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1250
           1       0.97      0.78      0.87       310

    accuracy                           0.95      1560
   macro avg       0.96      0.89      0.92      1560
weighted avg       0.95      0.95      0.95      1560

  • Cross-Validation Scores: [0.95192308, 0.97019231, 0.95865385, 0.95668912, 0.96246391] (a sketch of how such scores are computed follows below)
  • Mean CV Score: 0.959984452506108
  • Standard Deviation of CV Score: 0.006131288340315996
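Scores like these come from k-fold cross-validation; a minimal sketch of how they can be computed, assuming the tuned model and the full training matrix from earlier (this may differ from my exact notebook code):

from sklearn.model_selection import cross_val_score

# Five-fold cross-validation of the tuned model on the full training data
cv_scores = cross_val_score(best_model, X, y, cv=5)
print("Scores:", cv_scores)
print(f"Mean: {cv_scores.mean():.4f}  Std: {cv_scores.std():.4f}")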

Outcome: These results show the model was accurate during both validation and testing. Precision is high for both classes and recall is near-perfect for non-defaults; default recall of 0.78 marks the remaining difficulty, with some defaulters still missed. The tight spread of the cross-validation scores confirms the robustness of the approach.
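The stacking accuracy reported above comes from an ensemble the walkthrough doesn't show. For readers curious how one is assembled, here is a minimal scikit-learn sketch; the base learners are illustrative assumptions, not necessarily the ones in my notebook:

from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Illustrative stacking ensemble: out-of-fold predictions from the base
# learners feed a logistic-regression meta-learner (cv=5 sets the folds)
stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
print("Stacking validation accuracy:", stack.score(X_val, y_val))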

Conclusion

Participating in the SAS Viya for Learners Challenge was a fantastic learning experience. It allowed me to apply my machine learning skills to a real-world problem, from data exploration and preprocessing to model building and evaluation. The key takeaway was the importance of systematic data preparation, model selection, and hyperparameter tuning in achieving optimal results.

For those interested in a deeper dive into my code and approach, you can find my notebook here on GitHub. Feel free to reach out if you have any questions or would like to discuss the challenge further!
