Income Prediction Challenge
Florence Mbabazi
Web Developer | Data Analyst | Advanced Excel | PowerBI | SQL | Data Visualization | Python | Machine Learning
Income inequality - when income is distributed unevenly among a population - is a growing problem in developing nations across the world. With the rapid rise of AI and worker automation, this problem could continue to grow if steps are not taken to address the issue.
This project follows the stages of the CRISP-DM framework: Business Understanding, Data Understanding, Data Preparation, Modeling, and Deployment.
In addition to following the framework as a reference, I will also address the following aspects:
Hugging Face App
Docker Image
FastAPI
Now let us begin with the project.
1. Business Understanding
The objective of this challenge is to build a machine-learning model that predicts whether an individual earns above or below a certain threshold (in this dataset, $50,000).
This solution can potentially reduce the cost and improve the accuracy of monitoring key population indicators such as income level between census years. This information will help policymakers better manage and reduce income inequality globally.
Objectives
2. Data Understanding
Data Collection
In this project, we identified and acquired relevant datasets containing information on income and related features. The dataset resides on the Zindi platform, and for ease of access and security, we have also made it available in the project's GitHub repository.
Before we explore the datasets, we first need to know what each column represents.
Column names and descriptions
age, Age Of Individual
gender, Gender
education, Education Level
class, Class Of Worker
education_institute, Enrolled In An Educational Institution In The Last Week
marital_status, Marital Status
race, Race
is_hispanic, Hispanic Origin
employment_commitment, Full- Or Part-Time Employment Status
unemployment_reason, Reason For Unemployment
employment_stat, Has Own Business Or Is Self-Employed
wage_per_hour, Wage Per Hour
is_labor_union, Member Of A Labor Union
working_week_per_year, Weeks Worked In A Year
industry_code, Industry Category Code
industry_code_main, Major Industry Code
occupation_code, Occupation Category Code
occupation_code_main, Major Occupation Code
total_employed, Number Of Persons Who Worked For The Employer
household_stat, Detailed Household And Family Status
household_summary, Detailed Household Summary
under_18_family, Family Members Under 18
veterans_admin_questionnaire, Filled In Income Questionnaire For Veterans Admin
vet_benefit, Veteran Benefits
tax_status, Tax Filer Status
gains, Capital Gains
losses, Capital Losses
stocks_status, Dividends From Stocks
citizenship, Citizenship
mig_year, Migration Year
country_of_birth_own, Individual's Birth Country
country_of_birth_father, Father's Birth Country
country_of_birth_mother, Mother's Birth Country
migration_code_change_in_msa, Migration Code: Change In MSA
migration_prev_sunbelt, Migration: Previous Residence In Sunbelt
migration_code_move_within_reg, Migration Code: Move Within Region
migration_code_change_in_reg, Migration Code: Change In Region
residence_1_year_ago, Lived In This House One Year Ago?
old_residence_reg, Region Of Previous Residence
old_residence_state, State Of Previous Residence
importance_of_record, Instance Weight
income_above_limit, Is Income Above $50K? (Prediction Target)
Exploratory Data Analysis
The following section presents the exploratory data analysis (EDA) conducted on the different datasets. We begin the analysis with the training dataset.
In this phase, we will explore the data by examining the dataset's structure to understand its features and variables. We will also identify missing values, outliers, and potential data quality issues. Additionally, descriptive statistics will be performed to gain initial insights into the data.
Libraries Used:
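The article's exact import cell isn't reproduced here; a typical setup for this kind of analysis (an assumption on my part, not the original list) looks like:

```python
# Assumed imports for the EDA; the article's exact library list is not shown
import pandas as pd              # tabular data handling
import numpy as np               # numerical operations
import matplotlib.pyplot as plt  # base plotting
import seaborn as sns            # statistical visualizations
```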
Data Overview and EDA
In this section, we will explore the data to see if there are any inconsistencies or irregularities that may be present.
Checking for Missing Values
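The notebook's check isn't reproduced here; a minimal sketch, assuming the training frame is loaded as train, would be:

```python
# Count missing values per column and show only the affected columns
missing = train.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))
```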
The results showed several columns with missing values.
These missing values will be handled in the data preparation stage.
Visualization
Here we generate a count plot to visualize the distribution of observations in the 'income_above_limit' variable.
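A minimal sketch of such a count plot, assuming the frame is named train (the same approach applies to the gender plot below):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Bar heights show how many rows fall into each income class
sns.countplot(data=train, x="income_above_limit")
plt.title("Distribution of income_above_limit")
plt.show()
```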
There is an imbalance in the data
The dataset is highly imbalanced, so we need to take steps to mitigate the imbalance. The following methods could be used:
Downsample the majority class (here, the majority class is 'Below limit')
Upsample the minority class (here, the minority class is 'Above limit')
A similar count plot visualizes the distribution of observations in the 'gender' variable.
Gender Variable
There are more females than males in the dataset
In this chart, we visualize the distribution of citizenship groups, and each wedge represents a different citizenship category with the corresponding percentage label.
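A sketch of how such a chart could be produced; the frame name and styling are assumptions:

```python
import matplotlib.pyplot as plt

# One wedge per citizenship category, labeled with its percentage share
counts = train["citizenship"].value_counts()
plt.pie(counts, labels=counts.index, autopct="%1.1f%%")
plt.title("Distribution of citizenship groups")
plt.show()
```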
Citizenship group variable
The majority of the workers are legal US citizens, while foreigners make up 5.9%.
Hypothesis
The goal of the statistical test is to examine the data and determine whether there is enough evidence to either reject the null hypothesis in favor of the alternative hypothesis or fail to reject the null hypothesis due to insufficient evidence.
Null Hypothesis (H0): There is no association between an individual's education level and the likelihood of earning above the income threshold.
Alternative Hypothesis (H1): Individuals with more education are significantly more likely to earn above the threshold.
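The article doesn't name the test used; a chi-square test of independence is the standard choice for two categorical variables, and a sketch might look like this:

```python
from scipy.stats import chi2_contingency
import pandas as pd

# Contingency table of education level against the income target
table = pd.crosstab(train["education"], train["income_above_limit"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.2e}")
```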
With a P-value close to zero, we reject the null hypothesis. This leads to the conclusion that there is a significant association between an individual's education level and the likelihood of earning above the specified income threshold. The evidence suggests that people with higher education levels are much more likely to have incomes above the specified threshold.
3. Data Preparation
Having thoroughly explored the data and gained deeper insight into it, we proceed to data preprocessing and cleaning. This step is essential to ensure the data is appropriately prepared for training, and it involves applying various techniques to make the data compatible with the model.
Handling Missing Values
As mentioned earlier in the article, there are missing values in the dataset.
This code performs missing-value imputation using the mode for the specified columns in both the training and test datasets. The mode is chosen for imputing missing values in categorical variables because it aligns with the nature of the data and is a meaningful way to handle missing values in this context.
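The affected columns aren't listed in the article; a sketch with hypothetical column names:

```python
# Hypothetical list of categorical columns that contain missing values
cols_to_impute = ["education_institute", "unemployment_reason", "is_labor_union"]

for col in cols_to_impute:
    mode_value = train[col].mode()[0]          # most frequent category in the training data
    train[col] = train[col].fillna(mode_value)
    test[col] = test[col].fillna(mode_value)   # reuse the training mode on the test set
```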
Re-checking Missing Values
After imputation, there are no remaining missing values.
Standardize and normalize features as necessary.
This code establishes a data preprocessing pipeline using scikit-learn. It uses OneHotEncoder to encode binary and nominal features separately: for binary features, a custom transformer drops one of the two one-hot columns; nominal features are one-hot encoded, ignoring unknown categories and producing a dense array. The entire preprocessing workflow is encapsulated in a ColumnTransformer named preprocessor.
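A sketch of such a preprocessor (scikit-learn >= 1.2 assumed for sparse_output; the column lists are hypothetical, and OneHotEncoder's drop='if_binary' stands in for the article's custom binary transformer):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column groupings; the article does not enumerate them
binary_cols = ["gender", "is_labor_union"]
nominal_cols = ["education", "marital_status", "race", "citizenship"]

preprocessor = ColumnTransformer(
    transformers=[
        # keeps a single column for two-category features
        ("binary", OneHotEncoder(drop="if_binary", sparse_output=False), binary_cols),
        # unseen categories at prediction time are ignored; output is a dense array
        ("nominal", OneHotEncoder(handle_unknown="ignore", sparse_output=False), nominal_cols),
    ],
    remainder="passthrough",  # numeric columns pass through for later scaling
)
```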
Class Imbalance
This code performs random oversampling on the training set using scikit-learn to address the class imbalance. After oversampling, it prints the balanced class distribution of the training set.
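sklearn.utils.resample is one way to do this in scikit-learn (imbalanced-learn's RandomOverSampler is a common alternative); a sketch with assumed variable names:

```python
import pandas as pd
from sklearn.utils import resample

# Keep features and target together so rows stay aligned while resampling
majority = train[train["income_above_limit"] == "Below limit"]
minority = train[train["income_above_limit"] == "Above limit"]

# Draw minority rows with replacement until both classes are the same size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
resampled = pd.concat([majority, minority_upsampled])
print(resampled["income_above_limit"].value_counts())
```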
Balancing the training data
Data splitting into training and evaluation sets
In this code, we split the resampled data into training and evaluation sets using train_test_split with a specified test size and random state.
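A sketch of the split; the exact test size and random state are assumptions:

```python
from sklearn.model_selection import train_test_split

# Separate features and target from the resampled frame, then hold out 20%
X_resampled = resampled.drop(columns="income_above_limit")
y_resampled = resampled["income_above_limit"]
X_train, X_eval, y_train, y_eval = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled
)
```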
4. Modeling
Model Selection
Choose appropriate machine learning algorithms for binary classification.
Consider algorithms such as RandomForestClassifier, CatBoostClassifier, XGBClassifier, and LGBMClassifier
We define a list of classification models (RandomForestClassifier, CatBoostClassifier, XGBClassifier, and LGBMClassifier) and an empty list to store evaluation metrics. We then iterate over the models, creating a pipeline that includes preprocessing, scaling, and the current model. Each pipeline is fitted on the training data, and predictions are made on the evaluation set. Metrics such as accuracy, F1 score, ROC AUC score, precision, and recall are calculated and stored in the 'metrics' list for each model.
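A sketch of that loop, assuming the preprocessor defined earlier and a target that has been label-encoded to 0/1 beforehand; the hyperparameters are assumptions:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             precision_score, recall_score)
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

models = [
    RandomForestClassifier(random_state=42),
    CatBoostClassifier(verbose=0, random_state=42),
    XGBClassifier(random_state=42),
    LGBMClassifier(random_state=42),
]

metrics = []
for model in models:
    # Each pipeline chains the shared preprocessing, scaling, and one model
    pipe = Pipeline([
        ("preprocess", preprocessor),
        ("scale", StandardScaler()),
        ("model", model),
    ])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_eval)
    metrics.append({
        "model": type(model).__name__,
        "accuracy": accuracy_score(y_eval, y_pred),
        "f1": f1_score(y_eval, y_pred),
        "roc_auc": roc_auc_score(y_eval, pipe.predict_proba(X_eval)[:, 1]),
        "precision": precision_score(y_eval, y_pred),
        "recall": recall_score(y_eval, y_pred),
    })
```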
Results
The Random Forest Classifier was the best model after evaluation.
Here is the confusion matrix for the Random Forest Classifier.
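A sketch of how the matrix can be rendered from the fitted pipeline (variable names assumed):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# best_pipeline: the fitted Random Forest pipeline from the loop above
ConfusionMatrixDisplay.from_estimator(best_pipeline, X_eval, y_eval)
plt.title("Random Forest confusion matrix")
plt.show()
```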
5. Deployment
I then created an app using Streamlit, which I deployed on Hugging Face.
Check out the project's GitHub repository.
You can visit my Hugging Face profile.
You can check out the FastAPI service.
I also used Docker to deploy the app; a Docker image is available as well.
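To give a flavor of the FastAPI service, here is a minimal sketch of a prediction endpoint; the route, model path, and response format are assumptions, not the project's actual code:

```python
import joblib
import pandas as pd
from fastapi import FastAPI

app = FastAPI(title="Income Prediction API")
pipeline = joblib.load("model_pipeline.joblib")  # hypothetical saved pipeline

@app.post("/predict")
def predict(record: dict):
    # Wrap the incoming JSON record in a one-row frame for the pipeline
    X = pd.DataFrame([record])
    pred = int(pipeline.predict(X)[0])
    return {"income_above_limit": "Above limit" if pred == 1 else "Below limit"}
```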