Income Prediction Challenge
Florence Mbabazi
Web Developer | Data Analyst | Advanced Excel | PowerBI | SQL | Data Visualization | Python | Machine Learning
Income inequality - when income is distributed unevenly among a population - is a growing problem in developing nations across the world. With the rapid rise of AI and worker automation, this problem could continue to grow if steps are not taken to address the issue.
This project follows the stages of the CRISP-DM framework: Business Understanding, Data Understanding, Data Preparation, Modeling, and Deployment.
In addition to following the framework as a reference, I will also address the following aspects:
Hugging Face App
Docker Image
FastAPI
Now let us begin with the project.
1. Business Understanding
The objective of this challenge is to build a machine-learning model that predicts whether an individual earns above or below a certain threshold (in this dataset, $50,000).
This solution can potentially reduce the cost and improve the accuracy of monitoring key population indicators such as income level between census years. This information will help policymakers better manage and reduce income inequality globally.
Objectives
2. Data Understanding
Data Collection
In this project, we identified and acquired relevant datasets containing information on income and related features. The dataset resides on the Zindi platform, and for ease of access and security, we have also made it available in the project's GitHub repository.
Before we explore the datasets, we first need to know what each column represents.
Column names and descriptions
age, Age Of Individual
gender, Gender
education, Education Level
class, Class Of Worker
education_institute, Enrolled In An Educational Institution In The Last Week
marital_status, Marital Status
race, Race
is_hispanic, Hispanic Origin
employment_commitment, Full- Or Part-Time Employment Status
unemployment_reason, Reason For Unemployment
employment_stat, Has Own Business Or Is Self-Employed
wage_per_hour, Wage Per Hour
is_labor_union, Member Of A Labor Union
working_week_per_year, Weeks Worked In A Year
industry_code, Industry Category Code
industry_code_main, Major Industry Code
occupation_code, Occupation Category Code
occupation_code_main, Major Occupation Code
total_employed, Number Of Persons Who Worked For The Employer
household_stat, Detailed Household And Family Status
household_summary, Detailed Household Summary
under_18_family, Family Members Under 18
veterans_admin_questionnaire, Filled In Income Questionnaire For Veterans Admin
vet_benefit, Veteran Benefits
tax_status, Tax Filer Status
gains, Capital Gains
losses, Capital Losses
stocks_status, Dividends From Stocks
citizenship, Citizenship
mig_year, Migration Year
country_of_birth_own, Individual's Birth Country
country_of_birth_father, Father's Birth Country
country_of_birth_mother, Mother's Birth Country
migration_code_change_in_msa, Migration Code: Change In MSA
migration_prev_sunbelt, Migration: Previous Residence In Sunbelt
migration_code_move_within_reg, Migration Code: Move Within Region
migration_code_change_in_reg, Migration Code: Change In Region
residence_1_year_ago, Lived In This House One Year Ago?
old_residence_reg, Region Of Previous Residence
old_residence_state, State Of Previous Residence
importance_of_record, Instance Weight
income_above_limit, Is Income Above $50K? (Prediction Target)
Exploratory Data Analysis
The following section presents the exploratory data analysis (EDA) conducted on the different datasets. We begin the analysis with the training dataset.
In this phase, we will explore the data by examining the dataset's structure to understand its features and variables. We will also identify missing values, outliers, and potential data quality issues. Additionally, descriptive statistics will be performed to gain initial insights into the data.
Libraries Used:
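The article's exact import cell isn't reproduced here; a typical setup for this kind of analysis (an assumption on my part, not the original list) looks like:

```python
# Assumed imports for the EDA; the article's exact library list is not shown
import pandas as pd              # tabular data handling
import numpy as np               # numerical operations
import matplotlib.pyplot as plt  # base plotting
import seaborn as sns            # statistical visualizations
```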
Data Overview and EDA
In this section, we will explore the data to see if there are any inconsistencies or irregularities that may be present.
Checking for Missing Values
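The notebook's check isn't reproduced here; a minimal sketch, assuming the training frame is loaded as train, would be:

```python
# Count missing values per column and show only the affected columns
missing = train.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))
```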
The results showed several columns with missing values.
These missing values will be handled in the data preparation stage.
Visualization
Here we generate a count plot to visualize the distribution of observations in the 'income_above_limit' variable.
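A minimal sketch of such a count plot, assuming the frame is named train (the same approach applies to the gender plot below):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Bar heights show how many rows fall into each income class
sns.countplot(data=train, x="income_above_limit")
plt.title("Distribution of income_above_limit")
plt.show()
```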
There is an imbalance in the data
The dataset is highly imbalanced, so we need to take steps to mitigate the imbalance. The following methods could be used:
Downsample the majority class (here, the majority class is 'Below limit')
Upsample the minority class (here, the minority class is 'Above limit')
A similar count plot visualizes the distribution of observations in the 'gender' variable.
Gender Variable
There are more females than males in the dataset
In this chart, we visualize the distribution of citizenship groups, and each wedge represents a different citizenship category with the corresponding percentage label.
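A sketch of how such a chart could be produced; the frame name and styling are assumptions:

```python
import matplotlib.pyplot as plt

# One wedge per citizenship category, labeled with its percentage share
counts = train["citizenship"].value_counts()
plt.pie(counts, labels=counts.index, autopct="%1.1f%%")
plt.title("Distribution of citizenship groups")
plt.show()
```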
Citizenship group variable
The majority of the workers are legal US citizens, while foreigners make up 5.9%.
Hypothesis
The goal of the statistical test is to examine the data and determine whether there is enough evidence to either reject the null hypothesis in favor of the alternative hypothesis or fail to reject the null hypothesis due to insufficient evidence.
Null Hypothesis (H0): There is no association between an individual's education level and the likelihood of earning above the income threshold.
Alternative Hypothesis (H1): Individuals with more education are significantly more likely to earn above the threshold.
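The article doesn't name the test used; a chi-square test of independence is the standard choice for two categorical variables, and a sketch might look like this:

```python
from scipy.stats import chi2_contingency
import pandas as pd

# Contingency table of education level against the income target
table = pd.crosstab(train["education"], train["income_above_limit"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.2e}")
```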
With a P-value close to zero, we reject the null hypothesis. This leads to the conclusion that there is a significant association between an individual's education level and the likelihood of earning above the specified income threshold. The evidence suggests that people with higher education levels are much more likely to have incomes above the specified threshold.
3. Data Preparation
Having thoroughly explored the data and gained deeper insight into it, we proceed to data preprocessing and cleaning. This step is essential to ensure the data is appropriately prepared for training, and it involves applying various techniques to make the data compatible with the model.
Handling Missing Values
As mentioned earlier in the article, there are missing values in the dataset.
This code performs missing-value imputation using the mode for the specified columns in both the training and test datasets. The mode is chosen for imputing missing values in categorical variables because it aligns with the nature of the data and is a meaningful way to handle missing values in this context.
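The affected columns aren't listed in the article; a sketch with hypothetical column names:

```python
# Hypothetical list of categorical columns that contain missing values
cols_to_impute = ["education_institute", "unemployment_reason", "is_labor_union"]

for col in cols_to_impute:
    mode_value = train[col].mode()[0]          # most frequent category in the training data
    train[col] = train[col].fillna(mode_value)
    test[col] = test[col].fillna(mode_value)   # reuse the training mode on the test set
```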
Re-checking Missing Values
After imputation, there are no remaining missing values.
Standardize and normalize features as necessary.
This code establishes a data preprocessing pipeline using scikit-learn. It uses OneHotEncoder to encode binary and nominal features separately: for binary features, a custom transformer drops one of the two one-hot columns; nominal features are one-hot encoded, ignoring unknown categories and producing a dense array. The entire preprocessing workflow is encapsulated in a ColumnTransformer named preprocessor.
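A sketch of such a preprocessor (scikit-learn >= 1.2 assumed for sparse_output; the column lists are hypothetical, and OneHotEncoder's drop='if_binary' stands in for the article's custom binary transformer):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column groupings; the article does not enumerate them
binary_cols = ["gender", "is_labor_union"]
nominal_cols = ["education", "marital_status", "race", "citizenship"]

preprocessor = ColumnTransformer(
    transformers=[
        # keeps a single column for two-category features
        ("binary", OneHotEncoder(drop="if_binary", sparse_output=False), binary_cols),
        # unseen categories at prediction time are ignored; output is a dense array
        ("nominal", OneHotEncoder(handle_unknown="ignore", sparse_output=False), nominal_cols),
    ],
    remainder="passthrough",  # numeric columns pass through for later scaling
)
```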
Class Imbalance
This code performs random oversampling on the training set using scikit-learn to address the class imbalance. After oversampling, it prints the balanced class distribution of the training set.
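sklearn.utils.resample is one way to do this in scikit-learn (imbalanced-learn's RandomOverSampler is a common alternative); a sketch with assumed variable names:

```python
import pandas as pd
from sklearn.utils import resample

# Keep features and target together so rows stay aligned while resampling
majority = train[train["income_above_limit"] == "Below limit"]
minority = train[train["income_above_limit"] == "Above limit"]

# Draw minority rows with replacement until both classes are the same size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
resampled = pd.concat([majority, minority_upsampled])
print(resampled["income_above_limit"].value_counts())
```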
Balancing the training data
Data splitting into training and evaluation sets
In this code, we split the resampled data into training and evaluation sets using train_test_split with a specified test size and random state.
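A sketch of the split; the exact test size and random state are assumptions:

```python
from sklearn.model_selection import train_test_split

# Separate features and target from the resampled frame, then hold out 20%
X_resampled = resampled.drop(columns="income_above_limit")
y_resampled = resampled["income_above_limit"]
X_train, X_eval, y_train, y_eval = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled
)
```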
4. Modeling
Model Selection
Choose appropriate machine learning algorithms for binary classification.
Consider algorithms such as RandomForestClassifier, CatBoostClassifier, XGBClassifier, and LGBMClassifier
We define a list of classification models (RandomForestClassifier, CatBoostClassifier, XGBClassifier, and LGBMClassifier) and an empty list to store evaluation metrics. We then iterate over the models, creating a pipeline that includes preprocessing, scaling, and the current model. Each pipeline is fitted on the training data, and predictions are made on the evaluation set. Metrics such as accuracy, F1 score, ROC AUC score, precision, and recall are calculated and stored in the 'metrics' list for each model.
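A sketch of that loop, assuming the preprocessor defined earlier and a target that has been label-encoded to 0/1 beforehand; the hyperparameters are assumptions:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             precision_score, recall_score)
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

models = [
    RandomForestClassifier(random_state=42),
    CatBoostClassifier(verbose=0, random_state=42),
    XGBClassifier(random_state=42),
    LGBMClassifier(random_state=42),
]

metrics = []
for model in models:
    # Each pipeline chains the shared preprocessing, scaling, and one model
    pipe = Pipeline([
        ("preprocess", preprocessor),
        ("scale", StandardScaler()),
        ("model", model),
    ])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_eval)
    metrics.append({
        "model": type(model).__name__,
        "accuracy": accuracy_score(y_eval, y_pred),
        "f1": f1_score(y_eval, y_pred),
        "roc_auc": roc_auc_score(y_eval, pipe.predict_proba(X_eval)[:, 1]),
        "precision": precision_score(y_eval, y_pred),
        "recall": recall_score(y_eval, y_pred),
    })
```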
Results
The Random Forest Classifier was the best model after evaluation.
Here is the confusion matrix for the Random Forest Classifier.
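A sketch of how the matrix can be rendered from the fitted pipeline (variable names assumed):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# best_pipeline: the fitted Random Forest pipeline from the loop above
ConfusionMatrixDisplay.from_estimator(best_pipeline, X_eval, y_eval)
plt.title("Random Forest confusion matrix")
plt.show()
```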
5. Deployment
I then created an app using Streamlit, which I deployed on Hugging Face.
Check out the project's GitHub repository.
You can visit my Hugging Face profile.
You can check out the FastAPI service.
I also used Docker to deploy the app; a Docker image is available as well.
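To give a flavor of the FastAPI service, here is a minimal sketch of a prediction endpoint; the route, model path, and response format are assumptions, not the project's actual code:

```python
import joblib
import pandas as pd
from fastapi import FastAPI

app = FastAPI(title="Income Prediction API")
pipeline = joblib.load("model_pipeline.joblib")  # hypothetical saved pipeline

@app.post("/predict")
def predict(record: dict):
    # Wrap the incoming JSON record in a one-row frame for the pipeline
    X = pd.DataFrame([record])
    pred = int(pipeline.predict(X)[0])
    return {"income_above_limit": "Above limit" if pred == 1 else "Below limit"}
```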