Income Prediction Challenge

Income inequality - when income is distributed unevenly among a population - is a growing problem in developing nations across the world. With the rapid rise of AI and worker automation, this problem could continue to grow if steps are not taken to address the issue.

The outline for this project follows the stages of the CRISP-DM framework:

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modelling
  • Evaluation
  • Deployment

In addition to following the framework as a reference, I will also address the following aspects:

  • Hugging Face App
  • Docker Image
  • FastAPI

Now let us begin with the project.

1. Business Understanding


The objective of this challenge is to create a machine-learning model to predict whether an individual earns above or below a certain threshold (here, $50,000).

This solution can potentially reduce the cost and improve the accuracy of monitoring key population indicators, such as income level, between census years. This information will help policymakers to better manage and mitigate income inequality globally.

Objectives

  • Develop a predictive model to categorize individuals based on income levels.
  • Reduce the cost and improve the accuracy of monitoring key population indicators.
  • Assist policymakers in managing and mitigating global income inequality.

2. Data Understanding

Data Collection

In this project, we identified and acquired relevant datasets containing information on income and related features. The dataset resides on the Zindi platform, and for ease of access and security, we have made the datasets available in the GitHub Repository.

Before exploring each of the datasets, let us look at what the columns represent.

Column names and descriptions

  • age: Age of the individual
  • gender: Gender
  • education: Education level
  • class: Class of worker
  • education_institute: Enrolled in an educational institution in the last week
  • marital_status: Marital status
  • race: Race
  • is_hispanic: Hispanic origin
  • employment_commitment: Full- or part-time employment status
  • unemployment_reason: Reason for unemployment
  • employment_stat: Has own business or is self-employed
  • wage_per_hour: Wage per hour
  • is_labor_union: Member of a labor union
  • working_week_per_year: Weeks worked in a year
  • industry_code: Industry category code
  • industry_code_main: Major industry code
  • occupation_code: Occupation category code
  • occupation_code_main: Major occupation code
  • total_employed: Number of persons who worked for the employer
  • household_stat: Detailed household and family status
  • household_summary: Detailed household summary
  • under_18_family: Family members under 18
  • veterans_admin_questionnaire: Filled in income questionnaire for Veterans Admin
  • vet_benefit: Veteran benefits
  • tax_status: Tax filer status
  • gains: Gains
  • losses: Losses
  • stocks_status: Dividends from stocks
  • citizenship: Citizenship
  • mig_year: Migration year
  • country_of_birth_own: Individual's birth country
  • country_of_birth_father: Father's birth country
  • country_of_birth_mother: Mother's birth country
  • migration_code_change_in_msa: Migration code, change in MSA
  • migration_prev_sunbelt: Migration, previous residence in sunbelt
  • migration_code_move_within_reg: Migration code, move within region
  • migration_code_change_in_reg: Migration code, change in region
  • residence_1_year_ago: Lived in this house a year ago?
  • old_residence_reg: Region of previous residence
  • old_residence_state: State of previous residence
  • importance_of_record: Instance weight
  • income_above_limit: Is income above $50k? (the target variable)

Exploratory Data Analysis

The following section presents the exploratory data analysis (EDA) conducted on the different datasets. We will begin the analysis with the train dataset.

In this phase, we will explore the data by examining the dataset's structure to understand its features and variables. We will also identify missing values, outliers, and potential data quality issues. Additionally, descriptive statistics will be performed to gain initial insights into the data.

Libraries Used:

I used the following libraries for preprocessing and visualization.
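The original import cell was not preserved in this version of the article, so here is a minimal sketch of the stack typically used for this kind of workflow:

```python
# Core data handling
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing and modelling utilities (scikit-learn)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
```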

Data Overview and EDA

In this section, we explore the data to check for any inconsistencies or irregularities.

General Information of the Data
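The exact code was not preserved; a minimal sketch, assuming the files are named Train.csv and Test.csv, might look like this:

```python
# Load the datasets (file names are assumptions; adjust to the repository's actual paths)
train_df = pd.read_csv("Train.csv")
test_df = pd.read_csv("Test.csv")

# General information: column names, dtypes, and non-null counts
train_df.info()

# Summary statistics for the numeric columns
print(train_df.describe())
```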

Checking for Missing Values
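A quick way to surface the missing values (a sketch; variable names follow the snippet above):

```python
# Count missing values per column, showing only columns that have any
missing = train_df.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))
```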

The results show that several columns contain missing values. These will be handled in the data preparation stage.

Visualization

In this code, we generate a count plot to visualize the distribution of observations in the 'income_above_limit' variable.
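A minimal sketch of such a count plot using seaborn:

```python
# Count plot of the target variable to inspect class balance
plt.figure(figsize=(6, 4))
sns.countplot(x="income_above_limit", data=train_df)
plt.title("Distribution of income_above_limit")
plt.show()
```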

The plot reveals a significant imbalance in the data.

We have a highly imbalanced dataset, so we need to take steps to mitigate the imbalance. The following methods could be used:

  • Downsample the majority class (here, the majority class is 'Below limit')
  • Upsample the minority class (here, the minority class is 'Above limit')

An analogous count plot visualizes the distribution of observations in the 'gender' variable.

Gender Variable

There are more females than males in the dataset.

In this chart, we visualize the distribution of citizenship groups, and each wedge represents a different citizenship category with the corresponding percentage label.
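A minimal sketch of the pie chart, using matplotlib's autopct for the percentage labels:

```python
# Pie chart of citizenship categories with percentage labels
citizenship_counts = train_df["citizenship"].value_counts()
plt.figure(figsize=(7, 7))
plt.pie(citizenship_counts, labels=citizenship_counts.index, autopct="%1.1f%%")
plt.title("Distribution of citizenship groups")
plt.show()
```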

Citizenship group variable

The majority of the workers are legal US citizens, while foreign nationals make up 5.9%.

Hypothesis

The goal of the statistical test is to examine the data and determine whether there is enough evidence to either reject the null hypothesis in favor of the alternative hypothesis or fail to reject the null hypothesis due to insufficient evidence.

Null Hypothesis (H0): There is no association between an individual's education level and the likelihood of earning above the income threshold.

Alternative Hypothesis (H1): Individuals with higher education levels are significantly more likely to earn above the income threshold.
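The article does not show the test itself; a chi-square test of independence on the education/income contingency table is a natural fit, sketched below with scipy:

```python
from scipy.stats import chi2_contingency

# Contingency table of education level vs. the income target
contingency = pd.crosstab(train_df["education"], train_df["income_above_limit"])

# Chi-square test of independence
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.2e}")
```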

With a P-value close to zero, we reject the null hypothesis. This leads to the conclusion that there is a significant association between an individual's education level and the likelihood of earning above the specified income threshold. The evidence suggests that people with higher education levels are much more likely to have incomes above the specified threshold.

3. Data Preparation

Having thoroughly explored the data and gained deeper insights, we proceed to data preprocessing and cleaning. This step is essential to ensure that the data is appropriately prepared for training, and it involves applying various techniques to make the data compatible with the model.

Handling Missing Values

As mentioned earlier in the article, there are missing values in the dataset.

This code performs missing-value imputation using the mode for the specified columns in both the training and test datasets. The mode is chosen for imputing missing values in categorical variables because it aligns with the nature of the data and is a meaningful way to handle missing values in this context.
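A minimal sketch of the imputation, with an illustrative (not exhaustive) column list:

```python
# Columns imputed with the mode (illustrative list; use the columns
# flagged in the missing-value check above)
cols_with_missing = ["education_institute", "unemployment_reason",
                     "is_labor_union", "migration_prev_sunbelt",
                     "old_residence_reg", "old_residence_state"]

# Use the training set's mode for both datasets to avoid leakage
for col in cols_with_missing:
    mode_value = train_df[col].mode()[0]  # most frequent category
    train_df[col] = train_df[col].fillna(mode_value)
    test_df[col] = test_df[col].fillna(mode_value)
```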

After imputation, there are no missing values remaining in either dataset.

Next, we standardize and encode features as necessary.

This code establishes a data preprocessing pipeline using scikit-learn. It uses OneHotEncoder to encode binary and nominal features separately: for binary features, one of the two columns is dropped; nominal features are one-hot encoded, ignoring unknown categories and producing a dense array. The entire preprocessing workflow is encapsulated in a ColumnTransformer named preprocessor.
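A sketch of such a preprocessor; the feature lists are illustrative assumptions, and OneHotEncoder's drop="if_binary" option plays the role of the binary-dropping transformer described above:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Feature groups (illustrative; the exact lists depend on the columns kept after cleaning)
binary_features = ["gender", "is_labor_union", "residence_1_year_ago"]
nominal_features = ["education", "marital_status", "race", "citizenship",
                    "tax_status", "industry_code_main", "occupation_code_main"]
numeric_features = ["age", "wage_per_hour", "working_week_per_year",
                    "gains", "losses", "stocks_status", "total_employed"]

preprocessor = ColumnTransformer(
    transformers=[
        # Binary features: one-hot encode, dropping one of the two columns
        ("binary", OneHotEncoder(drop="if_binary", sparse_output=False), binary_features),
        # Nominal features: one-hot encode, ignoring categories unseen at fit
        # time and returning a dense array
        ("nominal", OneHotEncoder(handle_unknown="ignore", sparse_output=False), nominal_features),
        # Numeric features pass through untouched; scaling happens later in the model pipeline
        ("numeric", "passthrough", numeric_features),
    ],
    remainder="drop",
)
```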


Class Imbalance

This code performs random oversampling on the training set using the imbalanced-learn library, addressing the class imbalance. After oversampling, it prints the balanced class distribution in the training set.
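A minimal sketch, assuming X and y hold the feature matrix and the encoded 'income_above_limit' target:

```python
from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate minority-class samples until the classes are balanced
oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X, y)

# Confirm the balanced class distribution
print(y_resampled.value_counts())
```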


Splitting the data into training and evaluation sets

In this code, we split the resampled data into training and evaluation sets using train_test_split with a specified test size and random state.
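A sketch of the split; the test size shown is an assumption:

```python
# Split the resampled data into training and evaluation sets
X_train, X_eval, y_train, y_eval = train_test_split(
    X_resampled, y_resampled,
    test_size=0.2,      # evaluation share (assumed)
    random_state=42,    # for reproducibility
)
```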


4. Modelling

Model Selection

We choose machine learning algorithms appropriate for binary classification, considering RandomForestClassifier, CatBoostClassifier, XGBClassifier, and LGBMClassifier.

We define a list of classification models (RandomForestClassifier, CatBoostClassifier, XGBClassifier, and LGBMClassifier) and an empty list to store evaluation metrics. We then iterate over the models, creating a pipeline that includes preprocessing, scaling, and the current model. Each pipeline is fitted on the training data, and predictions are made on the evaluation set. Metrics such as accuracy, F1 score, ROC AUC score, precision, and recall are calculated and stored in the metrics list for each model.
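A sketch of this comparison loop, assuming the target is encoded as 0/1 and that preprocessor, X_train, y_train, X_eval, and y_eval come from the earlier steps:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             precision_score, recall_score)
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

models = [
    RandomForestClassifier(random_state=42),
    CatBoostClassifier(verbose=0, random_state=42),
    XGBClassifier(random_state=42),
    LGBMClassifier(random_state=42),
]

metrics = []
for model in models:
    # Preprocessing, scaling, and the model in a single pipeline
    pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("scaler", StandardScaler()),
        ("model", model),
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_eval)

    # Collect the evaluation metrics for this model
    metrics.append({
        "model": type(model).__name__,
        "accuracy": accuracy_score(y_eval, y_pred),
        "f1": f1_score(y_eval, y_pred),
        "roc_auc": roc_auc_score(y_eval, y_pred),
        "precision": precision_score(y_eval, y_pred),
        "recall": recall_score(y_eval, y_pred),
    })

print(pd.DataFrame(metrics))
```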


5. Evaluation

Results

Best Model

The Random Forest Classifier was the best model after evaluation.

Here is the confusion matrix for the Random Forest Classifier.

Confusion Matrix
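For reference, a minimal sketch of how the matrix can be plotted, assuming best_pipeline is the fitted Random Forest pipeline from the loop above:

```python
from sklearn.metrics import ConfusionMatrixDisplay

# Plot the confusion matrix for the best pipeline on the evaluation set
ConfusionMatrixDisplay.from_estimator(best_pipeline, X_eval, y_eval)
plt.title("Random Forest Classifier - Confusion Matrix")
plt.show()
```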

6. Deployment

Then I created an app using Streamlit, which I eventually deployed on Hugging Face.
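The app code itself lives in the repository; purely as an illustration, a stripped-down Streamlit app (file paths and input fields are assumptions, and the real app would expose every feature) could look like this:

```python
import joblib
import pandas as pd
import streamlit as st

st.title("Income Prediction App")

# Load the trained pipeline (path and serialization format are assumptions)
model = joblib.load("model.joblib")

# A couple of illustrative inputs; the real app would cover all features
age = st.number_input("Age", min_value=0, max_value=100, value=30)
education = st.selectbox("Education", ["High school graduate",
                                       "Bachelors degree", "Masters degree"])

if st.button("Predict"):
    input_df = pd.DataFrame([{"age": age, "education": education}])
    prediction = model.predict(input_df)[0]
    st.write("Income above 50k" if prediction == 1 else "Income below 50k")
```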

Kindly check out the project's GitHub repository.

You can visit my Hugging Face profile.

You can also check out the FastAPI service.
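Again as an illustration rather than the actual service code, a minimal FastAPI prediction endpoint (paths and the feature schema are assumptions) might look like this:

```python
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Income Prediction API")
model = joblib.load("model.joblib")  # path is an assumption

class Features(BaseModel):
    age: int
    education: str
    # ...the remaining features would be declared here

@app.post("/predict")
def predict(features: Features):
    # Build a one-row frame from the request body and run the pipeline
    input_df = pd.DataFrame([features.model_dump()])
    prediction = int(model.predict(input_df)[0])
    return {"income_above_limit": prediction}
```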

I also used Docker to deploy the app; the Docker image is available as well.
