Data Science with Python: Regression Modeling
Nagaraja Kharvi
Technology Leader | 16+ Years in E-Commerce & Digital Transformation | Expertise in AI/ML & Strategic Innovation | Driving Growth at Level Shoes, Chalhoub Group | Proven Success in UAE & Singapore
In this article, I will walk you through essential data exploration, cleaning, and processing techniques using Python's pandas library. We'll also dive into regression modeling with the scikit-learn library. Let's get started!
1. Data Exploration
We begin by loading the data and performing a quick exploration.
import pandas as pd
# Load the dataset
data = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
# Basic exploration
print(data.shape)
print(data.info())
print(data.describe())
print(data.head())
print(data.isnull().sum())
This allows us to understand the structure of the dataset, view the first few rows, and check for missing values.
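For a slightly deeper first look, a few more pandas calls on the same data frame show the column types, the number of distinct values per column, and the class balance of the Churn column (a small optional sketch):
# Optional extra checks: column dtypes, distinct values, and target balance
print(data.dtypes)
print(data.nunique())
print(data["Churn"].value_counts())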
2. Slicing Data for Analysis
We can slice data for specific columns and rows for a more granular view.
# Column slicing
print(data["gender"])
print(data[["gender", "Partner"]])
# Row slicing
print(data[5:10])
# Combine row and column slicing
print(data[5:10][["gender", "Partner"]])
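The same slices can also be written with loc and iloc, which make the label-based versus position-based intent explicit. A small sketch, assuming the default integer index:
# loc is label-based (end label inclusive); iloc is position-based (end position exclusive)
print(data.loc[5:9, ["gender", "Partner"]])
print(data.iloc[5:10, 0:3])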
3. Conditional Slicing
Sometimes we need to filter data based on conditions. Here’s how to slice the data by gender and churn status.
# Conditional slicing (single condition)
male_customers = data[data["gender"] == "Male"]
print(male_customers["customerID"])
# Conditional slicing (multiple conditions)
male_churn = data[(data["gender"] == "Male") & (data["Churn"] == "Yes")]
print(male_churn)
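Equivalent filters can be expressed with query and isin, which some readers find easier to scan. A brief sketch; the Contract values below are assumed to match those in the public Telco churn file:
# Alternative ways to write the same kind of filters
male_churn_q = data.query('gender == "Male" and Churn == "Yes"')
flexible_contracts = data[data["Contract"].isin(["Month-to-month", "One year"])]
print(male_churn_q.shape, flexible_contracts.shape)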
4. Data Processing
Now, we will handle missing values, drop irrelevant columns, and encode categorical data for model building.
# Fill missing values with the mode (categorical) or the mean (numerical)
data["gender"] = data["gender"].fillna(data["gender"].mode()[0])
data["tenure"] = data["tenure"].fillna(data["tenure"].mean())
# Drop duplicates and unnecessary columns
data.drop_duplicates(inplace=True)
data.drop(labels=["customerID"], axis=1, inplace=True)
# Label encoding for categorical columns
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
data["gender"] = encoder.fit_transform(data["gender"])
data["Partner"] = encoder.fit_transform(data["Partner"])
# Encode the remaining text columns the same way so the scaler and model below receive numeric input
for col in data.select_dtypes(include="object").columns:
    data[col] = LabelEncoder().fit_transform(data[col])
5. Data Normalization
We need to normalize the numerical data to prepare it for model training.
from sklearn.preprocessing import MinMaxScaler
# Normalize the data
scaler = MinMaxScaler()
x = data.iloc[:, :-1].drop(columns=["MonthlyCharges", "TotalCharges"])
x_scaled = scaler.fit_transform(x)
# Define target variable
y = data.iloc[:, -1]
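Min-max scaling is only one option; standardization (zero mean, unit variance) is a common alternative and works just as well for linear models. A small sketch reusing the feature frame x defined above:
from sklearn.preprocessing import StandardScaler
# Standardize the features as an alternative to min-max scaling
std_scaler = StandardScaler()
x_standardized = std_scaler.fit_transform(x)
print(x_standardized.mean(axis=0).round(2))
print(x_standardized.std(axis=0).round(2))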
6. Splitting the Data for Training and Testing
Split the data into training and testing sets to evaluate the model's performance.
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x_scaled, y, train_size=0.8)
print(xtrain.shape, xtest.shape)
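Because the split is random, the shapes stay the same but the selected rows change on every run. For reproducible results, and a split that preserves the class balance of the binary target, you can pass random_state and stratify; the seed value below is arbitrary:
# Optional: reproducible, stratified split (assumes y is the encoded binary Churn column)
xtrain, xtest, ytrain, ytest = train_test_split(
    x_scaled, y, train_size=0.8, random_state=42, stratify=y
)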
7. Building and Evaluating a Regression Model
We will build a linear regression model to predict the target variable (here the encoded Churn column; for a binary target a classifier such as logistic regression is usually the better choice, but linear regression serves to demonstrate the regression workflow). The model's performance will be evaluated using R-squared and mean squared error (MSE).
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Train the model
model = LinearRegression()
model.fit(xtrain, ytrain)
# Predict and evaluate
ypred = model.predict(xtest)
r_squared = model.score(xtest, ytest)
mse = mean_squared_error(ytest, ypred)
print(f"R-Squared: {r_squared}")
print(f"Mean Squared Error: {mse}")
8. Visualizing the Results
Lastly, let's visualize the test data and predicted results.
# Visualize test data vs predicted
import matplotlib.pyplot as plt
plt.scatter(xtest[:, 0], ytest, color='blue', label='Actual')
plt.scatter(xtest[:, 0], ypred, color='red', label='Predicted')
plt.legend()
plt.show()
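A residual plot is another common diagnostic and often reveals systematic errors that the raw scatter hides. A brief sketch using the same ytest and ypred:
# Residual plot: prediction error for each test sample
residuals = ytest.to_numpy() - ypred
plt.scatter(ypred, residuals, alpha=0.5)
plt.axhline(0, color="black", linewidth=1)
plt.xlabel("Predicted value")
plt.ylabel("Residual (actual - predicted)")
plt.show()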
Conclusion
This code demonstrates how to explore, clean, and preprocess data, followed by building and evaluating a regression model. These techniques are essential for building predictive models and deriving insights from data.