Data Science with Python: Regression Modeling
Nagaraja Kharvi
Technology Leader | 16+ Years in E-Commerce & Digital Transformation | Expertise in AI/ML & Strategic Innovation | Driving Growth at Level Shoes, Chalhoub Group | Proven Success in UAE & Singapore
In this article, I will walk you through essential data exploration, cleaning, and processing techniques using Python's pandas library. We'll also dive into regression modeling with the scikit-learn library. Let's get started!
1. Data Exploration
We begin by loading the data and performing a quick exploration.
import pandas as pd
# Load the dataset
data = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
# Basic exploration
print(data.shape)
print(data.info())
print(data.describe())
print(data.head())
print(data.isnull().sum())
This allows us to understand the structure of the dataset, view the first few rows, and check for missing values.
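For a slightly deeper first look, a few more pandas calls on the same data frame show the column types, the number of distinct values per column, and the class balance of the Churn column (a small optional sketch):
# Optional extra checks: column dtypes, distinct values, and target balance
print(data.dtypes)
print(data.nunique())
print(data["Churn"].value_counts())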
2. Slicing Data for Analysis
We can slice data for specific columns and rows for a more granular view.
# Column slicing
print(data["gender"])
print(data[["gender", "Partner"]])
# Row slicing
print(data[5:10])
# Combine row and column slicing
print(data[5:10][["gender", "Partner"]])
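The same slices can also be written with loc and iloc, which make the label-based versus position-based intent explicit. A small sketch, assuming the default integer index:
# loc is label-based (end label inclusive); iloc is position-based (end position exclusive)
print(data.loc[5:9, ["gender", "Partner"]])
print(data.iloc[5:10, 0:3])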
3. Conditional Slicing
Sometimes we need to filter data based on conditions. Here’s how to slice the data by gender and churn status.
# Conditional slicing (single condition)
male_customers = data[data["gender"] == "Male"]
print(male_customers["customerID"])
# Conditional slicing (multiple conditions)
male_churn = data[(data["gender"] == "Male") & (data["Churn"] == "Yes")]
print(male_churn)
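Equivalent filters can be expressed with query and isin, which some readers find easier to scan. A brief sketch; the Contract values below are assumed to match those in the public Telco churn file:
# Alternative ways to write the same kind of filters
male_churn_q = data.query('gender == "Male" and Churn == "Yes"')
flexible_contracts = data[data["Contract"].isin(["Month-to-month", "One year"])]
print(male_churn_q.shape, flexible_contracts.shape)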
4. Data Processing
Now, we will handle missing values, drop irrelevant columns, and encode categorical data for model building.
# Fill missing values with the mode (categorical) or the mean (numerical)
data["gender"] = data["gender"].fillna(data["gender"].mode()[0])
data["tenure"] = data["tenure"].fillna(data["tenure"].mean())
# Drop duplicates and unnecessary columns
data.drop_duplicates(inplace=True)
data.drop(labels=["customerID"], axis=1, inplace=True)
# Label encoding for categorical columns
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
data["gender"] = encoder.fit_transform(data["gender"])
data["Partner"] = encoder.fit_transform(data["Partner"])
# Encode the remaining text columns the same way so the scaler and model below receive numeric input
for col in data.select_dtypes(include="object").columns:
    data[col] = LabelEncoder().fit_transform(data[col])
5. Data Normalization
We need to normalize the numerical data to prepare it for model training.
from sklearn.preprocessing import MinMaxScaler
# Normalize the data
scaler = MinMaxScaler()
x = data.iloc[:, :-1].drop(columns=["MonthlyCharges", "TotalCharges"])
x_scaled = scaler.fit_transform(x)
# Define target variable
y = data.iloc[:, -1]
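Min-max scaling is only one option; standardization (zero mean, unit variance) is a common alternative and works just as well for linear models. A small sketch reusing the feature frame x defined above:
from sklearn.preprocessing import StandardScaler
# Standardize the features as an alternative to min-max scaling
std_scaler = StandardScaler()
x_standardized = std_scaler.fit_transform(x)
print(x_standardized.mean(axis=0).round(2))
print(x_standardized.std(axis=0).round(2))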
6. Splitting the Data for Training and Testing
Split the data into training and testing sets to evaluate the model's performance.
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x_scaled, y, train_size=0.8)
print(xtrain.shape, xtest.shape)
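Because the split is random, the shapes stay the same but the selected rows change on every run. For reproducible results, and a split that preserves the class balance of the binary target, you can pass random_state and stratify; the seed value below is arbitrary:
# Optional: reproducible, stratified split (assumes y is the encoded binary Churn column)
xtrain, xtest, ytrain, ytest = train_test_split(
    x_scaled, y, train_size=0.8, random_state=42, stratify=y
)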
7. Building and Evaluating a Regression Model
We will build a linear regression model to predict the target variable (here the encoded Churn column; for a binary target a classifier such as logistic regression is usually the better choice, but linear regression serves to demonstrate the regression workflow). The model's performance will be evaluated using R-squared and mean squared error (MSE).
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Train the model
model = LinearRegression()
model.fit(xtrain, ytrain)
# Predict and evaluate
ypred = model.predict(xtest)
r_squared = model.score(xtest, ytest)
mse = mean_squared_error(ytest, ypred)
print(f"R-Squared: {r_squared}")
print(f"Mean Squared Error: {mse}")
8. Visualizing the Results
Lastly, let's visualize the test data and predicted results.
# Visualize test data vs predicted
import matplotlib.pyplot as plt
plt.scatter(xtest[:, 0], ytest, color='blue', label='Actual')
plt.scatter(xtest[:, 0], ypred, color='red', label='Predicted')
plt.legend()
plt.show()
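A residual plot is another common diagnostic and often reveals systematic errors that the raw scatter hides. A brief sketch using the same ytest and ypred:
# Residual plot: prediction error for each test sample
residuals = ytest.to_numpy() - ypred
plt.scatter(ypred, residuals, alpha=0.5)
plt.axhline(0, color="black", linewidth=1)
plt.xlabel("Predicted value")
plt.ylabel("Residual (actual - predicted)")
plt.show()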
Conclusion
This code demonstrates how to explore, clean, and preprocess data, followed by building and evaluating a regression model. These techniques are essential for building predictive models and deriving insights from data.