Linear Regression ML Implementation

  • Definition: Linear regression is a statistical method that models the relationship between a dependent variable (target) and one or more independent variables (features) using a linear equation. It's commonly used in predictive modeling and data analysis to understand the influence of variables and predict future outcomes.
  • Objective: This documentation aims to guide you through implementing a linear regression model, covering data preparation, model building, evaluation, and interpretation of results. We will focus on creating a robust pipeline that ensures accuracy and reliability.
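  • Model Form: In its standard form, the model is y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε, where β₀ is the intercept, the βᵢ are coefficients learned from the data, and ε is the error term. Training the model means estimating the β values that minimize the squared error on the training data.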


Loading the Data

Import Libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error        

These libraries provide the tools needed for data manipulation, visualization, and model building.

Load the Dataset:

df = pd.read_csv('your_dataset.csv')        

Loading the dataset into a DataFrame allows for easy manipulation and analysis.

Preview the Data:

print(df.head())        

Previewing the first few rows helps understand the structure of the data, the types of variables, and the presence of any immediate issues like missing values.

Preparing the Dataset

Data Cleaning:

  • Before building the model, it's crucial to clean the dataset. This involves handling missing values, removing duplicates, and converting categorical variables to numerical formats where applicable (the latter two steps are sketched at the end of this subsection).
  • Handling Missing Values:

df.dropna(inplace=True)        

Dropping rows with missing values is a straightforward approach, though imputation might be necessary depending on the dataset's nature.
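
The cleaning checklist above also mentions duplicates and categorical variables. A minimal sketch of both steps using pandas (pd.get_dummies is one common encoding choice, not the only one):

df.drop_duplicates(inplace=True)  # remove exact duplicate rows
df = pd.get_dummies(df, drop_first=True)  # one-hot encode categorical columns

drop_first=True drops one dummy per category to avoid a redundant, perfectly collinear column.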

Feature Selection:

  • Select the features most relevant to predicting the target variable; good feature selection can improve model performance and reduce overfitting. A simple correlation-based selection sketch follows the pair plot below.
  • Generate a pair plot to visualize relationships between the features and the target variable:

sns.pairplot(df)
plt.show()        
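
As a simple starting point, you can rank features by their correlation with the target. A minimal sketch, assuming a numeric target column named 'target' (a placeholder) and an arbitrary threshold of 0.3:

# Keep features whose absolute correlation with the target exceeds a threshold
corr_with_target = df.corr(numeric_only=True)['target'].drop('target')
selected = corr_with_target[corr_with_target.abs() > 0.3].index.tolist()
print(selected)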

Summarizing the Statistics

Descriptive Statistics:

  • Descriptive statistics provide a summary of the central tendency, dispersion, and shape of a dataset's distribution.

print(df.describe())        

Graphical Summaries:

Histograms:

df.hist(bins=30, figsize=(10, 8))
plt.show()        

Histograms display the distribution of individual features, helping identify skewness and outliers.

Box Plots:

sns.boxplot(data=df)
plt.show()        

Box plots are useful for detecting outliers and understanding the distribution of the data.

Checking for Missing Values

Identify Missing Data:

print(df.isnull().sum())        

This helps you identify the presence of missing values in the dataset.

Handling Missing Data:

  • Imputation: Depending on the data, you can impute missing values with the mean, median, or mode of the column.

df.fillna(df.mean(numeric_only=True), inplace=True)  # Example: impute numeric columns with the column mean
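
scikit-learn's SimpleImputer offers the same strategies behind a fit/transform interface that can later be reused on unseen data; a minimal sketch for the numeric columns:

from sklearn.impute import SimpleImputer

# Median imputation is more robust to outliers than the mean
num_cols = df.select_dtypes(include='number').columns
imputer = SimpleImputer(strategy='median')
df[num_cols] = imputer.fit_transform(df[num_cols])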

Visualize missing data with a heatmap:

sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()        

Exploratory Data Analysis (EDA)

  • Data Visualization: EDA involves generating plots to explore the relationships between variables. This can reveal patterns, trends, and potential issues with the data.
  • Scatter Plots:

sns.scatterplot(x='feature', y='target', data=df)  # placeholder column names
plt.show()        

Scatter plots show the relationship between two continuous variables, helping assess linearity.

Pair Plots:

sns.pairplot(df)
plt.show()        

  • Pair plots provide a matrix of scatter plots, useful for examining interactions between all features.
  • Check the distribution of the target variable for approximate normality (replace 'target' with your column name):

sns.histplot(df['target'], kde=True)
plt.show()        

Correlation Analysis

  • Correlation Matrix:

corr_matrix = df.corr(numeric_only=True)  # restrict to numeric columns
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()        

The correlation matrix quantifies the degree to which pairs of variables are linearly related, and the annotated heatmap makes strong correlations easy to spot. Pairs of highly correlated features are a sign of multicollinearity, which inflates the variance of the estimated coefficients.
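
To go beyond pairwise correlations and flag multicollinearity directly, you can compute variance inflation factors. A minimal sketch, assuming statsmodels is installed (a VIF above roughly 5-10 is a common rule of thumb):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF measures how well each feature is explained by the other features
numeric = df.select_dtypes(include='number').dropna()
vif = pd.Series(
    [variance_inflation_factor(numeric.values, i) for i in range(numeric.shape[1])],
    index=numeric.columns,
)
print(vif)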

Standard Scaling

  • Why Scaling?: Feature scaling puts all features on a comparable scale. For linear regression this makes coefficients easier to compare, and it is especially important when features have different units or scales, or when you later add regularization.
  • StandardScaler:

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)  # returns a NumPy array of standardized values

Standard scaling transforms each feature to have a mean of 0 and a standard deviation of 1. In practice, fit the scaler on the training set only and apply the same transformation to the test set to avoid data leakage, as sketched below.
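
A minimal sketch of the leakage-safe pattern, assuming the X_train/X_test split from the next section:

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics on the test set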

Model Training

Splitting the Data:

  • Explanation: Splitting the dataset into training and testing sets allows you to assess the model's performance on unseen data.
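
The split below assumes a feature matrix X and a target vector y. A minimal sketch, assuming the target column is named 'target' (a placeholder; substitute your own):

X = df.drop(columns=['target'])  # every column except the target
y = df['target']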

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)        

Model Training:

regression = LinearRegression()
regression.fit(X_train, y_train)

Training the model involves fitting the linear regression equation to the training data.
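
Once fitted, the learned parameters can be inspected directly:

print(regression.coef_)       # one slope per feature
print(regression.intercept_)  # predicted y when all features are 0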

Model Evaluation:

y_pred = regression.predict(X_test)

Use the trained model to predict the target variable on the test set.

Residual Analysis:

  • Explanation: Residuals are the differences between the observed and predicted values. Analyzing residuals helps assess model accuracy.

residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residual')
plt.show()

A residual plot should show no clear pattern; if it does, it suggests issues with model assumptions.

Mean Squared Error (MSE):

  • Explanation: MSE measures the average squared difference between observed and predicted values, penalizing larger errors.

mse = mean_squared_error(y_test, y_pred)        

Mean Absolute Error (MAE):

  • Explanation: MAE measures the average magnitude of errors in predictions, without considering their direction.
  • Calculation:

mae = mean_absolute_error(y_test, y_pred)        

Root Mean Squared Error (RMSE):


  • Explanation: RMSE provides a measure of the average magnitude of errors, but in the same units as the target variable.
  • Calculation:

rmse = np.sqrt(mse)        
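
To report the three metrics side by side:

print(f'MSE:  {mse:.3f}')
print(f'MAE:  {mae:.3f}')
print(f'RMSE: {rmse:.3f}')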

  • Compare actual vs. predicted values to visually assess model performance:

plt.scatter(y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red')
plt.show()        

If you want hands-on practice, use the end-to-end example linked below.

GitHub Link: California-House-Pricing

