Linear Regression ML Implementation
Diluksha Shamal
Researcher | Software Engineer | AWS Community Builder | AWS | AI/ML Enthusiast | Experienced in Data Warehousing | Oracle | GenAI | LLM
Loading the Data
Import Libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
These libraries provide the tools needed for data manipulation, visualization, and model building.
Load the Dataset:
df = pd.read_csv('your_dataset.csv')
Loading the dataset into a DataFrame allows for easy manipulation and analysis.
Preview the Data:
print(df.head())
Previewing the first few rows helps understand the structure of the data, the types of variables, and the presence of any immediate issues like missing values.
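Beyond head(), a quick structural check is often worthwhile; the sketch below reports the shape, column dtypes, and non-null counts in one pass:
print(df.shape)  # (rows, columns)
df.info()        # column dtypes and non-null counts; flags missing data early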
Preparing the Dataset
Data Cleaning:
df.dropna(inplace=True)
Dropping rows with missing values is a straightforward approach, though imputation might be necessary depending on the dataset's nature.
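If dropping rows would discard too much data, imputation is a common alternative; the sketch below fills numeric columns with their median, which is more robust to outliers than the mean:
# Impute numeric columns with the median instead of dropping rows
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())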
Feature Selection:
sns.pairplot(df)
plt.show()
A pair plot gives a quick visual screen of which features vary with the target and which features are redundant with each other; a numeric complement is sketched below.
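As that numeric complement (a minimal sketch, assuming the dataset has a numeric column literally named 'target'; substitute your own column name), candidate predictors can be ranked by their absolute correlation with the target:
# Rank candidate predictors by absolute correlation with the target
corr_with_target = df.corr(numeric_only=True)['target'].drop('target').abs()
print(corr_with_target.sort_values(ascending=False))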
Summarizing the Statistics
Descriptive Statistics:
print(df.describe())
describe() reports the count, mean, standard deviation, min/max, and quartiles of each numeric column, giving a quick sense of scale and spread.
Graphical Summaries:
Histograms:
df.hist(bins=30, figsize=(10, 8))
plt.show()
Histograms display the distribution of individual features, helping identify skewness and outliers.
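Skewness can also be quantified directly; as a rough rule of thumb, an absolute skew above about 1 suggests a strongly skewed feature that may benefit from a transformation:
# Positive skew = long right tail; negative skew = long left tail
print(df.skew(numeric_only=True).sort_values())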
Box Plots:
sns.boxplot(data=df)
plt.show()
Box plots are useful for detecting outliers and understanding the distribution of the data.
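As a numeric follow-up, the standard 1.5 × IQR rule (the same rule a box plot uses for its whiskers) can count outliers per column; a minimal sketch:
num = df.select_dtypes(include='number')
q1, q3 = num.quantile(0.25), num.quantile(0.75)
iqr = q3 - q1
# Values beyond 1.5*IQR from the quartiles are the points a box plot draws as outliers
outliers = ((num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)).sum()
print(outliers)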
Checking for Missing Values
Identify Missing Data:
print(df.isnull().sum())
This helps you identify the presence of missing values in the dataset.
Handling Missing Data:
df.fillna(df.mean(numeric_only=True), inplace=True)  # Example: impute numeric columns with their mean
Visualize missing data using a heatmap:
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()
Exploratory Data Analysis (EDA)
Scatter Plots:
sns.scatterplot(x='feature', y='target', data=df)  # replace 'feature' and 'target' with your column names
plt.show()
Scatter plots show the relationship between two continuous variables, helping assess linearity.
Pair Plots:
sns.pairplot(df)
plt.show()
Target Distribution:
sns.histplot(df['target'], kde=True)
plt.show()
A histogram of the target with a KDE overlay shows whether it is roughly symmetric or skewed; strong skew may motivate a transformation such as log-scaling before fitting.
Correlation Analysis
corr_matrix = df.corr(numeric_only=True)  # numeric_only avoids errors on non-numeric columns (pandas >= 2.0)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()
The correlation matrix quantifies the degree to which pairs of variables are linearly related.
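A useful follow-up is to flag strongly correlated feature pairs, since multicollinearity can destabilize linear regression coefficients; a minimal sketch, using 0.8 as a common but adjustable threshold:
corr = df.corr(numeric_only=True).abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
pairs = upper.stack().sort_values(ascending=False)
print(pairs[pairs > 0.8])  # candidate pairs to drop or combine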
Standard Scaling
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)  # returns a NumPy array, not a DataFrame
Standard scaling transforms each feature to have a mean of 0 and a standard deviation of 1. In practice, fit the scaler on the training features only and reuse it on the test set (see the sketch after the split below) to avoid data leakage.
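As a quick sanity check, the scaled array can be wrapped back into a DataFrame to confirm each column now has mean ≈ 0 and standard deviation ≈ 1 (a minimal sketch):
df_scaled = pd.DataFrame(df_scaled, columns=df.columns, index=df.index)
print(df_scaled.mean().round(3))  # ~0 for every column
print(df_scaled.std().round(3))   # ~1 (pandas uses ddof=1, so not exactly 1)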
Model Training
Splitting the Data:
X = df.drop('target', axis=1)  # feature matrix; 'target' is a placeholder column name
y = df['target']               # target vector
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
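If scaling is applied, a common pattern is to fit the scaler on the training features only and reuse the learned statistics on the test features, which keeps test-set information out of training; a sketch (train and predict with the scaled matrices if you use them):
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # apply the same transformation to the test set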
Model Training:
regression = LinearRegression()
regression.fit(X_train, y_train)
Training the model involves fitting the linear regression equation to the training data.
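Once fitted, the learned parameters can be inspected; a small sketch, assuming X is a DataFrame so the column names are available:
# Each coefficient is the expected change in the target per unit change in that feature,
# holding the other features fixed
for name, coef in zip(X.columns, regression.coef_):
    print(f'{name}: {coef:.4f}')
print('Intercept:', regression.intercept_)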
Model Evaluation:
y_pred = regression.predict(X_test)
Use the trained model to predict the target variable on the test set.
Residual Analysis:
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.show()
A residual plot should show no clear pattern; if it does, it suggests issues with model assumptions.
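To probe the normality assumption further, a histogram of the residuals is a common companion plot (a minimal sketch):
sns.histplot(residuals, kde=True)  # a roughly bell-shaped curve centered at 0 supports the assumptions
plt.xlabel('Residual')
plt.show()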
Mean Squared Error (MSE):
mse = mean_squared_error(y_test, y_pred)
MSE averages the squared errors, so large mistakes are penalized heavily.
Mean Absolute Error (MAE):
mae = mean_absolute_error(y_test, y_pred)
MAE averages the absolute errors and is less sensitive to outliers than MSE.
Root Mean Squared Error (RMSE):
rmse = np.sqrt(mse)
RMSE is the square root of MSE, which expresses the error in the target's original units.
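A compact way to report the metrics together, along with the coefficient of determination (R², available from the model's score method), is sketched below:
r2 = regression.score(X_test, y_test)  # R^2: share of target variance explained by the model
print(f'MSE:  {mse:.4f}')
print(f'MAE:  {mae:.4f}')
print(f'RMSE: {rmse:.4f}')
print(f'R^2:  {r2:.4f}')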
Actual vs. Predicted:
plt.scatter(y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red')  # ideal y = x line
plt.show()
Points lying close to the red diagonal indicate accurate predictions; systematic deviation from it indicates bias.
If you want to practice with a practical example, use the example provided below.
GitHub Link: California-House-Pricing