Linear Regression to Predict Dependency of PM2.5 on CO2 in India using R
Abstract:
This project utilizes robust data and advanced R programming techniques to delve into the intricate relationship between carbon emissions and air quality. Amid contemporary environmental concerns, air pollution stands out as a complex and urgent issue. Its impact on health, ranging from respiratory issues to severe complications such as allergies and cancer, emphasizes the need for comprehensive exploration. The primary focus is on uncovering subtle correlations within air pollution data. Through predictive modelling, the study not only analyses current patterns but also serves as a proactive tool to anticipate future trends. This study employs linear regression to analyse a dataset, yielding a model summarized by its residual standard error; the RMSE and MAE provide further insight into the model's predictive performance.
Introduction:
In the present context, addressing air pollution is a pressing and intricate challenge that demands immediate attention. This pervasive issue not only jeopardizes the environment but also has significant repercussions for public health, ranging from conditions like asthma to more severe outcomes such as cancer. Our project seeks to delve into the complex relationship between carbon emissions and air quality, leveraging dependable data and advanced analytical methods. By employing the R statistical computing language, we aim to unveil connections within air pollution data and discern patterns that can enhance our comprehension of future environmental scenarios. Through this undertaking, we aim to offer valuable insights to facilitate informed decision-making and support endeavours to alleviate the impacts of air pollution.
Methodology:
Data Collection: The first step is to collect a dataset. We integrated data from several sources:
• https://www.epa.gov/environmental-topics
• https://data.un.org/default.aspx
• https://data.europa.eu/data/datasets?query=air%20quality%20&lo
• https://bhuvan.nrsc.gov.in/home/index.php
• https://www.epa.gov/climate-change
We curated a dataset spanning the years 2010-2020.
Data Pre-processing: The next step is to pre-process the collected data by removing any irrelevant information and cleaning the data. We checked for missing values using the is.na() function in R.
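A minimal sketch of this missing-value check, using a toy data frame whose column names (CO2, PM2.5) match the project's dataset; the values here are made up for illustration:

```r
# Hypothetical toy data frame standing in for the project's CSV
PM <- data.frame(
  Year  = 2010:2014,
  CO2   = c(1.70e6, 1.76e6, 1.83e6, 1.90e6, 2.02e6),
  PM2.5 = c(72.1, 74.8, 78.3, 81.0, 84.6)
)

# is.na() returns a logical matrix; summing it counts missing cells overall,
# and colSums() gives the count of missing values per column
total_missing     <- sum(is.na(PM))
missing_by_column <- colSums(is.na(PM))

print(total_missing)       # 0 for this toy data
print(missing_by_column)
```

Counting with sum()/colSums() is usually more useful than printing the raw logical matrix that is.na() returns, especially on a larger dataset.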
Exploratory Data Analysis (EDA): We checked the summary and structure of the dataset to understand its characteristics and confirmed that there are no missing values. This helps us identify coefficients, p-values, and other relevant statistics from the data.
Linear Regression Model: Created a linear regression model using the lm function with "PM2.5" as the dependent variable and "CO2" as the independent variable.
Residual Analysis: Plotted the residuals against the fitted values to check for any patterns or outliers. Checked the density plot and quantile-quantile (QQ) plot for normality of residuals.
Model Evaluation: Applied the trained model to the test data to make predictions. Calculated Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) to evaluate the performance of the model on the test data.
Visualization: Plotted the residuals against fitted values and checked for normality using a QQ plot.
Outcomes and Plots
1. Linear regression Functions summary
Call:
lm(formula = PM2.5 ~ CO2, data = PM)

Residuals:
    Min      1Q  Median      3Q     Max
 -6.522  -3.988  -1.033   3.965   9.402

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.015e+01  6.861e+00   8.766 5.24e-06 ***
CO2         1.197e-05  3.471e-06   3.447  0.00626 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Statistical Significance:
The p-value for the CO2 coefficient (0.006255) is less than the typical significance level of 0.05. This provides evidence to reject the null hypothesis; in other words, CO2 appears to be a statistically significant predictor of PM2.5 levels.
Model Fit: R-squared (0.543) is the proportion of the variance in the dependent variable (PM2.5) that is predictable from the independent variable (CO2). An R-squared of 0.543 indicates that the model explains 54.3% of the variation in PM2.5.
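The quantities discussed above can be pulled directly out of the fitted model object. A sketch on simulated data (the seed, sample size, and coefficients below are illustrative assumptions, not the project's data):

```r
# Simulated CO2/PM2.5 data roughly mimicking the report's setup
set.seed(1)
CO2   <- seq(1.6e6, 2.4e6, length.out = 11)               # 11 yearly values, 2010-2020
PM2.5 <- 60 + 1.2e-05 * CO2 + rnorm(11, sd = 4)           # linear trend plus noise
PM    <- data.frame(CO2 = CO2, PM2.5 = PM2.5)

fit <- lm(PM2.5 ~ CO2, data = PM)
s   <- summary(fit)

# Extract the statistics the report quotes
p_value_co2 <- s$coefficients["CO2", "Pr(>|t|)"]  # p-value for the CO2 slope
r_squared   <- s$r.squared                         # proportion of variance explained

print(p_value_co2)
print(r_squared)
```

Extracting values programmatically like this avoids retyping numbers from the printed summary.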
Density Plot of Residuals:
This provides a visual check for the normality of residuals. Our model assumptions are generally valid if the residuals are approximately normally distributed.
> print(RMSE)
[1] 6.826735
> print(MAE)
[1] 5.602852
Root Mean Square Error (RMSE): The RMSE is a measure of the average magnitude of the errors between predicted and actual values. In our case, an RMSE of 6.826735 suggests that, on average, our model's predictions are off by approximately 6.83 units of PM2.5.
Mean Absolute Error (MAE): The MAE is another measure of prediction accuracy. An MAE of 5.602852 indicates the average absolute difference between predicted and actual PM2.5 values.
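The two metrics have simple closed forms, which can be computed in base R without the Metrics package. A minimal sketch with made-up actual/predicted values:

```r
# Illustrative actual and predicted PM2.5 values (not the project's data)
actual    <- c(72.0, 80.5, 95.2, 70.5)
predicted <- c(75.1, 78.0, 90.0, 76.2)

# RMSE: square root of the mean squared error
rmse_val <- sqrt(mean((actual - predicted)^2))

# MAE: mean of the absolute errors
mae_val <- mean(abs(actual - predicted))

print(rmse_val)   # approximately 4.34
print(mae_val)    # 4.125
```

Because the errors are squared before averaging, RMSE penalizes large errors more heavily than MAE, which is why RMSE is typically the larger of the two.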
The graph shows the density of the residual values of the PM2.5 model, i.e., how far the fitted values deviate from the actual values.
In the Q-Q plot, the points do not fall exactly on a straight line: they are scattered around the 45° reference line rather than aligned with it, indicating that the residuals are only approximately normally distributed, with some deviation from normality. We also calculated the normalized RMSE: NRMSE = RMSE / (highest value of PM2.5 − lowest value of PM2.5) = 6.826735 / (95.24 − 70.47) = 0.2756.
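The normalization step above is a one-liner in R, reproducing the figures quoted in the text:

```r
# Values taken from the report: RMSE and the observed PM2.5 range
rmse_val <- 6.826735
pm_max   <- 95.24
pm_min   <- 70.47

# NRMSE: RMSE scaled by the range of the observed variable
nrmse <- rmse_val / (pm_max - pm_min)

print(round(nrmse, 4))   # 0.2756
```

Range-normalizing the RMSE makes the error comparable across variables measured on different scales; here the model's typical error is about 28% of the observed PM2.5 range.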
CODE BLOCK:
library(tidyverse)
library(dplyr)
library(car)
install.packages("Metrics")
library(Metrics)
library(caret)
library(lmtest)

# Creating a data frame by reading the CSV file
PM <- as.data.frame(read.csv("C:/Users/dell/Desktop/data1.csv"))
head(PM)
summary(PM)
str(PM)

# Checking for unavailable data in the data frame; found no missing data
sum(is.na(PM))

# Checking column names of the data frame
colnames(PM)

# Fitting the linear regression model
PM_predict <- lm(PM2.5 ~ CO2, data = PM)
summary(PM_predict)

# Check for heteroscedasticity: residuals (y) against fitted values (x)
plot(PM_predict$fitted.values, PM_predict$residuals)

# Density plot of residuals
plot(density(PM_predict$residuals))

# Check for normality of residuals
qqnorm(PM_predict$residuals)

# Splitting the data into training and testing sets
set.seed(123)
train_index <- sample(1:nrow(PM), size = nrow(PM) * 0.8)
train_data  <- PM[train_index, ]
test_data   <- PM[-train_index, ]
head(train_data)
head(test_data)

# Fitting the lm model on the training data
trained_model <- lm(PM2.5 ~ CO2, data = train_data)

# Applying the trained model to the test data
predictions <- trained_model %>% predict(test_data)

# Root Mean Square Error and Mean Absolute Error
# (Metrics::rmse and Metrics::mae take the actual values first)
RMSE <- rmse(test_data$PM2.5, predictions)
MAE  <- mae(test_data$PM2.5, predictions)
print(test_data$PM2.5)
print(predictions)
print(RMSE)
print(MAE)
Conclusion:
In this analysis, a linear regression model was developed to explore the relationship between CO2 and PM2.5 (an air quality indicator). The model revealed a statistically significant association between the variables.
• CO2 emerges as a significant predictor, influencing PM2.5 levels.
• While the model explains 54.3% of the variability, deviations from the normal distribution, as seen in the Q-Q plot, introduce uncertainty.
• Ongoing refinement is crucial to improve predictive accuracy and ensure the reliability of environmental analytics.