Understanding interactions between the predictors and response variable

Ankush Arya

Data Scientist Partnering with Experts to Enhance Public Health Safety and Innovation!

Published: November 19, 2022

Introduction

Before diving into modelling, I'll show you how to do some exploratory data analysis and document your findings. In this case study, I systematically investigate the relationships between the predictors and the outcome. Reviewing the data and communicating your conclusions are essential steps in the data analysis process, but they take time and effort that you may not have allotted. Here is the GitHub repo URL to get the entire R code:

Exploratory Data Analysis:

The skincells dataset contains 118 observations from an experiment on the effect of solar radiation on the mortality of human skin cells. The dataset has three variables. The response variable for this analysis is the logarithm (base 2) of the extrapolated number of live cells in the colony. The two predictors are the day on which the experiment was run (it was repeated over a number of days) and the radiation exposure time in minutes.

Figure 1 shows that Day 1 has the highest average extrapolated live-cell count in the experiment; the other days have lower averages. Figure 2 shows box plots of the log of the number of cells by exposure time, sorted by median in descending order; colonies exposed for 2, 3, and 3.5 minutes look similar. The outliers at times of 1.5 and 2.5 minutes pull the mean away from the median.

Model 1

A linear regression model was fitted with logcells as the response variable, day as a categorical predictor, and time as a continuous predictor. This first model includes the interactions between day and time.

The model was:

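$$y_i = \beta_0 + \beta_1 \delta_{i,2} + \beta_2 \delta_{i,3} + \beta_3 \delta_{i,4} + \beta_4\,\text{time}_i + \beta_5 (\delta_{i,2} \times \text{time}_i) + \beta_6 (\delta_{i,3} \times \text{time}_i) + \beta_7 (\delta_{i,4} \times \text{time}_i) + \varepsilon_i$$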

where $y_i$ is the log of the number of live cells for the ith observation, and $\delta_{i,2}$, $\delta_{i,3}$ and $\delta_{i,4}$ are the dummy variables for Day 2, Day 3 and Day 4 respectively. The value of $\delta_{i,2}$ is 1 if and only if observation i was taken on Day 2 and zero otherwise, and similarly for $\delta_{i,3}$ and $\delta_{i,4}$. $\varepsilon_i$ is the error term for the ith observation. The equations for the four days are as follows:
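To make the dummy coding concrete, here is a minimal Python sketch of how one observation maps to a row of the Model 1 design matrix. The author's own code is in R; the helper name `design_row` is hypothetical and only illustrates the coding described above, with Day 1 as the baseline.

```python
# Hypothetical helper: build one row of the Model 1 design matrix.
# Column order follows the coefficients beta0..beta7 in the model equation.
def design_row(day, time):
    """Return [1, d2, d3, d4, time, d2*time, d3*time, d4*time] for one observation."""
    if day not in (1, 2, 3, 4):
        raise ValueError("day must be 1, 2, 3 or 4")
    # Day 1 is the baseline, so it has no dummy variable of its own.
    dummies = [1.0 if day == k else 0.0 for k in (2, 3, 4)]
    # Each interaction column is a day dummy multiplied by the exposure time.
    interactions = [d * time for d in dummies]
    return [1.0] + dummies + [float(time)] + interactions


print(design_row(1, 2.5))  # baseline day: all dummies and interactions are zero
print(design_row(3, 2.5))  # Day 3: only the Day 3 dummy and its interaction are non-zero
```

Multiplying such a row by the coefficient vector gives the fitted logcells for that observation.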

Day 1: $y_i = \beta_0 + \beta_4\,\text{time}_i \Rightarrow 8.03201 - 1.7451\,\text{time}_i$

Day 2: $y_i = \beta_0 + \beta_1 + (\beta_4 + \beta_5)\,\text{time}_i \Rightarrow 8.03201 - 2.4342 + (-1.7451 + 0.6168)\,\text{time}_i$

Day 3: $y_i = \beta_0 + \beta_2 + (\beta_4 + \beta_6)\,\text{time}_i \Rightarrow 8.03201 - 1.8085 + (-1.7451 + 0.5065)\,\text{time}_i$

Day 4: $y_i = \beta_0 + \beta_3 + (\beta_4 + \beta_7)\,\text{time}_i \Rightarrow 8.03201 - 1.4723 + (-1.7451 + 0.4055)\,\text{time}_i$
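The day-wise equations can be collapsed into a single intercept and slope per day. The following is a small Python sketch of that arithmetic, using only the coefficient estimates quoted above; the function names `day_line` and `predict` are hypothetical.

```python
# Coefficient estimates for Model 1, as reported in the text.
b0 = 8.03201                                                  # intercept (Day 1 baseline)
day_offsets = {1: 0.0, 2: -2.4342, 3: -1.8085, 4: -1.4723}    # beta1..beta3
b4 = -1.7451                                                  # time slope on Day 1
slope_offsets = {1: 0.0, 2: 0.6168, 3: 0.5065, 4: 0.4055}     # beta5..beta7


def day_line(day):
    """Return (intercept, slope) of the fitted line for the given day."""
    return (b0 + day_offsets[day], b4 + slope_offsets[day])


def predict(day, time):
    """Fitted logcells for a given day and exposure time in minutes."""
    intercept, slope = day_line(day)
    return intercept + slope * time


for day in (1, 2, 3, 4):
    intercept, slope = day_line(day)
    print(f"Day {day}: logcells = {intercept:.5f} + ({slope:.4f}) * time")
```

Day 1 has both the highest intercept and the steepest negative slope, which is why the fitted lines move closer together as time increases.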

A test of the null hypothesis H0: β5 = β6 = β7 = 0 can be interpreted as a test of the ‘common slopes assumption’, i.e., that the slope is the same on all four days while the intercepts may differ. This hypothesis was tested using the standard formulation for a linear hypothesis (see R code). Separately, the overall F-test of the model, whose null hypothesis is that all seven coefficients β1 to β7 are zero, gave a statistic of 8.109 on 7 and 110 degrees of freedom and a p-value of 6.194e-08, far below the 0.05 alpha value. We therefore reject that overall null hypothesis and conclude that at least one predictor is related to the response variable. The fitted lines converge, as shown in Figure 3.

Findings: An interaction allows the effect of one predictor to differ depending on the value of another predictor. The test of the interaction terms (see R code) yields a p-value of 0.7209. Therefore, we fail to reject the null hypothesis H0: β5 = β6 = β7 = 0 and conclude that the interaction between day and time is not statistically significant.
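For reference, the textbook F statistic for comparing two nested linear models is sketched below. It is an assumption that the R code uses an equivalent formulation. The symbols $\mathrm{RSS}_0$ and $\mathrm{RSS}_1$ (the residual sums of squares of the common slopes model and the interaction model) are introduced here for illustration, and their values are not reported in this article.

```latex
% q = 3 restrictions (beta5 = beta6 = beta7 = 0),
% n = 118 observations, p = 8 coefficients in the interaction model.
\[
F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/q}{\mathrm{RSS}_1/(n - p)}
  = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/3}{\mathrm{RSS}_1/110},
\qquad
F \sim F_{3,\,110} \ \text{under } H_0 .
\]
```

A large p-value from this comparison, such as the 0.7209 reported above, means the three extra interaction terms do not reduce the residual sum of squares by more than chance would explain.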

Model 2

The following common slopes model was then fitted to the data:

$$y_i = \beta_0 + \beta_1 \delta_{i,2} + \beta_2 \delta_{i,3} + \beta_3 \delta_{i,4} + \beta_4\,\text{time}_i + \varepsilon_i$$

where $y_i$ is the log of the number of live cells for the ith observation, and $\delta_{i,2}$, $\delta_{i,3}$ and $\delta_{i,4}$ are the dummy variables for Day 2, Day 3 and Day 4 respectively. The value of $\delta_{i,2}$ is 1 if and only if observation i was taken on Day 2 and zero otherwise, and similarly for $\delta_{i,3}$ and $\delta_{i,4}$.

The above equation translates into the following day-wise equations:

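Day 1: $y_i = \beta_0 + \beta_4\,\text{time}_i \Rightarrow 7.7248 - 1.3615\,\text{time}_i$

Day 2: $y_i = \beta_0 + \beta_1 + \beta_4\,\text{time}_i \Rightarrow 7.7248 - 1.4451 - 1.3615\,\text{time}_i$

Day 3: $y_i = \beta_0 + \beta_2 + \beta_4\,\text{time}_i \Rightarrow 7.7248 - 1.0014 - 1.3615\,\text{time}_i$

Day 4: $y_i = \beta_0 + \beta_3 + \beta_4\,\text{time}_i \Rightarrow 7.7248 - 0.8378 - 1.3615\,\text{time}_i$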

Findings: The lm() function (see R code) is used to test the hypothesis H0: β1 = β2 = β3 = β4 = 0, yielding a p-value of 2.413e-09. The effect of time is statistically significant: the slope parameter is -1.3615, with a p-value of 2.02e-10. This model indicates that the time predictor is required, and that with every additional minute of exposure, logcells decreases by 1.3615 units. The effect of the Day 2 variable is borderline statistically significant, with a p-value of 0.0248, suggesting that logcells on Day 2 differs from Day 1. The other parameters are not statistically significant.
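Because the response is a base-2 logarithm, the time slope can also be read as a multiplicative change in the number of live cells. Here is a minimal Python sketch of that back-transformation, using the Model 2 slope from the text; the function name `fold_change` is hypothetical.

```python
# Model 2 time slope on the log2 scale, as reported in the text.
slope = -1.3615


def fold_change(slope_log2, minutes=1.0):
    """Multiplicative change in cell count after `minutes` of extra exposure."""
    # A change of s on the log2 scale multiplies the raw count by 2**s.
    return 2.0 ** (slope_log2 * minutes)


per_minute = fold_change(slope)
print(f"Each extra minute multiplies the live-cell count by about {per_minute:.2f}")
```

Under the common slopes model, each additional minute of exposure leaves roughly 39% of the live cells, whatever the day.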

Figure 4 illustrates the fitted lines, which are parallel with different intercepts.

Testing the categorical day variable in this model yields a p-value of 0.1494, which is greater than the alpha value of 0.05; hence, we fail to reject the hypothesis H0: β1 = β2 = β3 = 0 and conclude that there is not enough evidence that the predictor day is related to the response variable logcells.

Model 3

A quadratic effect of time was tested for evidence of a non-straight-line relationship between logcells and time. The categorical variable day is not included in the model.

The model was:

$$y_i = \beta_0 + \beta_1\,\text{time}_i + \beta_2\,\text{time}_i^2 + \varepsilon_i$$

Findings: Testing the hypothesis H0: β1 = β2 = 0 yields a p-value of 3.114e-14, which is extremely low. Therefore, we reject the null hypothesis and conclude that the model is statistically significant. The linear coefficient of time is -4.3805. Because the model also contains a squared term, this is the slope at time zero rather than a constant decrease per minute; the change in the log of the number of live cells per additional minute depends on the exposure time.
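To see how the effect of time behaves in the quadratic model, differentiate the fitted curve with respect to time. This short derivation uses only the model equation; the estimate of β2 is not reported in this article, so no numerical slope is given.

```latex
% Marginal effect of time in the quadratic model
\[
\frac{\partial\, \mathrm{E}[y_i]}{\partial\, \text{time}_i}
  = \beta_1 + 2\beta_2\,\text{time}_i
\]
% At time_i = 0 the slope equals beta_1; elsewhere it changes linearly with time_i.
```

The slope equals β1 only at time zero. If β2 is positive, the decline in logcells flattens as exposure time increases.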

In Figure 5, the quadratic effect of time is shown along with the straight line from the common slopes model. The curved line appears to describe the data better.

Conclusion:

The data analysis considered the two predictors in the data set, time (continuous) and day (categorical), and checked for a possible interaction between them. The slopes for Day 1, Day 2, Day 3 and Day 4 were not statistically different (p-value = 0.7209). We conclude that there is not enough evidence that the effect of one predictor varies with the value of the other.

Next, each predictor was tested for inclusion in the model. The average log of the extrapolated number of live cells in the colony was not found to differ significantly between days (p-value = 0.1494). The time variable was found to be statistically significant (p-value = 2.016e-10).

In conclusion, Day 1 yields the highest logcells, but there is no evidence that Day 2, Day 3 or Day 4 gives a different number of live cells. Further, there is strong evidence that logcells decreases with the number of minutes the colony was exposed to radiation in the solar simulator. The quadratic (degree-two polynomial) model appears to describe the data better than a straight-line fit.
