Understanding interactions between the predictors and response variable

Introduction

Before diving into modelling, it pays to do some exploratory data analysis and document the findings. In this case study, I systematically investigate the relationships between the predictors and the response variable. Reviewing the data and communicating your conclusions are essential steps in the data analysis process, but they take time and effort that you may not have allotted. The entire R code is available in the GitHub repo.

Exploratory Data Analysis:

The skincells dataset contains 118 observations from an experiment on the effect of solar radiation on the mortality of human skin cells. The dataset has three variables. The response variable for this analysis is the logarithm (base 2) of the extrapolated number of live cells in the colony (logcells). The two predictors are the day on which the experiment was run (it was repeated over four days) and the duration of radiation exposure in minutes (time).

Figure 1 shows that Day 1 has the highest average log number of live cells; the other days have lower averages. Figure 2 shows box plots of the log number of cells by exposure time, sorted by median in descending order; colonies exposed for 2, 3, and 3.5 minutes have similar distributions. Outliers at exposure times of 1.5 and 2.5 minutes pull the mean away from the median.

Model 1

A linear regression model was fitted with logcells as the response variable and with the categorical variable day and the continuous variable time as predictors. This first model also includes the interaction between day and time.

The model was:

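$$y_i = \beta_0 + \beta_1 \delta_{i,2} + \beta_2 \delta_{i,3} + \beta_3 \delta_{i,4} + \beta_4\,\mathrm{time}_i + \beta_5 (\delta_{i,2} \times \mathrm{time}_i) + \beta_6 (\delta_{i,3} \times \mathrm{time}_i) + \beta_7 (\delta_{i,4} \times \mathrm{time}_i) + \varepsilon_i$$

where $y_i$ is the log of the number of live cells for the ith observation, and $\delta_{i,2}$, $\delta_{i,3}$ and $\delta_{i,4}$ are the dummy variables for Day 2, Day 3 and Day 4 respectively. The value of $\delta_{i,2}$ is 1 if and only if observation i was made on Day 2 and zero otherwise, and similarly for $\delta_{i,3}$ and $\delta_{i,4}$. $\varepsilon_i$ is the error term for the ith observation. The fitted equations for the four days, with $\hat{y}_i$ denoting the fitted value, are as follows:

Day 1: $\hat{y}_i = \beta_0 + \beta_4\,\mathrm{time}_i \;\Rightarrow\; 8.03201 - 1.7451\,\mathrm{time}_i$

Day 2: $\hat{y}_i = \beta_0 + \beta_1 + (\beta_4 + \beta_5)\,\mathrm{time}_i \;\Rightarrow\; 8.03201 - 2.4342 + (-1.7451 + 0.6168)\,\mathrm{time}_i$

Day 3: $\hat{y}_i = \beta_0 + \beta_2 + (\beta_4 + \beta_6)\,\mathrm{time}_i \;\Rightarrow\; 8.03201 - 1.8085 + (-1.7451 + 0.5065)\,\mathrm{time}_i$

Day 4: $\hat{y}_i = \beta_0 + \beta_3 + (\beta_4 + \beta_7)\,\mathrm{time}_i \;\Rightarrow\; 8.03201 - 1.4723 + (-1.7451 + 0.4055)\,\mathrm{time}_i$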
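
The per-day arithmetic can be checked with a few lines of Python (the article's own analysis is in R). This is only a sketch: the coefficients are the Model 1 estimates quoted in the day equations, while the function names `design_row`, `predict` and `day_line` are made up for this illustration and assume Day 1 is the baseline level.

```python
# Model 1 estimates quoted in the article, in the order beta0 .. beta7.
BETA = [8.03201, -2.4342, -1.8085, -1.4723, -1.7451, 0.6168, 0.5065, 0.4055]


def design_row(day, time):
    """Design-matrix row for one observation (hypothetical helper).

    Layout: intercept, dummies for Days 2-4, time, dummy x time interactions.
    Day 1 is the baseline, so all of its dummies are zero.
    """
    dummies = [1.0 if day == k else 0.0 for k in (2, 3, 4)]
    return [1.0] + dummies + [time] + [d * time for d in dummies]


def predict(day, time):
    """Fitted log2 cell count for a given day and exposure time in minutes."""
    return sum(b * x for b, x in zip(BETA, design_row(day, time)))


def day_line(day):
    """Intercept and slope of the fitted straight line for one day."""
    intercept = predict(day, 0.0)
    slope = predict(day, 1.0) - intercept
    return intercept, slope


for day in (1, 2, 3, 4):
    intercept, slope = day_line(day)
    print(f"Day {day}: logcells = {intercept:.5f} + ({slope:.4f}) x time")
```

Collapsing each day to a single intercept and slope makes it easier to see that Day 1 has both the highest intercept and the steepest decline.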

The overall F-test of the model ($H_0: \beta_1 = \beta_2 = \dots = \beta_7 = 0$) gave a statistic of 8.109 on 7 and 110 degrees of freedom and a p-value of 6.194e-08, far below the 0.05 significance level. We therefore reject this null hypothesis and conclude that at least one predictor is related to the response variable. Separately, a test of the null hypothesis $H_0: \beta_5 = \beta_6 = \beta_7 = 0$ can be interpreted as a test of the 'common slopes assumption', i.e., the slope is the same for all four days, although the intercepts might differ. This hypothesis was tested using the standard formulation for a linear hypothesis (see R code). Geometrically, the fitted lines for the four days converge, as shown in Figure 3.

Findings: An interaction allows the effect of one predictor to differ depending on the value of another predictor. The test of the interaction terms ($H_0: \beta_5 = \beta_6 = \beta_7 = 0$; see R code) yields a p-value of 0.7209. We therefore fail to reject the null hypothesis and conclude that the day-by-time interaction is not statistically significant, so there is no evidence against a common slope.
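
For reference, a test of this kind is usually computed as a partial F-test comparing the restricted (common slopes) model with the full interaction model. The sketch below assumes that standard formulation; the article does not report the residual sums of squares, so the symbols are left unevaluated. Here $\mathrm{RSS}_0$ and $\mathrm{RSS}_1$ are the residual sums of squares of the restricted and full models, $q$ is the number of restrictions, $n$ the number of observations and $p$ the number of parameters in the full model.

```latex
F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/q}{\mathrm{RSS}_1/(n - p)}
\sim F_{q,\,n-p} \quad \text{under } H_0,
\qquad q = 3,\; n = 118,\; p = 8 \;\Rightarrow\; F_{3,\,110}
```

With three interaction coefficients set to zero and 118 observations, the reference distribution has 3 and 110 degrees of freedom.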

Model 2

The following common slopes model was then fitted to the data:

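$$y_i = \beta_0 + \beta_1 \delta_{i,2} + \beta_2 \delta_{i,3} + \beta_3 \delta_{i,4} + \beta_4\,\mathrm{time}_i + \varepsilon_i$$

where $y_i$ is the log of the number of live cells for the ith observation, $\delta_{i,2}$, $\delta_{i,3}$ and $\delta_{i,4}$ are the dummy variables for Day 2, Day 3 and Day 4 respectively, and $\varepsilon_i$ is the error term. The value of $\delta_{i,2}$ is 1 if and only if observation i was made on Day 2 and zero otherwise, and similarly for $\delta_{i,3}$ and $\delta_{i,4}$.

The fitted equations for each day, with $\hat{y}_i$ denoting the fitted value, are:

Day 1: $\hat{y}_i = \beta_0 + \beta_4\,\mathrm{time}_i \;\Rightarrow\; 7.7248 - 1.3615\,\mathrm{time}_i$

Day 2: $\hat{y}_i = \beta_0 + \beta_1 + \beta_4\,\mathrm{time}_i \;\Rightarrow\; 7.7248 - 1.4451 - 1.3615\,\mathrm{time}_i$

Day 3: $\hat{y}_i = \beta_0 + \beta_2 + \beta_4\,\mathrm{time}_i \;\Rightarrow\; 7.7248 - 1.0014 - 1.3615\,\mathrm{time}_i$

Day 4: $\hat{y}_i = \beta_0 + \beta_3 + \beta_4\,\mathrm{time}_i \;\Rightarrow\; 7.7248 - 0.8378 - 1.3615\,\mathrm{time}_i$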

Findings: The lm() function (see R code) was used to fit the model. The overall F-test of $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$ yields a p-value of 2.413e-09. The effect of time is statistically significant: the slope estimate is -1.3615, with a p-value of 2.02e-10. Time is therefore needed in the model, and each additional minute of exposure is associated with a decrease of 1.3615 units in logcells. The Day 2 coefficient is borderline statistically significant (p-value 0.0248), which suggests that logcells on Day 2 differ from Day 1. The other day coefficients are not statistically significant.
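
Because the response is a base-2 logarithm, a difference on the logcells scale corresponds to a multiplicative factor on the cell count itself. The Python sketch below (the analysis itself is in R) uses the Model 2 estimates quoted in the article; the names `fitted_logcells` and `fold_change` are hypothetical helpers introduced only for this illustration.

```python
# Model 2 (common slopes) estimates quoted in the article.
INTERCEPT = 7.7248
DAY_SHIFT = {1: 0.0, 2: -1.4451, 3: -1.0014, 4: -0.8378}  # Day 1 is the baseline
TIME_SLOPE = -1.3615


def fitted_logcells(day, minutes):
    """Fitted log2 cell count under the common slopes model."""
    return INTERCEPT + DAY_SHIFT[day] + TIME_SLOPE * minutes


def fold_change(log2_difference):
    """Convert a difference on the log2 scale into a multiplicative factor."""
    return 2.0 ** log2_difference


per_minute = fold_change(TIME_SLOPE)
day2_vs_day1 = fold_change(DAY_SHIFT[2])

print(f"Each extra minute multiplies the fitted cell count by about {per_minute:.2f}")
print(f"Day 2 vs Day 1 at the same exposure: factor of about {day2_vs_day1:.2f}")
```

Read this way, the slope of -1.3615 means each additional minute of exposure leaves well under half of the fitted live cell count of the minute before.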

Figure 4 shows the fitted lines for the four days: they are parallel (a common slope) with different intercepts.

Testing the categorical day variable as a whole yields a p-value of 0.1494, which is above the significance level of 0.05; hence, we fail to reject the hypothesis $H_0: \beta_1 = \beta_2 = \beta_3 = 0$ and conclude that there is no evidence that day is related to the response variable logcells.

Model 3

A quadratic effect of time was tested for evidence of a non-straight-line relationship between logcells and time. The categorical variable day is not included in the model.

The model was:

$$y_i = \beta_0 + \beta_1\,\mathrm{time}_i + \beta_2\,\mathrm{time}_i^2 + \varepsilon_i$$

Findings: Testing the hypothesis $H_0: \beta_1 = \beta_2 = 0$ yields a p-value of 3.114e-14, which is extremely low. We therefore reject the null hypothesis and conclude that the model is statistically significant. The estimated linear coefficient of time is -4.3805. Because the model also contains a squared term, this is the slope of the fitted curve at time = 0 rather than a constant decrease per minute; the change in logcells per additional minute depends on the exposure time.
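
To make the role of the squared term explicit, the rate of change of the expected response under the quadratic model can be written out. The estimate of $\beta_2$ is not reported in the article, so the expressions below are left in symbols; $\mathrm{time}^{*}$ is notation introduced here for the turning point of the fitted curve.

```latex
\frac{\partial\,\mathrm{E}[y_i]}{\partial\,\mathrm{time}_i}
  = \beta_1 + 2\beta_2\,\mathrm{time}_i ,
\qquad
\mathrm{time}^{*} = -\frac{\beta_1}{2\beta_2}
```

At $\mathrm{time}_i = 0$ the rate of change equals $\beta_1$; with a negative $\beta_1$ and a positive $\beta_2$ the decline would flatten as exposure increases, which is the kind of curvature a straight line cannot capture.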

In Figure 5, the quadratic effect of time is shown along with the straight line from the common slopes model. The curve appears to describe the data better than the straight line.

Conclusion:

The analysis considered the two predictors in the dataset, time (continuous) and day (categorical), and checked for a possible interaction between them. The slopes for Day 1, Day 2, Day 3 and Day 4 were not statistically different (p-value = 0.7209), so there is not enough evidence that the effect of one predictor varies with the value of the other.

Next, the predictors were tested for inclusion in the model. The average log number of live cells did not differ significantly between days (p-value = 0.1494). The time variable was found to be statistically significant (p-value = 2.016e-10).

In conclusion, Day 1 yields the highest logcells, but there is no evidence that Day 2, Day 3 or Day 4 gives a different number of live cells. Further, there is strong evidence that logcells decrease with the number of minutes the colony was exposed to radiation in the solar simulator. The quadratic (degree-two polynomial) model fits the data better than a straight line.
