4 - Graphically Exploring Relationships or Associations in SAS Before Model Building
GÖKHAN YAZGAN
PL-300 Microsoft Certified Power BI Data Analyst Associate | Global SAS Certified Specialist: Base Programming Using SAS 9.4
Before building a model, we should graphically examine the relationships between the response and the candidate predictor variables so that we can choose suitable predictors. This exploration guides model building.
Visually Exploring Associations Between Variables in the Context of ANOVA with Box Plots
In ANOVA the response is continuous and the predictor is categorical. We create box plots to look for potential associations, with the response variable on the y-axis and the predictor variable on the x-axis.
Imagine that we want to build a model explaining the main factors that affect cholesterol levels. Let's start by examining the heart data from the sashelp library.
We use PROC CONTENTS to get a general idea about the data.
proc contents data=sashelp.heart varnum;
run;
In the first table we see that our dataset has 5209 observations and 17 variables.
In the third and last table we see the list of variables in our heart data: each variable's name, type, length, and label, if one exists.
In this example, let's say we are curious about what affects cholesterol, so variable number 13 (Cholesterol) will be our numeric response variable. In the ANOVA setting we should think about the categorical variables that could have an effect on cholesterol. Let's choose Sex, variable number 4, for this example and examine the cholesterol measurements across its categories.
Now let's run another step, similar to PROC CONTENTS, that prints the variable list to the log so that we can easily copy and paste names while we write code.
proc sql number;
describe table sashelp.heart;
select * from sashelp.heart(obs=10);
quit;
This code writes the table definition, including the list of variables, to the log and shows the first 10 rows of the dataset, which gives us an idea of what is in sashelp.heart.
Next, let's make sure that there are no missing values in the Cholesterol variable and look at the distinct values of the Sex variable.
First we filter out the missing values of Cholesterol and create our own dataset in the temporary WORK library. This way we can do any kind of data cleaning and manipulation without changing the original dataset, which is a good habit.
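If we want the variable list as a table rather than as log text, one option is to query the DICTIONARY.COLUMNS view. The step below is only a sketch: it sorts the variables by type so that the character variables (candidate categorical predictors) and the numeric variables (candidate continuous predictors) are grouped together.

```sas
/* List the variables of sashelp.heart from the dictionary view.     */
/* libname and memname are stored in upper case in DICTIONARY tables. */
proc sql;
  select name, type, length, label
    from dictionary.columns
    where libname = 'SASHELP' and memname = 'HEART'
    order by type, varnum;
quit;
```

The type column holds 'char' or 'num', so the character variables appear first in the output.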
proc sql;
create table work.heart as
select * from sashelp.heart
where cholesterol is not null;
quit;
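A quick way to confirm what the filter did is to count the missing and non-missing values before and after. The sketch below uses PROC MEANS with the N and NMISS statistics. It assumes the work.heart dataset created in the step above.

```sas
/* Non-missing (N) and missing (NMISS) cholesterol values in the original data */
proc means data=sashelp.heart n nmiss;
  var cholesterol;
run;

/* Same check on the filtered copy; NMISS is expected to be 0 here */
proc means data=work.heart n nmiss;
  var cholesterol;
run;
```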
Now we have 5057 observations with no missing values in the Cholesterol variable. Next, let's see how many distinct values the Sex variable has and check whether any of them are missing. From now on we will use the heart dataset that we created in the WORK library.
proc sql;
select distinct SEX
from work.heart;
quit;
So we have no missing entries, and two categories in the sex variable. Now we are ready to graphically explore relationships between variables.
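As an alternative to SELECT DISTINCT, PROC FREQ shows the categories together with their counts. This is a small sketch, again assuming the work.heart dataset created earlier. The MISSING option makes missing values appear as their own level, so they cannot be overlooked.

```sas
/* Frequency of each Sex category; MISSING treats missing values as a level */
proc freq data=work.heart;
  tables sex / missing;
run;
```

Very unbalanced group sizes would also show up here, which is useful to know before comparing box plots.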
We will use PROC SGPLOT to achieve our goal.
proc sgplot data=work.heart;
vbox cholesterol / category=sex connect=mean;
title "Cholesterol Differences across Sex";
run;
Our graph shows a slight difference in cholesterol between the sexes (a two-sided t-test also indicates a difference). We connect the group means with a line: the steeper the slope of this line, the stronger the apparent association, and the more reason we have to consider adding the variable to the model.
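The t-test mentioned above can be run with PROC TTEST. The step below is a minimal sketch, not necessarily the exact call used for this article: it compares mean cholesterol between the two Sex groups in work.heart and suppresses the default plots with PLOTS=NONE.

```sas
/* Two-sample t-test of cholesterol by Sex (two-sided by default) */
title "Two-Sample t-Test of Cholesterol by Sex";
proc ttest data=work.heart plots=none;
  class sex;
  var cholesterol;
run;
title;  /* clear the title */
```

The output reports both the pooled and the Satterthwaite versions of the test, along with a test for equality of variances that helps us decide which one to read.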
We can get the graphs for all categorical variables at once by using a macro, so that we don't miss anything. We choose the appropriate categorical variables from the variable list and look for relationships in the graphs.
%let categorical = Smoking_Status Weight_Status BP_Status;

/* Draw one box plot of &response for each variable listed in &charvar */
%macro box(dsn = ,
           response = ,
           charvar = );
  %local i var;
  %let i = 1;
  /* %scan returns an empty string once the list is exhausted */
  %do %while(%scan(&charvar,&i,%str( )) ^= %str());
    %let var = %scan(&charvar,&i,%str( ));
    proc sgplot data=&dsn;
      vbox &response / category=&var
                       grouporder=ascending
                       connect=mean;
      title "&response across Levels of &var";
    run;
    %let i = %eval(&i + 1);
  %end;
%mend box;

%box(dsn = work.heart,
     response = cholesterol,
     charvar = &categorical);
title;  /* clear the title so it does not carry over to later output */
We get three graphs because we listed three variables in the categorical macro variable. Let's take a look at one of them: cholesterol across the blood pressure groups (BP_Status). The differences between groups are more obvious here.
These graphs give us ideas about which variables to add to our model, but we mustn't depend on them alone. Other effects are possible, so we will also use correlation tables.
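To put numbers next to what the box plot shows, we can summarize cholesterol within each blood pressure group. The step below is a sketch that assumes work.heart from the earlier steps. The choice of statistics is ours and can be changed freely.

```sas
/* Group size, mean, median and standard deviation of cholesterol per BP_Status level */
proc means data=work.heart n mean median std maxdec=1;
  class BP_Status;
  var cholesterol;
run;
```

The MEAN column corresponds to the points joined by the CONNECT=MEAN line in the box plot.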
Visually Exploring Associations Between Variables in the Context of Linear Regression with Scatter Plots
When we want to investigate the relationship between a continuous response variable and one or more continuous predictor variables, we can use scatter plots.
From the list of variables we can choose the numeric variables as potential predictors. Let's do that:
%let interval= Height Weight Diastolic Systolic Smoking ;
options nolabel;
proc sgscatter data=work.heart;
plot cholesterol*(&interval) / reg;
title "Associations of Interval Variables with Cholesterol";
run;
The result shows a positive association between Cholesterol and both Diastolic and Systolic blood pressure.
In conclusion, we mustn't use these plots exclusively to decide which variables to include in our model. They show only the simple relationship between one predictor variable and the response variable. When we start putting multiple variables in the model, the picture of associations can become very different, so we must take other things into consideration, such as the correlations among the variables. Still, this is a good start, and after this analysis we may even decide to examine another variable as the response.
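The visual impression from the scatter plots can be checked numerically with PROC CORR. This is a minimal sketch under the same assumptions as before (the work.heart dataset and the same list of interval variables, repeated here so the step runs on its own). It prints the Pearson correlation of each candidate predictor with Cholesterol.

```sas
%let interval = Height Weight Diastolic Systolic Smoking;

/* Pearson correlations of each interval variable with cholesterol.  */
/* NOSIMPLE suppresses the table of simple descriptive statistics.   */
proc corr data=work.heart nosimple;
  var &interval;
  with cholesterol;
run;
```

Each cell shows the correlation coefficient, its p-value, and the number of observations used, which can differ between variables when some predictors have missing values.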