4 - Graphically Exploring Relationships or Associations in SAS Before Model Building
Identifying Relationships Between Variables Before Building a Model

Before building a model, we should examine the relationships between the response and the candidate predictor variables graphically. These plots help us choose suitable predictors and guide the model-building process.

Visually Exploring Associations Between Variables in the Context of ANOVA with Box Plots

In ANOVA the response is continuous and the predictor is categorical. We create box plots to look for potential associations, with the response variable on the y-axis and the predictor variable on the x-axis.

Imagine that we want to build a model explaining the main factors affecting cholesterol levels. Let's start by examining the HEART data from the SASHELP library.

We use PROC CONTENTS to get a general idea of the data.

proc contents data=sashelp.heart varnum;
run;        

In the first table we see that our dataset has 5209 observations and 17 variables.


First table of the PROC CONTENTS result.

In the third and last table we see the list of variables in the HEART data: each variable's name, type, length, and label, if one exists.


The VARNUM option lists the variables in the order in which they occur in the dataset instead of alphabetically.

In this example, let's say we are curious about what affects cholesterol, so variable number 13, Cholesterol, will be our numeric response variable. In the ANOVA setting we should think about the categorical variables that could have an effect on it. Let's choose the SEX variable, number 4 in the list, and examine the cholesterol measurements across its categories.

Now let's apply another procedure, similar to PROC CONTENTS, that prints the variable list to the log so that we can easily copy and paste names while writing code.

proc sql number;
 describe table sashelp.heart;
 select * from sashelp.heart(obs=10);
quit;        

This code writes the list of variables to the log and prints the first 10 rows of the dataset, which gives us an idea of what is in the sashelp.heart dataset.

With the DESCRIBE TABLE statement in PROC SQL we get the name, type, and label of each variable, ready to copy into our code.


First 10 rows of the HEART dataset in the SASHELP library. A quick way to see what the data looks like.
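As an alternative to copying names from the log, the variable names can be collected by type into macro variables. The following is a minimal sketch that queries the DICTIONARY.COLUMNS view in PROC SQL; the macro variable names char_vars and num_vars are illustrative and do not come from the original workflow.

```sas
/* Sketch: collect variable names by type from DICTIONARY.COLUMNS.   */
/* char_vars and num_vars are illustrative macro variable names.     */
proc sql noprint;
    /* Character variables are the candidates for categorical predictors */
    select name into :char_vars separated by ' '
    from dictionary.columns
    where libname = 'SASHELP' and memname = 'HEART' and type = 'char';

    /* Numeric variables are the candidates for continuous predictors */
    select name into :num_vars separated by ' '
    from dictionary.columns
    where libname = 'SASHELP' and memname = 'HEART' and type = 'num';
quit;

/* Write both lists to the log */
%put Character variables: &char_vars;
%put Numeric variables: &num_vars;
```

Note that LIBNAME and MEMNAME values are stored in uppercase in this view, so the comparison strings are written in uppercase as well.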

Next, let's make sure that there are no missing values in the Cholesterol variable and look at the distinct values of the SEX variable.

First, we filter out the rows with a missing Cholesterol value and save the result as our own dataset in the temporary WORK library. This way we can do any kind of data cleaning and manipulation without changing the original dataset, which is a good habit.
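Before filtering, it can help to see how many values are actually missing. One possible way, sketched here with PROC MEANS and its N and NMISS statistics, is:

```sas
/* Sketch: count non-missing (N) and missing (NMISS) values of Cholesterol */
proc means data=sashelp.heart n nmiss;
    var cholesterol;
run;
```

The N column gives the number of rows that will survive the filter, and NMISS gives the number that will be dropped.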

proc sql;
create table work.heart as
select * from sashelp.heart
where cholesterol is not null;
quit;        


Creating new heart dataset in the work library from the filtered sashelp.heart dataset.

Now we have 5057 observations with no missing values in the Cholesterol variable. Next, let's see how many distinct values the SEX variable has and check whether any of them are missing. From now on we will use the heart dataset that we created in the WORK library.

proc sql;
select distinct SEX
from work.heart;
quit;        

We have no missing entries and two categories in the SEX variable. Now we are ready to graphically explore relationships between variables.

Two categories with no missing values.
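SELECT DISTINCT shows which categories exist but not how many observations fall into each. A minimal sketch of the same check with PROC FREQ, where the MISSING option treats missing values as a category of their own so that they would appear in the table, is:

```sas
/* Sketch: frequency of each SEX category, with missing values shown as a level */
proc freq data=work.heart;
    tables sex / missing;
run;
```

Knowing the group sizes is useful when reading the box plots, since a very small group can produce a misleading box.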

We will use PROC SGPLOT to achieve our goal.

proc sgplot data=work.heart;
	vbox cholesterol / category=sex connect=mean;
	title "Cholesterol Differences across Sex";
run;        

Our graph shows a slight difference in cholesterol between the two sexes (a two-sided t-test also shows that there is a difference). We connect the group means with a line: the steeper the slope of that line, the stronger the suggested association and the stronger the case for adding the variable to the model.

A slight difference exists between the cholesterol levels of the two sexes.
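The two-sided t-test mentioned above is not shown in the original code. A minimal sketch of how it could be run with PROC TTEST, assuming the work.heart dataset created earlier, is:

```sas
/* Sketch: two-sample t-test of Cholesterol between the two SEX groups */
proc ttest data=work.heart;
    class sex;        /* grouping variable with exactly two levels */
    var cholesterol;  /* continuous response */
run;
```

The output reports the pooled and Satterthwaite versions of the test together with an equality-of-variances test, which helps decide which of the two versions to read.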

We can produce the box plots for several categorical variables at once with a macro, so that we do not miss anything. We choose the appropriate categorical variables from the variable list and look for relationships in the resulting graphs.

%let categorical= Smoking_Status Weight_Status BP_Status;


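%macro box(dsn      = ,
           response = ,
           charvar  = );

    /* Keep the loop counter and the current variable name local to the macro */
    %local i var;
    %let i = 1;

    /* Walk through the space-delimited list until no word is left */
    %do %while(%scan(&charvar,&i,%str( )) ^= %str());

        %let var = %scan(&charvar,&i,%str( ));

        /* One box plot per categorical variable, with the group means connected */
        proc sgplot data=&dsn;
            vbox &response / category=&var
                             grouporder=ascending
                             connect=mean;
            title "&response across Levels of &var";
        run;

        %let i = %eval(&i + 1);

    %end;

%mend box;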

%box(dsn      = work.heart,
     response = cholesterol,
     charvar  = &categorical);

title;        

We get three graphs because we listed three variables in the categorical macro variable. Let's look at one of them: the relationship between blood pressure status and cholesterol. The difference between groups is more obvious here.


High blood pressure is associated with higher cholesterol levels.

These graphs give us ideas about which variables to add to our model, but we must not depend on them alone. Other effects are possible, so we will also use correlation tables.

Visually Exploring Associations Between Variables in the Context of Linear Regression with Scatter Plots

When we want to investigate the relationship between a continuous response variable and one or more continuous predictor variables, we can use scatter plots.

From the list of variables we can choose the numeric variables as potential predictors. Let's do that:

%let interval= Height Weight Diastolic Systolic Smoking ;

options nolabel;
proc sgscatter data=work.heart;
    plot cholesterol*(&interval) / reg;
    title "Associations of Interval Variables with Cholesterol";
run;        

The result suggests a positive correlation between cholesterol and both diastolic and systolic blood pressure.

Graphical associations of continuous variables in the context of Linear Regression.
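The visual impression from the scatter plots can be checked numerically. A minimal sketch with PROC CORR, using the same five numeric predictors and assuming the work.heart dataset created earlier, is:

```sas
/* Sketch: Pearson correlations of each candidate predictor with Cholesterol */
proc corr data=work.heart;
    var Height Weight Diastolic Systolic Smoking;  /* candidate predictors */
    with cholesterol;                              /* response variable    */
run;
```

The WITH statement restricts the output to one row for cholesterol instead of the full correlation matrix. By default PROC CORR uses all available pairs of values, so the number of observations can differ from one predictor to another.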

In conclusion, we must not use these plots alone to decide which variables to include in our model. They show only the simple relationship between one predictor variable and the response variable. Once we put multiple variables in the model, the picture of associations can become very different, so we must also take other things into consideration, such as the correlations between variables. Still, this is a good start, and after this analysis we may even decide to examine another variable as the response.

More articles by Gökhan Yazgan

  • 10 - Multiple Regression in SAS with PROC REG, PROC GLM and PROC PLM
  • 9 - Two-Way ANOVA Using PROC GLM and Interactions
  • 8 - Performing Simple Linear Regression Using PROC REG in SAS
  • 7 - Pearson Correlation in SAS with PROC CORR
  • 6 - ANOVA Post Hoc Tests
  • 5 - One-Way Anova in SAS
  • 3 - One Sample t-test vs. Two Sample t Test with SAS
  • 2 - Statistical Hyphothesis Test
  • 1 - Overview of Statistical Modelling