How to predict hospital-acquired pressure injuries
Syed Imad Husain
Hospital-acquired pressure injuries (HPIs) are on the rise in the US. This report details statistical modeling techniques to predict whether a patient will acquire an HPI, along with its severity. We discuss the complete analytical framework to implement Logistic Regression (predicting HPI) and Multinomial Regression (predicting severity). We also discuss the intricacies of the data that would be required and the collection process. A comparative study of the suggested model against the Braden Scale in terms of predictive power could serve as an indicator of the suggested model's reliability and usability.
Problem Statement
“How would you go about predicting whether a patient would acquire a pressure injury (aka pressure ulcer, bed sore) during their hospital stay? How would you predict the severity of the injury? What data would you utilize and what techniques would you apply?”
This problem statement can be broken down into four specific parts –
- How would you predict a bed sore during a hospital stay?
- How would you predict the severity of the bed sore?
- What data would be required?
- What techniques would you apply?
Background
Hospital-acquired bed sores or pressure injuries (HPIs) are on the rise. Around 1.2 million cases of HPIs occurred in 2015 [1]. However, most of the time HPIs can be avoided through proper medical care and attention. The most widely accepted evidence-based tool to tackle this problem is the Braden Scale [2]. The scale allocates equal weights to sensory perception, moisture, activity, mobility, nutrition, and friction & shear. Based on these factors, it calculates a score which is classified into levels of HPI risk. However, it has been observed that the scale has poor predictive power. This article revolves around developing a methodology based on statistics and applied data mining techniques to achieve a purpose similar to the Braden Scale's.
Important Factors
Based on my research, I have shortlisted the potential risk factors for HPIs that will be used in our analysis -
Data set Considerations
The data set comprises the variables mentioned in the previous section. Since an HPI cannot develop within a day and may take 3 to 6 days to appear, the following considerations were made -
- Grain of data is Patient & Week
- Records can be differentiated from any other record based on unique combinations of Patient and Week
- All data is gathered from similar setups to nullify the effect of medical service & infrastructure
- Data gathering starts as soon as a patient is admitted
- Data must be gathered by experienced nurses since classifying factors into classes is subjective
- Data gathering stops when the patient leaves the hospital or contracts an HPI
- Records are filtered out if
- The stay is less than a week
- The patient already suffers from PI
- The grain of the data is at the Patient & Week level, hence we may not be able to feed it directly to the model. I thought of 3 ways to deal with this problem -
- Average out records by Patients such that each patient has 1 row
- Remove Patient ID from the data and consider all observations
- Consider the last observation for each patient. I chose this method because the response is binary and cannot be averaged, and because considering all observations may introduce auto-correlation, since the data is gathered over time and is not strictly cross-sectional
- Additionally, third-party vendors like LexisNexis & Experian Health can be used for demographic data enrichment
- All categorical variables will require dummy encoding, as sketched below
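As a rough sketch of the last-observation rule and the dummy encoding in R (base R only; `hpi_weekly`, `PatientID`, `Week`, and `HPI` are hypothetical names, since the actual schema is not fixed here):

```r
# hpi_weekly: one row per patient-week with PatientID, Week, the candidate
# predictors, and the binary HPI flag (all column names are placeholders)
hpi_weekly <- hpi_weekly[order(hpi_weekly$PatientID, hpi_weekly$Week), ]

# Keep only the last recorded week for each patient
hpi_final <- hpi_weekly[!duplicated(hpi_weekly$PatientID, fromLast = TRUE), ]

# Dummy-encode categorical predictors: model.matrix() expands each factor into
# indicator columns (the intercept column is dropped)
X <- model.matrix(HPI ~ . - PatientID - Week, data = hpi_final)[, -1]
y <- hpi_final$HPI
```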
Analysis Techniques
Predicting HPI – Logistic Regression
The simplest way to predict HPI is Logistic Regression, with the hypothesis
g(E[Y|X]) = Xβ
- g() = a monotonic, differentiable link function such as the logit, probit, complementary log-log (cloglog), etc.
- E[Y|X] = Conditional Expectation of the response Y given the predictors X
- Y = HPI
- X = Predictor Matrix, set of all predictors i.e. variables 1 through 13
- β = Coefficient vector
Before we jump into model building, the level of significance for hypothesis testing must be defined; we assume α = 0.05. The steps in model building are –
Exploratory Data Analysis
I would create a scatter-plot matrix of all variables to study pairwise correlations and develop intuitions about the data. In this step, we also tackle the problem of multicollinearity: if there is high correlation between two predictors, we only keep the one with the higher univariate statistical significance.
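A minimal sketch of this step in R, reusing the hypothetical `hpi_final` data frame from the data-preparation sketch:

```r
# Scatter-plot matrix of the numeric columns of the prepared data set
num_cols <- sapply(hpi_final, is.numeric)
pairs(hpi_final[, num_cols])

# Pairwise correlations; for pairs with, say, |r| > 0.8 keep only the predictor
# with the stronger univariate significance (the 0.8 threshold is a judgment call)
round(cor(hpi_final[, num_cols], use = "pairwise.complete.obs"), 2)
```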
Data set Split
The data set would be split into 80% training vs. 20% testing. All model-building activities will be performed only on the training set, and the testing set will be used for validation. Since we have demographic information, we may perform stratified sampling.
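One base-R way to do this, stratifying on the HPI response so that both sets keep the same event rate (the same pattern applies to demographic strata); `hpi_final` is the hypothetical data frame from the earlier sketch:

```r
set.seed(42)
# Draw 80% of the row indices within each HPI stratum
train_idx <- unlist(lapply(split(seq_len(nrow(hpi_final)), hpi_final$HPI),
                           function(i) sample(i, size = floor(0.8 * length(i)))))
train <- hpi_final[train_idx, ]
test  <- hpi_final[-train_idx, ]
```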
Variable Selection
There are multiple ways of selecting variables. Generally, in this step we build multiple models to compare in the subsequent steps. Some commonly used variable-selection techniques are –
- Univariate significance model – a combined model based on the individually significant variables. Each variable is considered one at a time, and its statistical significance is determined using the p-value rule (p < α)
- Step-wise selection – this uses the step algorithm with direction 'forward', 'backward', or 'both'. One variable at a time is added to (or dropped from) the model, and the resulting fit is compared to the previous model. Any model-comparison criterion can be used, such as AIC, BIC, etc.; the final model is the one with the best value of that criterion
- Best subset – the previous method is not exhaustive and hence does not guarantee the best possible model. The best-subset algorithm instead performs an exhaustive search over all subsets of the predictors (2^n candidate models, where n is the number of predictors). However, this is computationally very expensive
- Shrinkage methods – another option is Lasso or Ridge regression, which add a regularization penalty to the model's cost function. These methods are preferred when the number of parameters exceeds the number of observations; the Lasso in particular yields a sparse model
In our case, I would build a model with each technique that is computationally feasible and efficient, as sketched below.
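A sketch of two of these techniques in R (step-wise selection with step(), and the Lasso via the glmnet package), continuing with the hypothetical `train` set from the split above:

```r
# Drop the identifier columns before modelling
train_mod <- train[, setdiff(names(train), c("PatientID", "Week"))]

# Step-wise selection: start from the intercept-only model and search in both
# directions, scored by AIC (pass k = log(nrow(train_mod)) to step() for BIC)
null_mod <- glm(HPI ~ 1, family = binomial(link = "logit"), data = train_mod)
full_mod <- glm(HPI ~ ., family = binomial(link = "logit"), data = train_mod)
step_mod <- step(null_mod, scope = formula(full_mod), direction = "both")

# Lasso: L1-penalised logistic regression with a cross-validated penalty,
# which shrinks weak coefficients exactly to zero and yields a sparse model
library(glmnet)
x_train   <- model.matrix(HPI ~ ., data = train_mod)[, -1]
lasso_mod <- cv.glmnet(x_train, train_mod$HPI, family = "binomial", alpha = 1)
```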
Model Comparison
The final models from the previous step will be compared to select the best one. There are many model-comparison criteria, such as AIC, BIC, adjusted R², out-of-sample MSE, etc. In our case, I will calculate the out-of-sample misclassification rate (total misclassifications / total observations on the testing data set) and select the model with the lowest value. The ROC curve and AUC can also be used as comparison criteria.
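Continuing the sketch, using the `step_mod` fit and the held-out `test` set from earlier (HPI is assumed to be coded 0/1, and pROC is just one of several packages that compute AUC):

```r
test_mod <- test[, setdiff(names(test), c("PatientID", "Week"))]

# Out-of-sample predicted probabilities and misclassification rate at a
# provisional 0.5 cutoff (the cutoff is refined in the next section)
p_test <- predict(step_mod, newdata = test_mod, type = "response")
mean(as.numeric(p_test > 0.5) != test_mod$HPI)

# ROC curve and AUC as a cutoff-free comparison criterion
library(pROC)
auc(roc(test_mod$HPI, p_test))
```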
Asymmetric Cost & P-Cutoff (Grid Search Method)
The output of logistic regression is the set of fitted coefficients, which give the predicted probability Pi = [1 + e^(−Xβ)]^(−1)
To make predictions, we must calculate the P-Cutoff such that when
- Pi <= P Cutoff then Y = 0
- Pi > P Cutoff then Y = 1
Since the cost of misclassifying a positive as a negative is far greater than that of classifying a negative as a positive, we have to use an asymmetric cost function which penalizes the former more heavily than the latter. Using this cost function, we can determine the P-cutoff as follows (a code sketch appears after the list):
- Define a sequence of Probability values. For example, 100 values between 0 and 1 with steps of 0.01
- Calculate asymmetric cost with each P value
- Visualize the association between the sequence of P values (x axis) and the cost (y axis), also called an elbow plot
- Choose the P value with minimum cost
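A sketch of this grid search in R, using the training-set probabilities from the hypothetical `step_mod` fit above; the 10:1 cost ratio for false negatives versus false positives is an assumption that would come from clinical input:

```r
# Asymmetric cost: a missed HPI (false negative) costs w_fn, a false alarm w_fp;
# the 10:1 default ratio is an assumption, not a clinically validated value
asym_cost <- function(y, p, cutoff, w_fn = 10, w_fp = 1) {
  pred <- as.numeric(p > cutoff)
  mean(w_fn * (y == 1 & pred == 0) + w_fp * (y == 0 & pred == 1))
}

p_train <- predict(step_mod, type = "response")   # fitted probabilities
cutoffs <- seq(0.01, 0.99, by = 0.01)
costs   <- sapply(cutoffs, function(ct) asym_cost(train_mod$HPI, p_train, ct))

plot(cutoffs, costs, type = "l")        # the elbow plot described above
p_cutoff <- cutoffs[which.min(costs)]   # cutoff with the minimum cost
```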
Model Prediction
Once we determine the cutoff probability, we can make predictions using the logistic regression equation described in the previous section.
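For completeness, a short sketch of the prediction step, continuing with the objects defined above:

```r
# Predicted probabilities for held-out patients, converted to 0/1 HPI calls
p_new    <- predict(step_mod, newdata = test_mod, type = "response")
hpi_pred <- as.numeric(p_new > p_cutoff)
```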
Predicting Severity – Multinomial Regression
All the steps performed above were for the binary response HPI. To predict Severity, which is a multinomial response, we can use the following methods –
Multinomial Logistic Regression with Ordinal Response
Assume that a latent variable Z ~ Logistic(μ = −βX, scale = 1) and cut-points c1 < c2 < c3 are defined such that
- Z < c1, then Y = 0
- c1 ≤ Z < c2, then Y = 1
- c2 ≤ Z < c3, then Y = 2
- Z ≥ c3, then Y = 3
Then, P{Y ≤ 0} = P{Z < c1} = P{Logistic(μ, 1) < c1} = P{Logistic(0, 1) < c1 − μ} = F(c1 + βX), where
- P{} represents probability
- F is the inverse-logit function, i.e. the CDF of the standard logistic distribution
- ci is the model intercept (cut-point) for the ith class of the response variable
Similarly, probabilities for all classes can be calculated. This type of analysis can be performed in R using the VGAM::vglm() function. The example above demonstrates the logit link; other link functions can be employed in a similar fashion.
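A minimal sketch with VGAM::vglm(), assuming hypothetical data frames `train_sev` / `test_sev` in which `Severity` is an ordered factor with levels 0 < 1 < 2 < 3:

```r
library(VGAM)

# Cumulative-link (proportional-odds) model; cumulative() defaults to the logit
# link, and parallel = TRUE imposes a common slope across the class boundaries
ord_mod <- vglm(Severity ~ ., family = cumulative(parallel = TRUE),
                data = train_sev)
summary(ord_mod)

# Predicted probability of each severity class for new patients
predict(ord_mod, newdata = test_sev, type = "response")
```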
Multinomial Logistic Regression with Nominal Response
Although we are aware of the ordinality in the levels of our response (Severity), another way to model it is to assume the response follows a multinomial distribution: P{Y = i} = Pi, where i is between 0 and 3. By definition, Σ(Pi) = 1. We choose one class as the baseline and interpret the log-odds of all other classes as
Log(Pi/P0) = αi + βiX where
- i varies between 1 & 3
- P0 represents baseline probability
- αi represents intercept for ith class
- βi represents the coefficient vector for the ith class
This type of analysis can be performed in R using the nnet::multinom() function, which fits the baseline-category logit model described above.
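A matching sketch with nnet::multinom(), reusing the same hypothetical `train_sev` / `test_sev` data frames; the first level of `Severity` is taken as the baseline class:

```r
library(nnet)

# Baseline-category multinomial logit model
nom_mod <- multinom(Severity ~ ., data = train_sev)
summary(nom_mod)

# P{Y = i} for each severity class on the held-out set
predict(nom_mod, newdata = test_sev, type = "probs")
```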
These are the approaches for modeling the multinomial response Severity. All other steps remain the same as in the case of binary response HPI.
Alternative Approaches
Apart from the simple approaches elaborated above, there are a few more methods which can be used interchangeably for modeling HPI & Severity. For instance,
- K-Means Clustering
- Hierarchical Clustering
- Classification Tree
- Random Forest
- Probabilistic Structural Equation Model – Bayesian Belief Network
However, these methods may result in a loss of model interpretability.
Note – although these are well-established alternative approaches, I have deliberately not elaborated on them given the scope of this article.
Appendix
- [1] https://www.americannursetoday.com/wp-content/uploads/2018/05/DabirSupplement_May2018.pdf
- [2] https://en.wikipedia.org/wiki/Braden_Scale_for_Predicting_Pressure_Ulcer_Risk