Healthcare Analysis in MySQL
Andrew Lujan
Data Scientist & Business Intelligence Analyst | I help teams make smarter decisions through data storytelling
Introduction
Hello everyone! As someone who has started to become more interested in using data science for social good (I currently work as a data analyst solving early childhood problems) I wanted to explore some of the questions that can be answered in the Healthcare Industry. To do this, I'm going to use MySQL and some queries that will explore some of the operational challenges for hospitals.
I'll explore demographic information, but I'll also have a look at the typical length of stay on a patient level. Length of stay is particularly important because when combined with other factors like the nature of the visit or the disposition(outcome) of the visit it provides insights on a hospital's level of efficiency and from another perspective insurances companies allocate a set amount of money per patient by type of care. If a patient overstays their projected time it could be detrimental to hospital costs and insurance claims.
As the American healthcare system continues to move towards a value-based care model, I'm going to look at the dispositions along with lab procedures, and location of visit to determine whether or not patients are going through the right channels of care. A large issue in the American healthcare system related to costs is patients going into the ER for services that wouldn't require. This stems from a lack of education on the healthcare system and is what is prompting a shift to a value-based care model.
Data
The dataset originates from the UCI Machine Learning Repository and can be downloaded there or on Kaggle.
The dataset covers 10 years of clinical data from over 130 US hospitals. It has features related to patient and hospital outcomes. The columns that I'll be honing in on are:
During the course of the project I want to answer the questions below:
Fast Insights:
Deep Dive Insights (Process)
Demographics
The deep dive into the insights above will revolve around the hospital's efficiency in operations. The length of stay is an important indicator of a hospital's efficiency and a look at outcomes on a patient level as well. Earlier we said that the average length of stay for this data set was 4.4 days. To view this, I segmented the data by race and used a query grab the average length of stay by race.
According to the table above there weren't significant differences in length of stay by race. Additionally, the amount of medication administered and the average amount of lab procedures administered by race were roughly around the same. To add a little more granularity, Asian patients seemed to have the least amount of lab procedures, had the shortest stays, and took the least amount of medication indicating that this demographic has higher baseline for health. This could be to a better understanding of the healthcare system, a healthier lifestyle, and just better health in general.
Looking more critically at patient stays as a whole, I decided to use MySQL to visualize some data in the form of a histogram. Now, I know what you're saying MySQL isn't really a data visualization software, but we are going to use the RPAD function along with the '*' symbol to create a bar for our histograms and bucket them based on the number of days a patient stays.
Based on the histogram above we can see that most patients get out of the hospital faster than the national average of 5.5 days. Focusing more on the data within the dataset, the majority of patients stay less than 7 days. What we can gain from these insights is that overall the hospital is operating relatively efficiently.
Since the average profile of stay is on the shorter end, this could be a sign for hospital leadership to focus on the longer admissions and understanding the nature of these visits. Which is exactly what we'll look at next. I want to have a look at the type of stays and the effect those paired with number of procedures has on the overall stay time.
Length of Stay by Procedure
After looking at some demographic-level information related to the length of stay I wanted to have a look at the general length of stay by medical specialty and the counts as well for each stay.
领英推荐
Clearly there seem to be some issues with the way this data was collected and aggregated first of all. There several procedures where there is only a count of 1. I want to know on a macro level the amount of procedures and the average length of stay, so I need to filter the data to include medical specialty procedures with a count of more than 50 and where the average number of procedures we had is more than 1 as well. I decided to set that additional parameter at 2.5 average procedures by medical specialty. I did this using a HAVING clause on the aggregated information. I also ordered this information by the most most average procedures and limited it to the top 20 procedures if necessary.
These parameters narrowed our findings down to surgeries and radiology labs. The surgeries were typically heart-related surgers and these tended to have longer visits. Thoracic surgeries are surgeries that tend to deal with the chest and as you can see cardiovascular or heart related surgeries fell into this category as well.
Again, from a different view we can see the interaction between the number of procedures and the length of stay in the hospital. However, this correlation doesn't mean causation. We know that there is a correlation between the number of labs and the length of stay, so drilling down on this interaction could be useful and a point of analysis in a visualization for future projects.
Length of Stay by Channel of Care
I also wanted to have a look at the length of stay by the type of facility where clients checked in at. To do this I had to reformat the facility to codes into categories using a CASE WHEN statement.
Having a look at this information we can see that the longer stays tended to happen with people who came in to a trauma center or urgent care. There was a NULL field from which I wasn't able to decipher from the data dictionary. However, it seems like emergency procedures whether from trauma or something else resulted in longer stays.
Number of Procedures (categories) by Length of Stay
Next, let's summarize the length of stay by creating categories based on the number of procedures. We know that number of procedures typically has a positive correlation with the length of stay, but let's group this information based on the number of procedures and look at the average length of stay for those.
Using a CASE WHEN statement I was able to summarize the data based on the lab procedures meeting certain conditions. I chose the category "few" to represent when a case had at least zero and less than 25 cases, "average" to represent when a client had at least 25 procedures and less than 55 procedures, and finally used "many" to categorize instances when someone had more than 55 procedures. From here we could see that as the number of procedures increased the length of stay in the hospital increased.
Outcomes
Finally, I wanted to have a look at the readmission rates for patients. To do this, I wanted to compile a list of the hospital success stories where patients were discharged faster than the average amount of time spent in the hospital 4.4 days. I utilized a CTE along with a subquery as a filter to grab this information.
The filter on admission_type_id is for people who returned to the hospital via the ER. This means that those who did return were able to be discharged relatively quickly. A future route for hospital administrators is to dive deeper into these length entries and determine the dispositions/outcomes for the clients. Did the client use the ER properly? Should they have used a different channel of care like the urgent care? These are questions and areas that can improve client health and drive down the cost of care as well because clients are going through the right channels.
Client Profile
Finally, I wanted to use the CONCAT function, along with some case statements to build a client profile based on their demographic information, their patient id, the medication they've taken, and the number of lab procedures they've received.
Now we have a clean way of grabbing client information, their readmission status, their race, the number of medications they've received, and the number of lab procedures as well.
Recommendations
Next Steps
Exploration with visualization in Tableau will be part 2 of this project. I'll use the client list that was created to have a look at some of the interactions between a variety of fields in Tableau.
Again, I'd like to say thank you for everyone who read this. If there are any healthcare analytic professionals in my network I'd really appreciate any feedback for future projects. Additionally, if you haven't take a look at my portfolio website which includes projects in Python, SQL, and I'll be migrating this along with the Excel Analysis of Door Dash to that website.
Pharmacy Analytics Consultant @ Optum | Leading insights on drug utilization, cost, and pricing strategy
2 年Really insightful analysis, Andrew Lujan! Your looking at length of stay alongside readmission rates stood out for me because I haven't spent much time thinking about those two together and there could be a lot to build off there! Like, I wonder if there's potential to leverage what you've built for some realtime insights during care? Like "if this patient is discharged right now, lookalike patients (with similar diagnoses, procedures, LoS, etc.) are readmitted at [x] rate for [y] reasons." Which, elevated at the right time, then allows care teams to evaluate those concerns and adjust care plans accordingly, if it makes sense to. Hmmm.
Data Analytics | Program Evaluation | Conscious Parenting and Life Coach
2 年Thanks for posting. It inspires me to want to do more projects but I’m still spending most of my time in learning French right now.