Analysis of Hospital Patient Data Using SQL
Christy Ehlert-Mackie
Data Analyst | Bridging Business and Technical Sides to Power Data-Driven Decisions | MSBA, MBA | Excel, SQL, Power BI, Tableau | Background in Accounting and Finance
Nearly everyone has been in a hospital at some point in their life, whether as an inpatient, to have an outpatient procedure done, or to visit someone. Hospitals are an essential part of the healthcare system, with the goals of providing high quality care and improving patient health while also keeping costs down.
According to Statista, some facts about hospitals in the US:
In this project, I imagined that I was just hired as a healthcare data analyst at a hospital. For my first assignment on the job, my boss gave me a list of questions to answer by running a series of SQL queries on their dataset.
Questions
Dataset
The data originally came from this dataset on Kaggle of hospitalization information for diabetic patients. The dataset was previously split into a demographics table and a health table for use in this project. More information about the dataset can be found on the UCI Machine Learning Repository website.
Analysis
To begin my assignment, I loaded the demographics and health tables into a MySQL database. These tables have the patient_nbr field in common which was used in some of the queries to pull information from both tables.
My first task was to provide a distribution of the length of patient stays. The hospital wants to know if the majority stay less than 7 days. The number of beds is a constraint in the hospital as a patient can only be admitted if there is a bed available for them. Hospitals want to turn over beds as quickly as possible to keep the length of stay down as well as the associated costs.
A histogram is one of the best ways to show a distribution of data. However, SQL is not a visualization tool so a workaround is to use the RPAD function. In this case, I had the function display a “|” character to symbolize the count for each bucket in terms of days.
The results show that the length of stay peaks at 3 days with 79% of hospitalizations lasting less than 7 days. The counts steadily drop as the number of days increases with just 8% staying 10 or more days.
The hospital director wants a list of the medical specialties with an average number of procedures greater than 2.5 that also have more than 50 patients. I used the AVG aggregation function to get the averages and then HAVING to filter for the criteria given. The results were grouped by medical specialty and sorted in descending order.
There are 5 medical specialties with more than 2.5 average procedures and at least 50 patients. Surgery specialties are 3 of them with the remaining two being Radiologist and Cardiology.
Next, the nurse director needed to know if patients are subconsciously being treated differently based on race as measured by the average number of lab procedures. Since race and the number of lab procedures are in different tables, a JOIN was done to connect the data using the patient_nbr field which is in both tables. The averages were grouped by race and displayed in descending order.
The query results show the average ranges from a high of 44.08 for African Americans to a low of 40.86 ?for Asians with the remaining races in between. This is not a large difference so it appears that patients of different races are not being treated differently. For more conclusive analysis, statistical testing could be done to determine if these differences in averages are statistically significant.
Approximately 2% of the patients did not have a race identified. Possible causes could be patients who identify as more than one race, patients who refused to answer the question, or errors in recording the data.
I was also asked to explore the relationship between the number of lab procedures and time spent in the hospital. The number of lab procedures were to be grouped by few (less than 25), average (25 to 54), and many (55 and greater). In order to get the groupings, I used CASE WHEN in SQL to look at the number of lab procedures for each patient and then assign it as either “few”, “average”, or “many”. Then for each group, I calculated the average time in the hospital.
领英推荐
Patients with many lab procedures spent 5.7 days in the hospital while those with average and few lab procedures spent 4.1 and 3.3 days, respectively. This pattern is expected as it is intuitive that the longer a patient is in the hospital, the more blood draws and other lab tests would be done. The patients with longer stays likely have more complicated conditions that need more monitoring.
The Research department requested a list of all patient numbers of African Americans or who have an “Up” to metformin. Metformin is a drug that controls the amount of glucose in the blood and increases the body’s response to insulin. Since race and metformin are in separate tables, I used UNION which combines the results from the SELECT statement for each table.
The query results show there are 13,735 African American patients who have an “Up” to metformin. A partial list is below.
My next task was to provide a list of all patients who had an emergency (admission_type_id = 1) but left the hospital faster than average. I used both a subquery and CTE (Common Table Expression) to get this information. I used the CTE to assign a variable name of avg_time to the subquery that calculates the average. This variable name is then used in the SELECT statement to choose the patients who had an emergency and also had a length of stay less than the average.
Below is a partial list of the query results. The hospital administrator is going to use this list to highlight some of the hospital’s success stories.
My final task was to write a summary report for the top 50 medication patients, breaking any ties with the number of lab procedures. The report needed to be in the below format. Bold indicates that it is a value pulled from the data while the rest of the text is constant.
"Patient 2383 was Caucasian and was readmitted. They had 21 medications and 32 lab procedures."
The query to write the report is complex. I used CONCAT_WS to concatenate the values together with a separator of a single space in between. CASE WHEN was used on the readmitted field to change the text depending on whether or not the patient was readmitted to the hospital. An INNER JOIN was needed to combine the data from the health and demographics tables. The results were then ordered by number of medications and then lab procedures in descending order and limited to 50 rows.
Insights
Analysis of the hospital’s data showed:
In addition, I provided:
Closing
Thank you for reading my article! This was a fun project as I am interested in the healthcare industry and I also learned some SQL statements that were new to me. I would love to hear any feedback you have. Leave a comment below or connect with me. You can also check out my data analysis project portfolio website here.
Data Analyst ?Excel ?Tableau ?SQL ? Data ? Visualization? Problem solving ? Troubleshooting
1 年Good use of bolding
Data Analyst | Project Management | Business Administration | Let's Talk Tech...
1 年Your visuals look great Christy! Job well done??.
learn ,work ,and pray
1 年Wonderfull
Healthcare Financial and Revenue Cycle Analytics Leader
1 年Ah, cool. I like these hospital projects! I had to click on this to see what you were doing with RPAD. Great job!
Great, thanks for sharing