Health is Wealth!
Do you remember your last stay at the hospital? Was it pleasant? Did you wait a long time to get attention? We all have good and bad experiences when it comes to getting medical help. While most visits are positives, not everyone has the same “luck”. But luck shouldn’t be an important factor when visiting the hospital.
In this study, we'll analyze real hospital data using SQL. The goal is to ask questions and find insights that could help patients and medical professionals.
The Datasets
For this case, we will use 2 datasets provided by the University of California – Irvine Machine Learning Repository. Our data represents ten years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. Each row concerns hospital records of patients diagnosed with diabetes, who underwent laboratory, medications, and stayed up to 14 days.
Data overview:
We’ll be using two datasets:
health.csv: Consists of 46 columns and 101766 rows.
Some of the most important and useful values for our study are:
patient_nbr: Unique identifier of a patient.
admission_type_id: Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available.
time_in_hospital: Integer number of days between admission and discharge.
medical_specialty: Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon.
num_procedures: Number of procedures (other than lab tests) performed during the encounter.
num_lab_procedures: Number of lab tests performed during the encounter.
num_medications: Number of distinct generic names administered during the encounter.
demographics: Consists of 7 columns and 71518 rows.
The values we will use from this dataset are the following:
patient_nbr: Unique identifier of a patient.
race: Values: Caucasian, Asian, African American, Hispanic, and other.
gender: Values: male, female, and unknown/invalid.
age: Grouped in 10-year intervals: [0, 10), [10, 20),..., [90, 100).
weight: Weight in pounds.
Questions:
?Once we’ve taken a look at the data set and it’s tables, we can ask different questions in order to explore and find helpful insights:
?1.??? What’s the distribution of time spent in the hospital?
2.??? What are the busiest medical specialties and their average number of procedures?
3.??? Is there a race bias in the number of lab procedures?
4.??? Is there a correlation between the number of lab procedures and the number of days in the hospital?
5.??? How many emergency patients spend less than the average time in the hospital?
1. Time spent in the hospital
In order to understand how long patients spend at the hospital, we will have to categorize and count the cases by days. For this, we can group the data by time in spent in the hospital. We’ll call this value “bucket”. Next, we’ll count how many cases exist in each bucket. Finally, we can illustrate this with the RPAD function, using asterisk and diving it by 100 for better visualization.
We can also create two categories: More than 7 Days and Less than 7 Days. For this, we will use a counter, a CASE function and group by our categories. This will allow us to see the exact number of patients that spent less than 7 days in the hospital and the ones who didn't.
领英推荐
2. Average number of procedures per medical specialties.
First, let’s take a look at the different medical specialties that live in this data set:
Now that we see the 73 different medical specialties, we can calculate the average number of procedures for each one. We will also create a count and filter out the specialties with less than 50 patients. This will help us have a better understanding of the busiest specialties.
3. Race bias check
In order to check if there's a race bias in the number of lab procedures, we need to take a look at both tables, health and demographics. The first one contains data on the number of procedures and the second one on the patient’s race. To do this, we’ll need to join both tables, calculate the average number of lab procedures and group them by race. Let’s join the tables with this query:
Now that we have all the data in one place, we can find what the average of lab procedures from each group of patients and group the results by race:
4. Number of lab procedures and days in the hospital correlation
Before we look for a correlation between those two values, we need to find the min, max and average number of lab procedures.
?With this data, we can create 3 categories: Few, Average and Many lab procedures.
Once the categories have been created, it’s easier to find the average time in the hospital per number of lab procedures. We will also group this data by category.
5. Time spent in the hospital by emergency patients
How many emergency patients spend less than the average time in the hospital? To answer that question, first we have to find what the number of emergency patients is. We can filter by the admission type id, which for emergency patients is 1.
Now, we can create a filter that will show us the number of patients that spent less than the average time in the hospital:
Findings:
1.??? Most patients spend less than 7 days in the hospital, with a total of 80617 vs 21149 that went over the 7 days mark.
2.??? The busiest medical specialties by average number of procedures are:
Surgery-Thoracic: 3.5
Surgery-Cardiovascular/Thoracic: 3.25
Radiologist: 3.24
3.??? By looking at the data, we can conclude that there is no race bias in the number of lab procedures. All the numbers are close enough to each other.
4.??? We’ve identify a correlation between the number of lab procedures and the time spent in the hospital. The longer a patient spends at the hospital, the more lab procedures they have.
5.??? From 53990 emergency patients, 33684 spent less than the average time in the hospital. That’s a 62.3% of patients that left the facilities earlier than the average time.