What Americans Think About Their Health
Does Real World Evidence always hold the answer? That depends on what you're looking for.
A limitation of claims data in the US is that it is a biased view of whatever is billable, and whatever is billable is whatever is seen at a point of care. It's also relatively expensive to get enough of this data to test across multiple variables. What about the EMR and its diligent record of patient visits? While an EMR may have an extensive system for collecting problems, complaints, and histories, and while large language models make the interpretation of SOAP notes ever more feasible, there isn't a very good way to see this data across systems and at the community level (yet).
So if you want to know how healthy Americans are, sometimes you just have to go right to the public and ask them. A source of data for this that can't be appreciated enough is the CDC's "Behavioral Risk Factor Surveillance System" (BRFSS, affectionately called BURR-fuss). The CDC has run the survey since 1984, and a recent year covers upwards of 320 health, behavior, and demographic variables collected from several hundred thousand respondents (445,132 in 2022, a fraction of one percent of the US population). Information such as the prevalence of firearms in the home, chosen method of birth control, servings of fruits and vegetables eaten, and hours slept is collected alongside demographic information such as age, income, geographic location, education, and urban setting, as well as health status such as diagnosis of chronic diseases, height, weight, disability, and missed days of work and social events due to illness. Not only that, it is available over the course of decades.
So how can you get it?
SAS files are available on the BRFSS website so that you can play with all these variables and cases for each individual year, but for those of us using R or Python, there is a little more work involved. There is an ASCII file that you could read with read.fwf(), but the columns don't line up exactly as described here, so you might be confused about why your data doesn't turn out right. Instead, I recommend reading the SAS transport (XPT) file with read_xpt() from the haven package, which is installed with the tidyverse but has to be loaded separately with library(haven).
This will give you all the cases (for 2022, this is 326 variables and 445,132 observations). But you still need to know what each of the variables and values means. For this, I scrape the codebook and produce the tab-delimited file, brfss_codes.csv (tab-delimited despite the .csv extension). If you want to improve the scraper or need it for subsequent years, you can download my code, brfss_code_read.js.
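library(tidyverse)
library(haven)  # read_xpt() lives in haven, which library(tidyverse) does not attach
brfss_codes <- read.csv('brfss_codes.csv', sep = '\t')  # tab-delimited despite the extension
# Keep the trailing space only if your extracted file name really has one
brfss <- read_xpt('LLCP2022.XPT ')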
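Once the codes file is loaded, looking up what a value means is a simple filter. Here is a minimal sketch, assuming the columns VariableName, Value, and ValueLabel used later in this article; the toy rows and the helper name value_labels() are hypothetical stand-ins, not the real scraped file.

```r
# Toy stand-in for brfss_codes; labels are illustrative, not copied from the codebook
codes <- data.frame(
  VariableName = c("SEXVAR", "SEXVAR", "MENTHLTH", "MENTHLTH"),
  Value        = c("1", "2", "88", "77"),
  ValueLabel   = c("Male", "Female", "None", "Don't know/Not sure"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: named vector mapping Value -> ValueLabel for one variable
value_labels <- function(codes, variable_name) {
  rows <- codes[codes$VariableName == variable_name, ]
  setNames(rows$ValueLabel, rows$Value)
}

sex_labels <- value_labels(codes, "SEXVAR")
sex_labels[["2"]]  # label for SEXVAR == 2 in the toy table
```

Keeping Value as character, as above, matches how the joins later in this article treat it.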
So let's try some things out.
I should say up front that you should read the manual carefully. For anything meaningful, you need to consider using the recommended weighting. Even a cursory analysis will show you that the raw data over-represents some regions, age groups, sexes, and education levels. Any methodology that stretches over years or compares regions needs to take differences between years and regions into consideration. For today, I'm just going to look at some simple trends as examples; I'm not going to work on my dissertation.
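To give a feel for why weighting matters, here is a minimal sketch on made-up numbers. It assumes a weight column named `_LLCPWT` (the final weight variable in the BRFSS file) and uses base R's weighted.mean(). It only adjusts the point estimate; proper standard errors would also need the survey design, which this sketch ignores.

```r
# Toy data: four respondents with made-up days of poor mental health and weights
toy <- data.frame(
  MENTHLTH  = c(0, 2, 10, 30),
  `_LLCPWT` = c(100, 400, 250, 50),
  check.names = FALSE  # keep the leading underscore in the column name
)

# Every respondent counts equally
unweighted <- mean(toy$MENTHLTH)

# Each respondent counts in proportion to the people they represent
weighted <- weighted.mean(toy$MENTHLTH, w = toy$`_LLCPWT`)

c(unweighted = unweighted, weighted = weighted)
```

In this toy case the heavily weighted respondents report few bad days, so the weighted mean comes out lower than the unweighted one.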
Note that some data is categorical with options to account for those that didn't or couldn't answer, and some data is numerical, with categorical options for certain response dispositions. This is important, as a mean() on any numerical field will be mean()-ingless if you don't remove that data. Next, keep in mind that R doesn't like variables starting with an underscore, so you'll need to wrap those variables in backticks, but you probably knew that.
In the following code, we're taking the mean of days of poor mental health and days of poor physical health, grouped into 5-year age groups (there are other groupings available, but I like this one for the granularity). Calculated and imputed variables start with an underscore, so _AGEG5YR is the variable for age groups, while MENTHLTH is the number of days of poor mental health and PHYSHLTH the same for physical health. Note that the value 88 means none (zero), while 77 and 99 are other dispositions, so they shouldn't be included in the mean. Apart from 88, only values from 1 to 30 have meaning, as a count of days.
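A minimal sketch of the problem on toy data, assuming the MENTHLTH coding from the codebook (88 for none, 77 and 99 for don't know and refused). The helper name clean_days() is hypothetical.

```r
# Toy data: note the backticks needed for the underscore-prefixed column
toy <- data.frame(
  `_AGEG5YR` = c(1, 1, 2, 2, 2),
  MENTHLTH   = c(88, 5, 77, 99, 10),
  check.names = FALSE
)

# Hypothetical helper: turn disposition codes into NA and "none" into 0
clean_days <- function(x) {
  x[x %in% c(77, 99)] <- NA
  x[x %in% 88] <- 0
  x
}

# The naive mean averages the codes 88, 77, and 99 as if they were days
naive_mean <- mean(toy$MENTHLTH)

# The cleaned mean only uses real day counts
clean_mean <- mean(clean_days(toy$MENTHLTH), na.rm = TRUE)

# Mean per age group, referring to the grouping column with backticks
by_age <- tapply(clean_days(toy$MENTHLTH), toy$`_AGEG5YR`, mean, na.rm = TRUE)
```

The naive mean here lands well above 30, which is impossible for a count of days in a 30-day window and is a quick sign that codes have leaked into the calculation.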
I then join those results on the codes in brfss_codes. Note that I filter down to whatever variable I will want in my table or chart. Finally, I arrange it in the same order as the numeric value of the variable (this keeps the data in sequence if it is categorical; you could also use factors if you wanted). The rest is just drawing two charts on top of each other. This chart makes middle age look ideal.
brfss |>
  group_by(variable = as.character(`_AGEG5YR`)) |>
  # drop "don't know" (77), "refused" (99), and missing responses
  filter(!MENTHLTH %in% c(77, 99, NA),
         !PHYSHLTH %in% c(77, 99, NA)) |>
  summarise(
    # 88 means no days of poor health, so count it as zero
    mntl_hlth_days = mean(if_else(MENTHLTH == 88, 0, as.numeric(MENTHLTH))),
    phys_hlth_days = mean(if_else(PHYSHLTH == 88, 0, as.numeric(PHYSHLTH)))
  ) |>
  # attach the age-group labels from the codebook
  left_join(brfss_codes |> filter(VariableName == '_AGEG5YR'),
            by = c('variable' = 'Value')) |>
  arrange(as.numeric(variable)) |>
  ggplot(aes(x = ValueLabel)) +
  # wide blue bars for physical health, narrow red bars drawn on top for mental health
  geom_col(aes(y = phys_hlth_days), fill = 'blue',
           position = 'dodge', width = .5) +
  geom_col(aes(y = mntl_hlth_days), fill = 'red',
           position = 'dodge', width = .2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(y = "Days of Ill Health",
       x = "Age Group",
       title = "Mean Days of Ill Health by Age Group",
       subtitle = "Red = Mental Health; Blue = Physical Health",
       caption = "CDC BRFSS 2022 National Survey Data")
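One caveat on ordering: ggplot2 lays out a character column on a discrete axis alphabetically, regardless of row order, so arrange() by itself settles the order of a table but not necessarily of a chart. The factor approach mentioned above can be sketched as follows; the labels Low, Medium, and High are made up to show a case where alphabetical order would be wrong.

```r
# Toy summary table with made-up codes and labels
grp_summary <- data.frame(
  variable   = c("3", "1", "2"),
  ValueLabel = c("High", "Low", "Medium"),
  stringsAsFactors = FALSE
)

# Sort rows by the numeric code...
grp_summary <- grp_summary[order(as.numeric(grp_summary$variable)), ]

# ...then freeze that row order as the factor's level order
grp_summary$ValueLabel <- factor(grp_summary$ValueLabel,
                                 levels = unique(grp_summary$ValueLabel))

levels(grp_summary$ValueLabel)
```

A factor built this way keeps its level order when passed to aes(x = ValueLabel), so the bars follow the codebook sequence rather than the alphabet.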
For the next example, I'm going to group by both age group and gender and make a simple population pyramid to show the percentage of respondents by age and gender who have been told they have a certain condition.
You'll see that I have to manually assign M and F for gender as the value label is longish in the codebook. I clean up the data into something ggplot-able using pivot_longer() to make a column for conditions. Using ggplot() I play with the bar charts to approximate a population pyramid, overriding how the x-axis is designed, and use facet_wrap() to show one chart per condition.
brfss |>
  group_by(variable1 = as.character(`_AGEG5YR`),
           variable2 = as.character(SEXVAR)) |>
  filter(variable1 != '14') |>  # remove unknown age group
  summarise(
    # share of each group answering 1 (yes); any other answer, including missing, counts as no
    has_asthma        = sum(if_else(ASTHMA3 == 1, 1, 0, missing = 0)) / n(),
    has_depression    = sum(if_else(ADDEPEV3 == 1, 1, 0, missing = 0)) / n(),
    has_diabetes      = sum(if_else(DIABETE4 == 1, 1, 0, missing = 0)) / n(),
    has_heart_disease = sum(if_else(CVDCRHD4 == 1, 1, 0, missing = 0)) / n(),
    has_nonmel_cancer = sum(if_else(CHCOCNC1 == 1, 1, 0, missing = 0)) / n(),
    has_arthritis     = sum(if_else(HAVARTH4 == 1, 1, 0, missing = 0)) / n(),
    .groups = 'drop') |>
  left_join(brfss_codes |> filter(VariableName == '_AGEG5YR'),
            by = c('variable1' = 'Value')) |>
  arrange(as.numeric(variable1), as.numeric(variable2)) |>
  mutate(
    `Age Group` = ValueLabel,
    sex = if_else(variable2 == '1', 'M', 'F')) |>
  # one row per age group, sex, and condition
  pivot_longer(
    cols = has_asthma:has_arthritis,
    names_to = "condition",
    names_pattern = "has_(.*)",
    values_to = "percentage") |>
  # male bars extend to the left (negative x), female bars to the right
  ggplot(aes(y = `Age Group`,
             x = if_else(sex == 'M', -100 * percentage, 100 * percentage),
             fill = sex)) +
  geom_bar(stat = "identity") +
  facet_wrap(vars(condition)) +
  scale_x_continuous(labels = abs, limits = 70 * c(-1, 1)) +
  labs(x = "Percent of Population", y = "",
       caption = "CDC BRFSS 2022 National Survey Data")
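The reshaping step is the part most likely to trip people up, so here it is in isolation. This is a minimal sketch on made-up shares, using tidyr's pivot_longer() with the same names_pattern idea, followed by the sign flip that puts one sex on each side of the pyramid.

```r
library(tidyr)

# Toy wide table: one row per sex, one column per condition (made-up shares)
wide <- data.frame(
  sex          = c("M", "F"),
  has_asthma   = c(0.10, 0.15),
  has_diabetes = c(0.12, 0.11)
)

# One row per sex and condition; "has_" is stripped from the condition name
long <- pivot_longer(
  wide,
  cols = has_asthma:has_diabetes,
  names_to = "condition",
  names_pattern = "has_(.*)",
  values_to = "percentage"
)

# Mirror male values to the left of zero so the bars form a pyramid
long$signed_pct <- ifelse(long$sex == "M",
                          -100 * long$percentage,
                          100 * long$percentage)
long
```

Passing labels = abs to scale_x_continuous() then hides the negative signs, so both sides read as positive percentages.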
This is the data by age group, but it is trivial to swap _AGEG5YR out for _EDUCAG (level of education), _INCOMG1 (level of income), or _URBSTAT (urban v. rural counties) to see the effect of just a few social determinants of health on outcomes, and to repeat the exercise across survey years to see change over time.
With 326 variables and so many cases, is there any room for AI to generate hypotheses about the state of American health? Yes! There are numerous ways to use the data to model what patients in specific areas of the country might be like and therefore what local health systems may need, or to create idealized patient profiles for more realistic case studies and simulations. The data and the methods you could deploy on it are also a rich source of needs assessment for quality improvement and educational programs.
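One way to make the swap a one-word change is to pass the grouping column as a string. This is a minimal sketch on made-up data, assuming dplyr is available and that 1 codes a yes answer; the helper name share_yes() is hypothetical. Like the code above, it counts missing answers as no.

```r
library(dplyr)

# Toy data with made-up values; 1 = yes is assumed for ASTHMA3
toy <- tibble(
  `_AGEG5YR` = c(1, 1, 2, 2),
  `_EDUCAG`  = c(1, 2, 2, 2),
  ASTHMA3    = c(1, 2, 1, NA)
)

# Hypothetical helper: share answering yes, grouped by any column named as a string
share_yes <- function(df, group_var, cond_var) {
  df |>
    group_by(group = as.character(.data[[group_var]])) |>
    summarise(share = mean(.data[[cond_var]] %in% 1), .groups = "drop")
}

by_age  <- share_yes(toy, "_AGEG5YR", "ASTHMA3")
by_educ <- share_yes(toy, "_EDUCAG", "ASTHMA3")
```

The .data[[...]] pronoun also sidesteps the backtick issue, since the underscore-prefixed names are passed as ordinary strings.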
What would you study with this data? Is this the kind of data resource you are looking for?