What Americans Think About Their Health
Using R to Explore the Behavioral Risk Factor Surveillance System (BRFSS)

Does Real World Evidence always hold the answer? That depends what you're looking for.

A limitation of claims data in the US is that it is a biased view of whatever is billable, and whatever is billable is whatever is seen at a point of care. It's also relatively expensive to get enough of this data to test across multiple variables. What about the EMR and its diligent record of patient visits? While an EMR may have an extensive system for collecting problems, complaints, and histories, and while large language models make the interpretation of SOAP notes ever more feasible, there isn't a very good way to see this data across systems and at the community level (yet).

So if you want to know how healthy Americans are, sometimes you just have to go right to the public and ask them. A data source that can't be appreciated enough is the CDC's Behavioral Risk Factor Surveillance System (BRFSS), affectionately called "BURR-fuss." Since 1984, the CDC has collected upwards of 320 health, behavior, and demographic variables from hundreds of thousands of respondents each year (445,132 in 2022, which is well under 1% of the US population). Information such as the prevalence of firearms in the home, chosen method of birth control, servings of fruits and vegetables eaten, and hours slept is collected alongside demographic information such as age, income, geographic location, education, and urban setting, as well as health status such as diagnosis of chronic diseases, height, weight, disability, and missed days of work and social events due to illness. Not only that, the data is available across decades.

So how can you get it?

SAS files are available on the BRFSS website so that you can play with all these variables and cases for each individual year, but for those of us using R or Python, there is a little more work involved. There is an ASCII file that you could read with read.fwf(), but the columns don't line up exactly as the published layout describes, so you might be confused about why your data doesn't turn out. Instead, I recommend the SAS transport (XPT) file and read_xpt() from the haven package. haven is installed along with the tidyverse, but library(tidyverse) does not attach it, so it has to be loaded separately.

This will give you all the cases (for 2022, this is 326 variables and 445,132 observations). But you still need to know what each of the variables and values means. For this, I scrape the codebook and produce brfss_codes.csv, which is tab-delimited despite the extension. If you want to improve the scraper or need it for subsequent years, you can download my code, brfss_code_read.js.

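library(tidyverse)
library(haven)  # read_xpt() lives in haven, which library(tidyverse) does not attach

# the codebook file is tab-delimited even though it is named .csv
brfss_codes <- read.csv('brfss_codes.csv', sep = '\t')
# the trailing space in the file name is kept as written; match it to the
# actual name of your extracted file
brfss <- read_xpt('LLCP2022.XPT ')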

So let's try some things out.

I should preface this by saying that you should read the manual carefully. For anything meaningful, you need to consider using the recommended weighting. Even a cursory analysis will show you that the data over-represents some regions, age groups, sexes, and education levels. Any methodology that stretches over years or compares regions needs to account for differences between years and regions. For today, I'm just going to look at some simple trends as examples; I'm not going to work on my dissertation.
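As a minimal sketch of why the weights matter, the toy example below compares an unweighted mean with a weighted one. The three rows are invented for illustration; `_LLCPWT` is the final weight column in the BRFSS file. Proper variance estimates also need the survey design (strata and sampling units), which this sketch ignores.

```r
# Toy data: three respondents with invented values.
# `_LLCPWT` mimics the BRFSS final weight column; check.names = FALSE
# keeps the leading underscore intact.
toy <- data.frame(
  MENTHLTH  = c(0, 10, 30),
  `_LLCPWT` = c(100, 300, 50),
  check.names = FALSE
)

# Every respondent counts equally
unweighted <- mean(toy$MENTHLTH)
# Each respondent counts in proportion to their weight
weighted <- weighted.mean(toy$MENTHLTH, w = toy$`_LLCPWT`)

print(c(unweighted = unweighted, weighted = weighted))
```

When the heavily weighted respondents differ from the rest, the two means diverge, which is the reason to apply the recommended weights before drawing conclusions.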

Note that some data is categorical, with options to account for those who didn't or couldn't answer, and some data is numerical, with special codes for certain response dispositions. This is important, as a mean() on any numerical field will be mean()-ingless if you don't remove those codes. Next, keep in mind that R doesn't accept bare variable names starting with an underscore, so you'll need to wrap those names in backticks, but you probably knew that.
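One way to handle those codes is to recode them before summarising. The helper below is a hypothetical sketch (the name `clean_days` is mine, not something from the BRFSS documentation) for the day-count fields, where 88 means "none" and 77 and 99 mark "don't know" and "refused."

```r
# Hypothetical helper: turn BRFSS day-count codes into usable numbers.
# 88 = none (0 days); 77 = don't know; 99 = refused.
clean_days <- function(x) {
  x <- as.numeric(x)
  x[x %in% c(77, 99)] <- NA   # non-answers become missing
  x[x %in% 88] <- 0           # "none" becomes zero days
  x
}

raw <- c(5, 88, 77, 30, 99, NA)   # invented responses
cleaned <- clean_days(raw)

print(cleaned)
print(mean(cleaned, na.rm = TRUE))
```

Recoding once up front means any later mean() only needs na.rm = TRUE rather than a filter on every field.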

In the following code, we're taking the mean of days of poor mental health and days of poor physical health, grouped into 5-year age groups (there are other groupings available, but I like this one for its granularity). Calculated and imputed variables start with an underscore, so _AGEG5YR is the variable for age groups, while MENTHLTH is the number of days of poor mental health and PHYSHLTH is the same for physical health. Note that the value 88 means none (zero), while 77 (don't know) and 99 (refused) are other dispositions, so they shouldn't be included in the mean. Only values from 1 to 30 are actual day counts.

I then join those results on the codes in brfss_codes. Note that I filter down to whatever variable I will want in my table or chart. Finally, I arrange it in the same order as the numeric value of the variable (this keeps the data in sequence if it is categorical; you could also use factors if you wanted). The rest is just drawing two charts on top of each other. This chart makes middle age look ideal.

brfss |>
  group_by(variable = as.character(`_AGEG5YR`)) |>
  # drop "don't know" (77), "refused" (99), and missing responses
  filter(!MENTHLTH %in% c(77, 99, NA),
         !PHYSHLTH %in% c(77, 99, NA)) |>
  summarise(
    # 88 means "none", so count it as zero days
    mntl_hlth_days =
      mean(if_else(MENTHLTH == 88, 0, as.numeric(MENTHLTH))),
    phys_hlth_days =
      mean(if_else(PHYSHLTH == 88, 0, as.numeric(PHYSHLTH)))
  ) |>
  # attach the readable age-group labels from the codebook
  left_join((brfss_codes |> filter(VariableName == '_AGEG5YR')),
            by = c('variable' = 'Value')) |>
  arrange(as.numeric(variable)) |>
  ggplot(aes(x = ValueLabel)) +
  # wide blue bars for physical health, narrow red bars drawn on top
  geom_col(aes(y = phys_hlth_days), fill = 'blue', width = .5) +
  geom_col(aes(y = mntl_hlth_days), fill = 'red', width = .2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(y = "Days of Ill Health",
       x = "Age Group",
       title = "Mean Days of Ill Health by Age Group",
       subtitle = "Red = Mental Health; Blue = Physical Health",
       caption = "CDC BRFSS 2022 National Survey Data")
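On the aside about factors: ggplot2 places a character axis in alphabetical order regardless of row order, which happens to work for the age labels but may not for other variables. Here is a minimal sketch, with invented labels standing in for rows from brfss_codes, of fixing the order with a factor:

```r
# Invented codebook rows; in practice these would come from brfss_codes.
codes <- data.frame(
  Value      = c("3", "1", "2"),
  ValueLabel = c("Good or better", "Poor", "Fair"),
  stringsAsFactors = FALSE
)

# Sort by the numeric code, then freeze that order as the factor levels.
codes <- codes[order(as.numeric(codes$Value)), ]
codes$ValueLabel <- factor(codes$ValueLabel, levels = unique(codes$ValueLabel))

print(levels(codes$ValueLabel))
```

Passing a factor column like this to aes(x = ...) makes ggplot keep the codebook order instead of sorting the labels.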
        


For the next example, I'm going to group by both age group and gender and make a simple population pyramid to show the percentage of respondents by age and gender who have been told they have a certain condition.

You'll see that I have to manually assign M and F for gender, as the value label in the codebook is longish. I clean up the data into something ggplot-able using pivot_longer() to make a column for conditions. Using ggplot(), I play with the bar charts to approximate a population pyramid: male percentages are plotted as negative values, the x-axis labels are overridden to show absolute values, and facet_wrap() draws one chart per condition.

brfss |>
  group_by(variable1 = as.character(`_AGEG5YR`),
           variable2 = as.character(SEXVAR)) |>
  filter(variable1 != '14') |> # remove unknown age group
  summarise(
    # share of each group coded 1 ("yes"); missing answers count as "no"
    has_asthma = mean(ASTHMA3 %in% 1),
    has_depression = mean(ADDEPEV3 %in% 1),
    has_diabetes = mean(DIABETE4 %in% 1),
    has_heart_disease = mean(CVDCRHD4 %in% 1),
    has_nonmel_cancer = mean(CHCOCNC1 %in% 1),
    has_arthritis = mean(HAVARTH4 %in% 1),
    .groups = 'drop') |>
  left_join((brfss_codes |> filter(VariableName == '_AGEG5YR')), by = c('variable1' = 'Value')) |>
  arrange(as.numeric(variable1), as.numeric(variable2)) |>
  mutate(
    `Age Group` = ValueLabel,
    sex = if_else(variable2 == '1', 'M', 'F')) |>
  # one row per age group, sex, and condition
  pivot_longer(
    cols = has_asthma:has_arthritis,
    names_to = "condition",
    names_pattern = "has_(.*)",
    values_to = "percentage") |>
  # plot males to the left of zero and females to the right
  ggplot(aes(y = `Age Group`,
             x = if_else(sex == 'M', -100 * percentage, 100 * percentage),
             fill = sex)) +
  geom_col() +
  facet_wrap(vars(condition)) +
  scale_x_continuous(labels = abs, limits = 70 * c(-1, 1)) +
  labs(x = "Percent of Respondents", y = "Age Group",
       caption = "CDC BRFSS 2022 National Survey Data")
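For readers new to pivot_longer(), here is a minimal sketch on an invented two-row table showing what the reshaping step does: each has_* column becomes its own row, and names_pattern strips the prefix. The numbers are made up for illustration.

```r
library(tidyr)

# Invented proportions for illustration only.
wide <- data.frame(
  sex = c("M", "F"),
  has_asthma = c(0.10, 0.15),
  has_arthritis = c(0.20, 0.30)
)

# Two rows with two condition columns become four rows.
long <- pivot_longer(
  wide,
  cols = has_asthma:has_arthritis,
  names_to = "condition",
  names_pattern = "has_(.*)",
  values_to = "percentage"
)

print(long)
```

The long shape is what lets facet_wrap() split the chart by condition, since the condition is now a column rather than a set of column names.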

This is the data by age group, but it is trivial to swap in _EDUCAG for level of education, _INCOMG1 for level of income, or _URBSTAT for urban v. rural counties to see the effect of just a few social determinants of health on outcomes over time.
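To make that swap less manual, the grouping column can be passed as an argument. The function below is a hypothetical sketch in base R (the name `prevalence_by` and the toy rows are assumptions for illustration); it returns the unweighted share of respondents coded 1 ("yes") within each level of whichever grouping variable you name.

```r
# Hypothetical helper: unweighted share answering "yes" (code 1) per group.
prevalence_by <- function(data, group_var, flag_var) {
  is_yes <- data[[flag_var]] %in% 1   # missing values count as "no"
  tapply(is_yes, data[[group_var]], mean)
}

# Invented rows; the column names mirror BRFSS variables.
toy <- data.frame(
  `_EDUCAG` = c(1, 1, 2, 2, 2, 2),
  ASTHMA3   = c(1, 2, 1, NA, 2, 2),
  check.names = FALSE
)

result <- prevalence_by(toy, "_EDUCAG", "ASTHMA3")
print(result)
```

Calling the same function with "_INCOMG1" or "_URBSTAT" as the second argument would repeat the comparison for income or urban status on the real data.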

With 326 variables and so many cases, is there any room for AI to generate some hypotheses about the state of American health? Yes! There are numerous ways to use the data to model what patients in specific areas of the country might be like and therefore what local health systems may need, or to create idealized patient profiles for more realistic case studies and simulations. The data and the methods you could deploy on it are also a rich source of needs assessment for quality improvement and educational programs.

What would you study with this data? Is this the kind of data resource you are looking for?

