
Feature Engineering for Health Analytics

Eddie Jay

FSA Actuary, Let's make Analytics POP!

Published: July 11, 2018

Feature engineering is an important step in analytics. Some may say it is THE most important step. I typically spend 50-70% of the time on an analytic exercise doing feature engineering.

What is it:

In the predictive modeling context, features are basically elements of the data. E.g. the age of patients, and the diagnoses and procedures they have had, are all features of the data.

Feature engineering is the process of creating new features based on the raw data.

This process typically requires domain expertise to identify the aspects of the data that are most relevant to the analysis. E.g. working with oncologists to create profiles of metastatic cancer based on line level claims and EHR data.

A fair amount of feature engineering will also be quantitatively driven, where you iteratively create and refine features based on predictive modeling output. E.g. splitting age into finer age groups after seeing that age as a whole is a highly predictive feature in a generalized linear model.

Purpose

Here are a few reasons for feature engineering:

Reduce errors:

Health data can contain many errors and structural inconsistencies. It pays to spend time up front cleaning the data and getting a good feel for what you're working with.

Reduce noise:

Raw data can contain a large number of highly specific codes. Furthermore, the accuracy of coding at such a specific level may be questionable.
E.g. there are 70,000+ ICD-10 diagnosis codes: while E11 indicates type 2 diabetes, E11.621 indicates type 2 diabetes with foot ulcer, which is very specific. E11 contains just over 100 codes, and there is only one E11.621 code. So if you use the full code, each ICD-10 code has far fewer data points behind it, which can reduce predictive power considerably.
Adding labels that aggregate individual codes reduces this dispersion and improves predictive power. E.g. in most analytic exercises, knowing whether a patient had type 2 diabetes is sufficient, so using E11 is preferable to using E11.621.
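As a rough illustration of this kind of roll-up, here is a minimal Python sketch that truncates full ICD-10 codes to their three-character category and compares the counts. The claim codes and the function name icd10_category are made up for the example; a real project would more likely use an established grouper than simple truncation.

```python
from collections import Counter


def icd10_category(code: str) -> str:
    """Return the three-character ICD-10 category, e.g. 'E11.621' -> 'E11'."""
    cleaned = code.strip().upper().replace(".", "")
    return cleaned[:3]


# Hypothetical diagnosis codes pulled from claim lines (illustrative only).
claim_codes = ["E11.621", "E11.9", "E11.65", "I10", "E11.621"]

# Counts at full-code level are spread thinly across specific codes...
full_code_counts = Counter(claim_codes)

# ...while counts at category level pool all type 2 diabetes codes together.
category_counts = Counter(icd10_category(c) for c in claim_codes)

print(full_code_counts["E11.621"])  # 2
print(category_counts["E11"])       # 4
```

The same four diabetes claim lines that were split across three different codes now sit under a single E11 feature.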

Predictive model training:

Features that aggregate over individual data elements remove noise. This makes it easier for predictive models to identify stronger effect sizes, and the models tend to train faster and achieve better goodness of fit.

Add insight layers:

Sometimes codes in their raw form are insufficient, so adding extra intelligence layers based on domain expertise may be useful.
E.g. some NDC codes indicate combinations of different ingredients: Exforge 10mg-320mg (NDC 00078049115) contains two ingredients, amlodipine and valsartan. The NDC alone would not identify both active ingredients.
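One way to picture such an insight layer is a lookup from NDC to active ingredients, from which per-ingredient flags are derived. The sketch below is only an illustration: the NDC_INGREDIENTS table and the ingredient_flags helper are hypothetical names, the table holds just the Exforge example from above, and in practice the mapping would come from a maintained drug reference source.

```python
# Hypothetical lookup table (illustrative only); a real one would be sourced
# from a maintained drug reference database.
NDC_INGREDIENTS = {
    "00078049115": ("amlodipine", "valsartan"),  # Exforge example from the text
}


def ingredient_flags(ndc_codes, ingredients_of_interest):
    """Return {ingredient: True/False} for a patient's list of NDC codes."""
    found = set()
    for ndc in ndc_codes:
        # NDCs missing from the table contribute no ingredients.
        found.update(NDC_INGREDIENTS.get(ndc, ()))
    return {name: name in found for name in ingredients_of_interest}


flags = ingredient_flags(
    ["00078049115"],
    ["amlodipine", "valsartan", "metformin"],
)
print(flags)
```

A single pharmacy claim line now yields two ingredient-level features instead of one opaque code.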

Actionable insight detection

When doing analyses, you typically have an idea in mind that you want to test, or at least some notion of where to find issues. If you can design features that mimic your hypotheses, then you can use analyses and predictive models to test whether your hypotheses are correct.
E.g. if you know a priori what some patients experience preceding onset of opioid abuse, you may be able to build these features from data and identify optimal intervention points to prevent the onset of opioid abuse.

How to do it

There are numerous types of data transforms that can be used in feature engineering. Here are a few examples:

Over time

Temporal measures allow detection of changes in events over time, which can be highly informative. E.g. whether someone had a stroke in the past year is highly informative of their risk of further strokes as well as their need for additional recuperative care in the community.
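A lookback flag of this kind might be built as in the sketch below. The event dates, the as-of date and the function name had_event_in_lookback are assumptions made for the example, and the 365-day window is just one possible definition of "the past year".

```python
from datetime import date, timedelta


def had_event_in_lookback(event_dates, as_of, lookback_days=365):
    """True if any event falls within `lookback_days` up to and including `as_of`."""
    window_start = as_of - timedelta(days=lookback_days)
    return any(window_start <= d <= as_of for d in event_dates)


# Hypothetical stroke dates for one patient (illustrative only).
stroke_dates = [date(2015, 1, 20), date(2017, 9, 3)]
as_of = date(2018, 7, 11)

stroke_past_year = had_event_in_lookback(stroke_dates, as_of)
print(stroke_past_year)  # True, because of the 2017-09-03 event
```

Varying lookback_days (e.g. 90, 365, 730) gives a family of related features whose usefulness can then be compared in the model.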

Intelligence/insight layers

As discussed above, adding aggregation layers reduces noise, while adding specific insight layers allows more targeted analyses. Most large insurers and EHR vendors have some sort of code categorization that you will come across and can use. But there are many situations in which you will need to create your own intelligence layer for a specific use case.

Boundaries/Norms

Setting boundaries can help you identify abnormal occurrences, e.g. a laboratory test result being too high or blood pressure being too low for a given patient. Conversely, you can specify whether a value is within the normal range. In medicine, these normal/abnormal thresholds are often specified in clinical guidelines.
You can thus convert continuous data into a binary feature indicating whether someone was above or below a given threshold. This is often a lot more informative than feeding all raw lab values or blood pressure readings into the analysis.
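A minimal sketch of such a threshold feature is shown below. The bounds in SYSTOLIC_BOUNDS are placeholders chosen for illustration and are not clinical guidance; the function name and readings are likewise made up. Real cut-offs should come from the relevant guideline.

```python
def flag_out_of_range(value, low, high):
    """Label a reading as 'low', 'high' or 'normal' relative to [low, high]."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


# Placeholder bounds for illustration only, NOT clinical guidance.
SYSTOLIC_BOUNDS = (90, 140)

# Hypothetical systolic blood pressure readings (mmHg).
readings = [85, 118, 152]

labels = [flag_out_of_range(r, *SYSTOLIC_BOUNDS) for r in readings]
is_abnormal = [label != "normal" for label in labels]

print(labels)       # ['low', 'normal', 'high']
print(is_abnormal)  # [True, False, True]
```

Keeping the three-way label as well as the binary flag lets you test whether "too low" and "too high" behave differently in the model.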

Dependent on predictive models

Different predictive models require different types of input. Logistic regression needs a binary outcome and works well with binary features; random forests can work with continuous variables directly, without binning; k-means clustering relies on distances between numeric values, so nominal variables need to be encoded first (or handled with a variant such as k-modes). So know which types of models might work best for your analysis, and then build the types of features that best feed those models.
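As one example of shaping features for a model, nominal variables are commonly turned into binary indicator columns (one-hot encoding) before being fed to models that expect numeric input. The sketch below is a bare-bones version using only the standard library; the plan types and the "is_" column naming are hypothetical, and most analytics libraries offer an equivalent built-in.

```python
def one_hot(values):
    """Turn a list of nominal values into one binary indicator dict per row."""
    categories = sorted(set(values))
    rows = [
        {f"is_{category}": int(value == category) for category in categories}
        for value in values
    ]
    return categories, rows


# Hypothetical nominal variable (illustrative only).
plan_types = ["HMO", "PPO", "HMO", "EPO"]

categories, encoded = one_hot(plan_types)
print(categories)  # ['EPO', 'HMO', 'PPO']
print(encoded[0])  # {'is_EPO': 0, 'is_HMO': 1, 'is_PPO': 0}
```

For regression models you would usually drop one indicator column as the reference level to avoid perfectly collinear inputs.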

Ultimately, feature engineering is an iterative exercise. You build an initial set of features, do some analyses, learn which features are more useful, build additional subsets of features, and so on. E.g. you may find age is a useful variable. Then you split age into >=65 and <65. If you then find that the 65-and-over group is important, you might split it further into 5-year age bands.
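The age example could be sketched as follows. The function name age_band, the open-ended "90+" top band and the label format are assumptions made for illustration; the split at 65 and the 5-year width follow the example above.

```python
def age_band(age, split=65, width=5, top=90):
    """Assign '<65' below the split, 5-year bands above it, and '90+' at the top."""
    if age < split:
        return f"<{split}"
    if age >= top:
        return f"{top}+"
    # Lower edge of the band that contains this age, counted from the split.
    lower = split + ((age - split) // width) * width
    return f"{lower}-{lower + width - 1}"


# Hypothetical patient ages (illustrative only).
ages = [42, 65, 72, 89, 93]
print([age_band(a) for a in ages])
# ['<65', '65-69', '70-74', '85-89', '90+']
```

Because the split, width and top band are parameters, each iteration of the analysis can try a different banding without rewriting the feature.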

Thanks for reading! Please subscribe.

