Can machine learning help predict opioid addiction?
Health informatics is generating huge amounts of data at a rapid pace, from electronic medical records (EMRs) and clinical research data to population-level public health data. In 2014, over 2 million Americans were dependent on or abused prescription opioids such as oxycodone or hydrocodone (CDC, 2017), and overdose deaths from prescription opioids have quadrupled since 1999, resulting in more than 180,000 deaths between 1999 and 2015 (NIDA, 2017; Rudd et al., 2016). For millions of people struggling with substance abuse, addiction and relapse are chronic health conditions. What are the risk factors of opioid addiction for patients prescribed pain medications for routine medical procedures? How can the tools and techniques of data science help address the opioid crisis? One approach is to use supervised learning to identify demographic characteristics and features important for predicting prescription pain reliever abuse.
In a recent project, I compared the performance of several supervised learning procedures (e.g., linear models, decision trees, and random forests) on data from the 2015 National Survey on Drug Use and Health (NSDUH; https://datafiles.samhsa.gov). The NSDUH is a comprehensive survey on all aspects of substance use, misuse, abuse, dependency, and addiction for a wide range of both prescription medications and illicit drugs. It includes a number of demographic characteristics (e.g., age, education level, employment, marital status), mental health attributes (e.g., adult depression), substance treatment, and mental health treatment. The NSDUH-2015 dataset consists of 57,146 observations with over 2,000 features, many of which are binary (yes/no) responses: "Have you used X in the past year?" (e.g., Hydrocodone, Oxycodone, Tramadol, Morphine, Fentanyl, Oxymorphone, Demerol, Hydromorphone). Several aggregated variables were constructed for Any Pain Reliever Use, Pain Reliever Misuse and Abuse (Likert scale, 0-9), Heroin Use, Tranquilizer Use, Sedative Use, Cocaine Use, Amphetamine Use, etc.
First, a lasso regression model (L1 penalty) was fit to the data using the glmnet package in R (Hastie et al., 2009), which automatically calculates coefficient estimates over a wide range of lambda values. As lambda increases, the lasso forces the coefficients of more and more of the less relevant predictors to be exactly zero. The lasso has an advantage over ridge regression in that the resulting coefficient estimates are sparse, so only a subset of the predictors is selected in the model. Cross-validation was used to select the optimal value of lambda, and the features with the largest coefficients were, in descending order: Substance treatment, Heroin use, Cocaine use, Amphetamine use, and Tranquilizer use.
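To illustrate how the L1 penalty produces sparse coefficients, here is a minimal sketch in plain Python. It is not the glmnet code used for the analysis: it is a bare-bones cyclic coordinate descent with soft-thresholding, run on synthetic toy data. The function names, the toy data, and the three lambda values are all assumptions made for illustration.

```python
import random

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; anything within [-gamma, gamma] becomes exactly 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/(2n)) * sum((y - X b)^2) + lam * sum(|b|) by cyclic coordinate descent.

    No intercept is fit, so the toy features and response are generated roughly centered.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_ms = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of feature j with the partial residual that leaves feature j out
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_ms[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta

# Synthetic toy data (NOT the NSDUH survey): only the first two of five features matter.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 0.5) for row in X]

# Fit at a small, a moderate, and a very large penalty (values chosen for illustration).
coefs_by_lambda = {lam: lasso_coordinate_descent(X, y, lam) for lam in (0.01, 0.5, 5.0)}
for lam, beta in coefs_by_lambda.items():
    print(lam, [round(b, 3) for b in beta])
```

In this toy setup the irrelevant features drop out at the moderate penalty and every coefficient is zero at the largest one. In practice lambda would be chosen by cross-validation, as described above, rather than picked by hand.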
Decision trees are commonly used for classification or regression and provide a solution that is easy to interpret. A decision tree regression model, pre-pruned to a maximum depth of 4, was fit to the data. Substance treatment was selected as the root node at the top of the tree, with the branch to the left (low or no treatment) further dividing by Cocaine use. High scores for cocaine use branched further according to Heroin use, which ended in terminal leaf nodes. This indicates that individuals who reported using illicit drugs such as cocaine and heroin were also likely to abuse prescription pain medication. Following the right branch from the root node, high scores for treatment then divided according to Tranquilizer use, suggesting that for individuals who received treatment, prescription tranquilizer use was associated with abuse of opioid pain relievers.
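The way a depth-limited regression tree picks its root node and branches can be sketched in a few lines of Python. This is an illustrative toy, not the model fit in the project: the three binary feature names and the made-up response (3 points for treatment, 1 for cocaine, none for heroin) are assumptions. They are chosen only so that the greedy split selection, which minimizes the within-node sum of squared errors, is easy to follow.

```python
from itertools import product

FEATURES = ["treatment", "cocaine", "heroin"]  # hypothetical names for the toy data

def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, targets):
    """Return (sse, feature index, threshold) for the split with the lowest total SSE."""
    best = None
    for j in range(len(rows[0])):
        # Every distinct value except the largest is a candidate threshold
        for threshold in sorted(set(row[j] for row in rows))[:-1]:
            left = [t for row, t in zip(rows, targets) if row[j] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[j] > threshold]
            total = sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, j, threshold)
    return best

def build_tree(rows, targets, max_depth, depth=0):
    """Grow a regression tree greedily, stopping at max_depth (pre-pruning)."""
    leaf = {"value": sum(targets) / len(targets)}
    if depth >= max_depth or sse(targets) == 0.0:
        return leaf
    split = best_split(rows, targets)
    if split is None:  # no feature has more than one distinct value left
        return leaf
    _, j, threshold = split
    left = [(r, t) for r, t in zip(rows, targets) if r[j] <= threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[j] > threshold]
    return {
        "feature": j,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [t for _, t in left], max_depth, depth + 1),
        "right": build_tree([r for r, _ in right], [t for _, t in right], max_depth, depth + 1),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf mean."""
    while "feature" in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["value"]

# Toy data (NOT survey data): every combination of three yes/no answers.
rows = [list(combo) for combo in product([0, 1], repeat=3)]
targets = [3.0 * t + 1.0 * c for t, c, _ in rows]

tree = build_tree(rows, targets, max_depth=4)
print("root splits on:", FEATURES[tree["feature"]])
print("left branch splits on:", FEATURES[tree["left"]["feature"]])
```

Because the toy response depends most strongly on the first feature, the root split lands there, and the left branch then divides on the second feature. The real survey data are noisier and have far more candidate features, but the selection rule at each node is the same idea.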
A random forest is an ensemble method that builds many decorrelated trees and then averages them to reduce variance. The advantage of random forests is that they usually provide a more accurate model than a single tree, although the result can be more difficult to interpret. A random forest regression model was fit on pain reliever misuse and abuse with 500 trees and three variables considered at each split (i.e., mtry = 3). The model accounted for 26 percent of the variance in opioid pain reliever misuse and abuse. The random forest model measures feature importance by the percent increase in MSE and the increase in node purity. The most important features selected for predicting pain reliever medication abuse were Tranquilizers, Treatment, Heroin use, Cocaine use, and Amphetamine use, in order of importance (tranquilizers and treatment had approximately equal ratings).
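The two ingredients described above, bootstrap resampling and a random subset of mtry features at each split, can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the model from the project: it uses synthetic data, 50 trees instead of 500 to keep it fast, and hypothetical helper names. It also computes the proportion of variance explained (R squared) on held-out rows and compares the forest's error with the average error of its individual trees.

```python
import random

def mean(values):
    return sum(values) / len(values)

def sse(values):
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(rows, targets, mtry, rng, max_depth=4, depth=0):
    """Regression tree that considers only mtry randomly chosen features per split."""
    if depth >= max_depth or len(targets) < 4 or sse(targets) == 0.0:
        return mean(targets)  # a leaf is stored as a plain float
    best = None
    for j in rng.sample(range(len(rows[0])), mtry):
        for thr in sorted(set(r[j] for r in rows))[:-1]:
            left = [i for i, r in enumerate(rows) if r[j] <= thr]
            right = [i for i, r in enumerate(rows) if r[j] > thr]
            score = sse([targets[i] for i in left]) + sse([targets[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, thr, left, right)
    if best is None:  # none of the sampled features can split this node
        return mean(targets)
    _, j, thr, left, right = best
    return (j, thr,
            build_tree([rows[i] for i in left], [targets[i] for i in left],
                       mtry, rng, max_depth, depth + 1),
            build_tree([rows[i] for i in right], [targets[i] for i in right],
                       mtry, rng, max_depth, depth + 1))

def predict_tree(node, row):
    while isinstance(node, tuple):
        j, thr, left, right = node
        node = left if row[j] <= thr else right
    return node

def fit_forest(rows, targets, n_trees, mtry, seed=0):
    """Grow each tree on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(build_tree([rows[i] for i in idx], [targets[i] for i in idx], mtry, rng))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean([predict_tree(tree, row) for tree in forest])

def mse(predictions, actual):
    return mean([(p - a) ** 2 for p, a in zip(predictions, actual)])

# Synthetic toy data (NOT the NSDUH survey): six 0-3 scores, two of which drive the response.
data_rng = random.Random(1)
rows = [[data_rng.randint(0, 3) for _ in range(6)] for _ in range(300)]
targets = [2.0 * r[0] + r[1] + data_rng.gauss(0, 1) for r in rows]
train_X, test_X = rows[:200], rows[200:]
train_y, test_y = targets[:200], targets[200:]

forest = fit_forest(train_X, train_y, n_trees=50, mtry=3)
forest_mse = mse([predict_forest(forest, r) for r in test_X], test_y)
avg_tree_mse = mean([mse([predict_tree(t, r) for r in test_X], test_y) for t in forest])
r_squared = 1.0 - forest_mse / mse([mean(test_y)] * len(test_y), test_y)
print("forest MSE:", round(forest_mse, 3), "average single-tree MSE:", round(avg_tree_mse, 3))
print("held-out R squared:", round(r_squared, 3))
```

The averaged forest can never have a higher squared error than the average of its individual trees on the same rows, which is the variance-reduction argument in miniature. Importance measures such as the percent increase in MSE require out-of-bag permutation and are left out of this sketch.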
Comparing different supervised learning methods can be useful for deciding which model is the best choice for a given dataset. All of the models considered here selected the same five features as most important for predicting opioid pain reliever abuse; however, the models differed in which feature they ranked as most informative. A silver lining is that people who reported misusing prescription pain relievers were also likely to have received substance treatment. More than any demographic characteristic, the use of prescription tranquilizers and illicit drugs such as heroin, cocaine, or amphetamines was associated with the abuse of pain medications. The majority of respondents in the sample (90 percent) had never used pain reliever medication; approximately ten percent of the sample reported misusing opioid pain relievers, and only 1.6 percent reported ever using heroin. The opioid crisis may be driven in part by the widespread availability of pain medications and synthetic opioids. Additional evidence is needed in order to identify demographic characteristics associated with prescription opioid abuse. Even for people with no previous history of drug use, exposure to highly addictive opioid medications may put them at risk for opioid dependence or addiction.
References
Centers for Disease Control and Prevention (CDC, 2017). Drug overdose deaths in the United States continue to increase in 2015. https://www.cdc.gov/drugoverdose/epidemic/index.html
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. (2009). The Elements of Statistical Learning. Springer. https://web.stanford.edu/~hastie/ElemStatLearn/
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. (2013). An Introduction to Statistical Learning. Springer. https://www-bcf.usc.edu/~gareth/ISL/
National Institute on Drug Abuse (NIDA, 2017). Opioid Overdose Crisis. https://www.drugabuse.gov/drugs-abuse/opioids/opioid-overdose-crisis
Rose A. Rudd, Noah Aleshire, Jon E. Ziebell, and R. Matthew Gladden. (2016). Increases in Drug and Opioid Overdose Deaths — United States, 2000–2014. Centers for Disease Control and Prevention (CDC) Morbidity and Mortality Weekly Report (MMWR). January 1, 2016 / 64(50);1378-82. https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6450a3.htm