XGBoost model for predicting mortality in chronic kidney disease and the importance of the top 10 features
Office of the National Coordinator for Health Information Technology

We recently published this article from a project with Assistant Secretary for Technology Policy, Booz Allen Hamilton, and University of California, San Francisco, on our models for predicting mortality in chronic kidney disease patients within 90 days of dialysis. The goal of the project was to develop a high-quality training dataset and to demonstrate some of the different types of models that can be built from it. The data were cleaned and organized in R, and the XGBoost model was also built in R. The code is at the GitHub link below.

The dataset was obtained from USRDS (United States Renal Data System) and contained 188 features (predictors). I expected that many rich features to give us a serious advantage, but to my surprise, when we ran the XGBoost model with only the top 10 features (ranked by XGBoost importance), the c-statistic (AUC, area under the ROC curve) was not much lower than for the full model (c=0.78 vs. c=0.826). We also compared letting XGBoost handle missing data natively (for continuous features) against multiple imputation (MICE). The AUCs of these two models were very similar (native c=0.826 vs. imputed c=0.827). The clinicians were not as surprised by this, as they understand the clinical use case at a much deeper level than I ever will. As a data scientist, I thought the predictive power of these 10 features (for this dataset) was impressive and interesting. For details, see the article or the code, or send me any questions you have.
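For readers less familiar with the c-statistic: it can be read as the probability that a randomly chosen patient who died gets a higher predicted risk than a randomly chosen patient who survived. Below is a minimal Python sketch of that definition. The project itself used R, and the function name `c_statistic` and the toy data are illustrative assumptions, not taken from the study or its code.

```python
def c_statistic(labels, scores):
    """Return the c-statistic (AUC) by direct pairwise comparison.

    Counts the fraction of (event, non-event) pairs in which the event
    case received the higher score; tied scores count as half a pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")

    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Made-up toy data (NOT from the study): 1 = died within 90 days, 0 = survived.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4]
toy_auc = c_statistic(toy_labels, toy_scores)
print(round(toy_auc, 3))
```

The pairwise loop is quadratic in the number of patients, so it is only meant to make the definition concrete; on a real dataset a rank-based routine from a statistics library would be the practical choice.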

Lucy Han, Rebecca Scherzer, Michelle Estrella, MD, MHS, Michael G. Shlipak

James Sanders

Data Scientist, Applied Machine Learning | ex-Wayfair, DataRobot, TXU | Co-founder

1 yr

Thanks Summer! I noticed there was no discussion of data censoring, which is often a factor in survival analysis. Was data censoring a consideration?
