XGBoost model for predicting mortality in chronic kidney disease and the importance of the top 10 features
We recently published this article from a project with the Assistant Secretary for Technology Policy, Booz Allen Hamilton, and the University of California, San Francisco, on our models for predicting mortality for chronic kidney disease patients within 90 days of dialysis. The goal of this project was to develop a high-quality training dataset and demonstrate some of the different types of models that can be built from it. The data was cleaned and organized in R, and the XGBoost model was also created in R. The code is at the GitHub link below.
This dataset was obtained from the USRDS and contained 188 features (predictors). I expected this rich feature set to give us a serious advantage, so I was surprised that when we ran the XGBoost model with only the top 10 features (most important according to XGBoost), the c-statistic (AUC, area under the curve) was not much lower than that of the full model (c = 0.78 vs. c = 0.826). Another thing we tested was having XGBoost natively handle the missing data (for continuous features) vs. creating multiple imputations (MICE). The AUCs for these two models were very similar (native c = 0.826 vs. imputed c = 0.827). The clinicians were not as surprised by this, as they understand the clinical use case at a much deeper level than I ever will. As a data scientist, I thought the predictive power of these 10 features (for this dataset) was impressive and interesting. For details, see the article or the code, or send me any questions that you have.
Data Scientist, Applied Machine Learning | ex-Wayfair, DataRobot, TXU | Co-founder
Thanks Summer! I noticed there was no discussion of data censoring, which is often a factor in survival analysis. Was data censoring a consideration?