Survival Analysis (II) in Business: Simulated Data from a Machine Learning Perspective
Diego Vallarino, PhD (he/him)
Quant | Algorithmic Forensics | Data Anthropologist | Ex-Coface, Scotiabank & Equifax | PhD, MSc, MBA | EB1A Green Card | Author: “Survival Model for Economics Analysis” (Amazon)
1. Introduction
Survival analysis is a statistical method commonly used in business and microeconomics to analyze the timing of events or failures, which can help inform decision-making and optimize resource allocation. Typical examples include the time until a customer churns, until a borrower defaults on a loan, or until a machine or component fails.
Overall, survival analysis can be a valuable tool in business and microeconomics for predicting the timing of events or failure and informing decision-making.
This analysis compares different machine learning models for survival analysis (time to event), using a data set simulated with the coxed library. We generate ten covariates (X1...X10) and a data frame (df) of 2,000 observations, with 30% of the observations censored. The analysis is presented below.
2. Data Management
library(coxed)     # sim.survdata()
library(dplyr)     # sample_frac()
library(survival)  # Surv()

df <- sim.survdata(N = 2000, T = 250, xvars = 10, censor = .3, num.data.frames = 1)
df <- df$data
# sim.survdata() names the outcome columns y (time) and failed (event flag);
# rename them to the time/status convention used in the models below
df <- df %>% rename(time = y, status = failed)
head(df)

set.seed(123)
data.train <- sample_frac(df, 0.7)
train_index <- as.numeric(rownames(data.train))
data.test <- df[-train_index, ]
surv_obj <- Surv(data.test$time, data.test$status)
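A quick sanity check on the split (a sketch; it assumes the event indicator `status` is coded 0/FALSE for censored records):

```r
# Share of censored observations in the training split; with censor = .3
# this should land close to 0.30
mean(data.train$status == 0)
```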
3. Importance of Variables (SHAP)
## 6 x 11 sparse Matrix of class "dgCMatrix"
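The sparse-matrix printout above is consistent with building a design matrix for xgboost and computing SHAP contributions from it. A minimal sketch of that workflow (the original code is not shown, so treat the names and settings here as assumptions):

```r
library(xgboost)
library(Matrix)

# Sparse design matrix: 10 covariates plus intercept -> 11 columns
X <- sparse.model.matrix(~ . - time - status, data = data.train)
head(X)  # a "6 x 11 sparse Matrix of class 'dgCMatrix'"

# xgboost's Cox objective encodes censoring in the label's sign:
# positive = event time, negative = censored time
label <- ifelse(data.train$status == 1, data.train$time, -data.train$time)
bst <- xgboost(data = X, label = label, objective = "survival:cox",
               nrounds = 100, verbose = 0)

# predcontrib = TRUE returns per-feature SHAP contributions (last column is
# the bias term); mean absolute SHAP gives a variable-importance ranking
shap <- predict(bst, X, predcontrib = TRUE)
sort(colMeans(abs(shap)), decreasing = TRUE)
```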
4. Traditional Models
4.1 Kaplan-Meier and Cox Models
fit3 <- survfit(Surv(time, status) ~ 1, data = data.train)
fit4 <- coxph(Surv(time, status) ~ ., data = data.train, x = TRUE)
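The fitted Cox model can be scored on the held-out set with survival::concordance, one way to obtain the C-index reported later:

```r
# Harrell's concordance of the Cox fit on the test set
concordance(fit4, newdata = data.test)
```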
5. Machine Learning Models
5.1 MTLR Model
fit5 <- mtlr(Surv(time, status)~., data = data.train, nintervals = 9)
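MTLR predicts a full survival curve per subject. A hedged example of scoring the test set (argument names follow the MTLR package; check ?predict.mtlr):

```r
# Predicted survival curves for the test observations; other type options
# in the MTLR package include "mean_time" and "prob_event"
curves <- predict(fit5, data.test, type = "survivalcurve")
head(curves[, 1:5])  # the first column holds the time grid
```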
5.2 Survival Random Forest
fit6 <- rfsrc(Surv(time, status) ~ ., data.train)
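The forest can be evaluated directly, since predict.rfsrc computes an error rate (1 − C-index) whenever the test data carry the outcome:

```r
# Score the survival forest on the test set; err.rate is 1 - Harrell's C,
# so the concordance is recovered as 1 - err.rate
pred_rf <- predict(fit6, newdata = data.test)
1 - tail(na.omit(pred_rf$err.rate), 1)
```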
5.3 DeepSurv Model
fit7 <- deepsurv(data = data.train, frac = 0.5, activation = "relu",
                 num_nodes = c(4L, 8L, 4L, 2L), dropout = 0.3,
                 early_stopping = TRUE, batch_size = 32L, epochs = 100L)
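Note that deepsurv() from the survivalmodels package runs on top of Python's pycox/torch through reticulate, so a configured Python environment is required. Scoring the test set (a sketch; type = "risk" returns relative risks, with higher values meaning shorter expected survival):

```r
# Predicted relative risks for the held-out observations
risk <- predict(fit7, newdata = data.test, type = "risk")
head(risk)
```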
5.4 Survival Kernel SVM
fit8 <- survivalsvm(Surv(time, status) ~ ., data = data.train,
                    type = "regression", gamma.mu = 1,
                    opt.meth = "quadprog", kernel = "lin_kernel")
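survivalsvm returns predicted ranks rather than survival curves. One way to turn them into a C-index (Hmisc::rcorr.cens is an assumption here; any concordance implementation works):

```r
# predict.survivalsvm stores the predictions in $predicted;
# rcorr.cens expects higher values to mean longer survival
pred_svm <- predict(fit8, newdata = data.test)
Hmisc::rcorr.cens(as.numeric(pred_svm$predicted), surv_obj)["C Index"]
```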
6. Models Ranking List
##     Cindex      Model
## 1 0.529489        Cox
## 2 0.534930       MTLR
## 3 0.730064 RandForest
## 4 0.745106   DeepSurv
## 5 0.547318  KernelSVM
7. Models Ranking Chart
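The original plotting code is not shown; one way the chart could be produced from the table above (a sketch using ggplot2):

```r
library(ggplot2)

ranking <- data.frame(
  Model  = c("Cox", "MTLR", "RandForest", "DeepSurv", "KernelSVM"),
  Cindex = c(0.529489, 0.534930, 0.730064, 0.745106, 0.547318)
)

ggplot(ranking, aes(x = reorder(Model, Cindex), y = Cindex)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "C-index", title = "Model ranking by C-index")
```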
8. Conclusion
Why does DeepSurv outperform the other models, and why do Cox, MTLR, and the Kernel SVM perform barely better than a coin flip (C-index ≈ 0.5)?
DeepSurv is a deep learning framework designed for survival analysis that has been shown to outperform more traditional methods in many settings. Its advantage lies in its ability to capture complex nonlinear relationships between input features and survival outcomes. In contrast, the Cox model and MTLR assume a (log-)linear relationship between features and risk, and the SVM fitted here uses a linear kernel, so all three are restricted to linear, low-dimensional structure. The Survival Random Forest, like DeepSurv, is nonparametric and can capture interactions, which is consistent with its much higher C-index (0.73).
Furthermore, DeepSurv is specifically designed for survival analysis, which means it handles censored data (instances where the event time is not observed) natively.
As for why Cox, Kernel SVM, and MTLR behave almost like a coin flip, this likely reflects their limited ability to capture the nonlinear relationships present in the simulated data. When the relationship between features and outcomes is truly linear or low-dimensional, these methods can perform well; in more complex data, they fail to extract the relevant signal.
In summary, DeepSurv's ability to capture complex relationships while handling censored data makes it the strongest model in this experiment, followed closely by the Survival Random Forest, while the linear models (Cox, MTLR, and the linear-kernel SVM) struggle.
9. Discussion
Suppose we now generate a df with 80% of the data censored. The models' performance changes, as shown in the following chart:
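Regenerating the data at the higher censoring rate only changes the censor argument (a sketch; sim.survdata labels the raw outcome columns y and failed):

```r
# Same simulation as before, now with 80% of observations censored
df80 <- sim.survdata(N = 2000, T = 250, xvars = 10, censor = .8,
                     num.data.frames = 1)$data
mean(!df80$failed)  # share of censored records, close to 0.80
```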
What is the explanation?
When the proportion of censored data is high (such as 80%), traditional survival models like the Cox proportional hazards or accelerated failure time models may not perform well, since they assume that censoring is non-informative, i.e., that the censoring time is unrelated to the underlying survival time. This assumption is often violated when such a large share of the data is censored.
Machine learning models like Random Forest and DeepSurv can handle censored data and do not require assumptions about the distribution of survival times. However, the models differ in their approach and strengths.
Random Forest is a non-parametric method that builds multiple decision trees and combines their predictions to produce a final prediction. Random Forest can handle missing data and can deal with complex interactions between predictors. Additionally, Random Forest can handle censored data without requiring any assumptions about the distribution of survival times.
DeepSurv, on the other hand, is a neural network-based approach specifically designed for survival analysis. DeepSurv uses the Cox proportional hazards model as its foundation and adds a neural network component to capture non-linear relationships. It is trained by minimizing the negative Cox partial log-likelihood, and its ranking performance is evaluated with the concordance index (C-index), a measure of how well the model orders individuals' survival times.
While both Random Forest and DeepSurv can handle censored data, which model is better depends on the specific dataset and the research question. In general, Random Forest may be preferred when the dataset has many predictors, some of which may have non-linear relationships with the outcome. Additionally, Random Forest may be preferred when the research question does not require interpretation of the individual predictor effects.
In the case of a high proportion of censored data (such as 80%), Random Forest may be better than DeepSurv as it does not require any assumptions about the distribution of survival times and can handle complex interactions between predictors.