Survival Analysis (II) in business: Simulated Data a Machine Learning Perspective

1. Introduction

Survival analysis is a statistical method commonly used in business and microeconomics to analyze the time until an event or failure occurs, which can help inform decision-making and optimize resource allocation. Here are some examples:

  1. Customer churn: Survival analysis can be used to predict when a customer is likely to stop using a product or service, or "churn." This information can help businesses design targeted retention strategies to keep customers engaged.
  2. Credit risk: Survival analysis can also be used to predict the probability of default on a loan or credit card. Banks and other lenders can use this information to adjust interest rates or take other measures to minimize their risk exposure.
  3. Product failure: Survival analysis can be used to predict when a product is likely to fail or require maintenance. This information can help manufacturers plan for repairs or replacements and reduce the risk of customer dissatisfaction.
  4. Time to market: Survival analysis can also be used to analyze the time it takes for a new product or service to enter the market successfully. This information can help companies optimize their product launch strategies and reduce time to market.
  5. Employee retention: Survival analysis can be used to predict the probability of an employee leaving a company, which can help organizations design targeted retention strategies and reduce turnover rates.

Overall, survival analysis can be a valuable tool in business and microeconomics for predicting the timing of events or failures and informing decision-making.

Imagine that for this analysis we generate 10 variables (X1...X10) from the following families:

  1. Time: This is the time between a certain event and the occurrence of another event, such as the time between a product launch and the product's failure or the time between customer acquisition and churn.
  2. Customer demographics: Variables such as age, gender, income, and location can be used to segment customers and identify patterns in churn or other behaviors.
  3. Customer behavior: Variables such as purchase frequency, purchase value, and engagement levels can provide insights into customer loyalty and churn.
  4. Product or service quality: Variables such as product reviews, warranties, and repair history can be used to evaluate the quality of a product or service and predict its likelihood of failure.
  5. Competition: Variables such as market share, pricing, and customer preferences can provide insights into competitive pressures and market dynamics.
  6. Marketing and advertising: Variables such as ad spend, reach, and effectiveness can provide insights into the effectiveness of marketing and advertising campaigns.
  7. Economic conditions: Variables such as GDP, interest rates, and inflation can provide insights into the broader economic environment and its impact on business survival.
  8. Industry-specific variables: Variables such as regulatory changes, supply chain disruptions, and technological advancements can provide insights into industry-specific factors that may impact business survival.
  9. Financial metrics: Variables such as revenue, profit margins, and cash flow can provide insights into a company's financial health and its ability to withstand business disruptions.
  10. Management variables: Variables such as leadership effectiveness, employee morale, and organizational culture can provide insights into the internal factors that may impact business survival.

This analysis compares different machine learning models for survival analysis (time to event), using a data set (df) simulated with the coxed library.

A data frame of 2,000 observations has been generated, with 10 covariates and 30% of the data censored. The analysis is presented below.

2. Data Management

library(coxed)  # for sim.survdata()
library(dplyr)  # for rename(), mutate() and sample_frac()

df <- sim.survdata(N=2000, T=250, xvars=10, censor=.3, num.data.frames = 1)$data
# recode coxed's y/failed outcome columns to the time/status convention used below
df <- df %>% rename(time = y, status = failed) %>% mutate(status = as.integer(status))
head(df)
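
As a quick sanity check (not shown in the original article), the realized censoring share can be compared with the 30% target:

1 - mean(df$status)  # share of censored observations; should be close to 0.30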

set.seed(123)
data.train <- sample_frac(df, 0.7)               # 70% of rows for training
train_index <- as.numeric(rownames(data.train))
data.test <- df[-train_index, ]                  # remaining 30% for testing

library(survival)

# Survival object for the test set, used below to evaluate the fitted models
surv_obj <- Surv(data.test$time, data.test$status)

3. Importance of Variables (SHAP)

A head() of the encoded covariates shows a 6 x 11 sparse Matrix of class "dgCMatrix".

[Figure: SHAP variable importance. Source: own elaboration using xgboost]
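
The article does not show the code behind this figure. Below is a minimal sketch of one way such SHAP importances could be obtained, using xgboost's built-in survival:cox objective and its predcontrib = TRUE SHAP output; the matrix construction and hyperparameters are assumptions, not the author's original code:

library(xgboost)
library(Matrix)

# Encode covariates as a sparse model matrix, dropping outcome columns/intercept
X <- sparse.model.matrix(~ . - time - status - 1, data = data.train)
# survival:cox expects the label to be the time, negated for censored rows
y <- ifelse(data.train$status == 1, data.train$time, -data.train$time)

bst <- xgboost(data = X, label = y, objective = "survival:cox",
               nrounds = 100, verbose = 0)

# Per-observation SHAP contributions; mean |SHAP| per column ranks the variables
shap <- predict(bst, newdata = X, predcontrib = TRUE)
sort(colMeans(abs(shap[, colnames(X)])), decreasing = TRUE)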

4. Traditional Models

4.1 Kaplan-Meier and Cox Models

fit3 <- survfit(Surv(time, status) ~ 1, data = data.train)          # Kaplan-Meier estimator
fit4 <- coxph(Surv(time, status) ~ ., data = data.train, x = TRUE)  # Cox PH on all covariates
[Figure. Source: own elaboration]
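
Assuming the figure shows the Kaplan-Meier curve, it can be reproduced with the base plot method for survfit objects:

# Kaplan-Meier survival curve with its 95% confidence band
plot(fit3, xlab = "Time", ylab = "Survival probability", conf.int = TRUE)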

5. Machine Learning Models

5.1 MTLR Model

library(MTLR)

# Multi-task logistic regression over 9 discrete time intervals
fit5 <- mtlr(Surv(time, status)~., data = data.train, nintervals = 9)

5.2 Survival Random Forest

library(randomForestSRC)

# Random survival forest with default hyperparameters
fit6 <- rfsrc(Surv(time, status) ~ ., data.train)

5.3 DeepSurv Model

library(survivalmodels)  # deepsurv() wraps the Python pycox library via reticulate

fit7 <- deepsurv(data = data.train, frac = 0.5, activation = "relu",
                 num_nodes = c(4L, 8L, 4L, 2L), dropout = 0.3, early_stopping = TRUE,
                 batch_size = 32L, epochs = 100L)

5.4 Survival Kernel SVM

library(survivalsvm)

# Regression-type survival SVM; note the kernel used here is linear ("lin_kernel")
fit8 <- survivalsvm(Surv(time, status) ~ ., data = data.train, type = "regression",
                    gamma.mu = 1, opt.meth = "quadprog", kernel = "lin_kernel")

6. Models Ranking List

##     Cindex      Model
## 1 0.745106   DeepSurv
## 2 0.730064 RandForest
## 3 0.547318  KernelSVM
## 4 0.534930       MTLR
## 5 0.529489        Cox
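
The evaluation code is not shown in the article. A plausible sketch of how the test-set C-indexes could be computed, illustrated for the Cox model and the random survival forest (an assumption, not the author's original code), uses survival::concordance():

library(survival)

# Cox model: concordance() accepts the fitted model and new data directly
concordance(fit4, newdata = data.test)$concordance

# For models returning a risk score, rank predicted risks against the observed
# outcomes; reverse = TRUE because a higher risk implies shorter survival
risk <- predict(fit6, newdata = data.test)$predicted
concordance(surv_obj ~ risk, reverse = TRUE)$concordance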

7. Models Ranking Chart

[Figure: models ranked by C-index. Source: own elaboration]

8. Conclusion

Why do DeepSurv and Random Forest outperform Cox, MTLR, and KernelSVM, which, with C-indexes near 0.5, look like flipping a coin?

DeepSurv is a deep learning framework designed for survival analysis, and it has been shown to outperform traditional methods such as Kernel SVM and MTLR (Multi-Task Logistic Regression) in many cases; in this experiment it also edges out the Random Survival Forest.

The reason for DeepSurv's strong performance lies in its ability to capture complex nonlinear relationships between input features and survival outcomes. In contrast, the Cox model, the linear-kernel SVM, and MTLR assume an essentially linear or low-dimensional relationship between features and outcomes, which can be limiting in complex datasets.

Furthermore, DeepSurv is specifically designed for survival analysis, which means it can handle censored data (i.e., instances where the event is not observed within the study window, so the survival time is only partially known), whereas off-the-shelf machine learning methods may not handle such data effectively.

As for why Cox, Kernel SVM, and MTLR behave like a coin flip here (C-index near 0.5), the likely cause is their limited ability to capture the complex relationships present in this simulated survival data. When the relationship between features and outcomes is linear or low-dimensional, these methods can perform well; in more complex datasets, however, they may fail to capture the relevant structure, resulting in poor performance.

In summary, DeepSurv's ability to capture complex relationships and handle censored data makes it the strongest choice in this experiment, followed closely by the Random Survival Forest, while the Cox model, Kernel SVM, and MTLR struggle on this dataset.

9. Discussion

Suppose we now generate a data set with 80% of the observations censored (a sketch of this step is shown below). The models change their relative performance, as the following chart shows:
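
A minimal sketch of that regeneration step, reusing the simulation call from Section 2 (the object name df2 is illustrative):

# Same simulation settings as before, but with 80% of observations censored
df2 <- sim.survdata(N = 2000, T = 250, xvars = 10, censor = .8,
                    num.data.frames = 1)$data
df2 <- df2 %>% rename(time = y, status = failed) %>% mutate(status = as.integer(status))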

[Figure: models ranked by C-index with 80% censoring. Source: own elaboration]

What is the explanation?

When the proportion of censored data is high (such as 80%), traditional survival analysis models like Cox proportional hazards or accelerated failure time models may not perform well, as they assume that censoring is non-informative, i.e., that the censoring time is unrelated to the underlying survival time. This assumption is often violated when there is a high proportion of censored data.

Machine learning models like Random Forest and DeepSurv can handle censored data and do not require assumptions about the distribution of survival times. However, the models differ in their approach and strengths.

Random Forest is a non-parametric method that builds multiple decision trees and combines their predictions into a final prediction. It can handle missing data, capture complex interactions between predictors, and accommodate censored observations without requiring any assumptions about the distribution of survival times.

DeepSurv, on the other hand, is a neural-network approach specifically designed for survival analysis. It uses the Cox proportional hazards model as its foundation and replaces the linear risk function with a neural network to capture non-linear relationships. The network is trained by minimizing the negative Cox partial log-likelihood, and its performance is typically evaluated with the concordance index (C-index), a measure of how well the model ranks individuals' survival times.
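
For reference, the training objective in the original DeepSurv paper (Katzman et al., 2018) is the regularized average negative Cox partial log-likelihood, where $E_i$ is the event indicator, $\hat{h}_\theta(x_i)$ the network's risk score, $N_{E=1}$ the number of observed events, and $\mathcal{R}(T_i)$ the set of individuals still at risk at time $T_i$:

$$
l(\theta) = -\frac{1}{N_{E=1}} \sum_{i:\,E_i=1} \Big( \hat{h}_\theta(x_i) - \log \sum_{j \in \mathcal{R}(T_i)} e^{\hat{h}_\theta(x_j)} \Big) + \lambda \lVert \theta \rVert_2^2
$$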

While both Random Forest and DeepSurv can handle censored data, which model is better depends on the specific dataset and the research question. In general, Random Forest may be preferred when the dataset has many predictors, some with non-linear relationships to the outcome, and when the research question does not require interpreting individual predictor effects.

In the case of a high proportion of censored data (such as 80%), Random Forest may be better than DeepSurv as it does not require any assumptions about the distribution of survival times and can handle complex interactions between predictors.

其他会员也浏览了