Survival Analysis (II) in business: Simulated Data a Machine Learning Perspective

1. Introduction

Survival analysis is a statistical method commonly used in business and microeconomics to analyze the time until an event or failure occurs, which can help inform decision-making and optimize resource allocation. Here are some examples:

  1. Customer churn: Survival analysis can be used to predict when a customer is likely to stop using a product or service, or "churn." This information can help businesses design targeted retention strategies to keep customers engaged.
  2. Credit risk: Survival analysis can also be used to predict the probability of default on a loan or credit card. Banks and other lenders can use this information to adjust interest rates or take other measures to minimize their risk exposure.
  3. Product failure: Survival analysis can be used to predict when a product is likely to fail or require maintenance. This information can help manufacturers plan for repairs or replacements and reduce the risk of customer dissatisfaction.
  4. Time to market: Survival analysis can also be used to analyze the time it takes for a new product or service to enter the market successfully. This information can help companies optimize their product launch strategies and reduce time to market.
  5. Employee retention: Survival analysis can be used to predict the probability of an employee leaving a company, which can help organizations design targeted retention strategies and reduce turnover rates.

Overall, survival analysis can be a valuable tool in business and microeconomics for predicting the timing of events or failures and informing decision-making.

Imagine that for this analysis we generate 10 variables (X1...X10) from the following families:

  1. Time: This is the time between a certain event and the occurrence of another event, such as the time between a product launch and the product's failure or the time between customer acquisition and churn.
  2. Customer demographics: Variables such as age, gender, income, and location can be used to segment customers and identify patterns in churn or other behaviors.
  3. Customer behavior: Variables such as purchase frequency, purchase value, and engagement levels can provide insights into customer loyalty and churn.
  4. Product or service quality: Variables such as product reviews, warranties, and repair history can be used to evaluate the quality of a product or service and predict its likelihood of failure.
  5. Competition: Variables such as market share, pricing, and customer preferences can provide insights into competitive pressures and market dynamics.
  6. Marketing and advertising: Variables such as ad spend, reach, and effectiveness can provide insights into the effectiveness of marketing and advertising campaigns.
  7. Economic conditions: Variables such as GDP, interest rates, and inflation can provide insights into the broader economic environment and its impact on business survival.
  8. Industry-specific variables: Variables such as regulatory changes, supply chain disruptions, and technological advancements can provide insights into industry-specific factors that may impact business survival.
  9. Financial metrics: Variables such as revenue, profit margins, and cash flow can provide insights into a company's financial health and its ability to withstand business disruptions.
  10. Management variables: Variables such as leadership effectiveness, employee morale, and organizational culture can provide insights into the internal factors that may impact business survival.

This analysis compares different machine learning models for survival analysis (time to event), using a data set (df) simulated with the coxed library.

A data frame of 2,000 observations has been generated, with 10 covariates and 30% of the data censored. The analysis is presented below.

2. Data Management

library(coxed)  # for sim.survdata()
library(dplyr)  # for rename(), mutate() and sample_frac()

df <- sim.survdata(N=2000, T=250, xvars=10, censor=.3, num.data.frames = 1)$data
# recode coxed's y/failed outcome columns to the time/status convention used below
df <- df %>% rename(time = y, status = failed) %>% mutate(status = as.integer(status))
head(df)
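
As a quick sanity check (not shown in the original article), the realized censoring share can be compared with the 30% target:

1 - mean(df$status)  # share of censored observations; should be close to 0.30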

set.seed(123)
data.train <- sample_frac(df, 0.7)               # 70% of rows for training
train_index <- as.numeric(rownames(data.train))
data.test <- df[-train_index, ]                  # remaining 30% for testing

library(survival)

# Survival object for the test set, used below to evaluate the fitted models
surv_obj <- Surv(data.test$time, data.test$status)

3. Importance of Variables (SHAP)

A head() of the encoded covariates shows a 6 x 11 sparse Matrix of class "dgCMatrix".

[Figure: SHAP variable importance. Source: own elaboration using xgboost]
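
The article does not show the code behind this figure. Below is a minimal sketch of one way such SHAP importances could be obtained, using xgboost's built-in survival:cox objective and its predcontrib = TRUE SHAP output; the matrix construction and hyperparameters are assumptions, not the author's original code:

library(xgboost)
library(Matrix)

# Encode covariates as a sparse model matrix, dropping outcome columns/intercept
X <- sparse.model.matrix(~ . - time - status - 1, data = data.train)
# survival:cox expects the label to be the time, negated for censored rows
y <- ifelse(data.train$status == 1, data.train$time, -data.train$time)

bst <- xgboost(data = X, label = y, objective = "survival:cox",
               nrounds = 100, verbose = 0)

# Per-observation SHAP contributions; mean |SHAP| per column ranks the variables
shap <- predict(bst, newdata = X, predcontrib = TRUE)
sort(colMeans(abs(shap[, colnames(X)])), decreasing = TRUE)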

4. Traditional Models

4.1 Kaplan-Meier and Cox Models

fit3 <- survfit(Surv(time, status) ~ 1, data = data.train)          # Kaplan-Meier estimator
fit4 <- coxph(Surv(time, status) ~ ., data = data.train, x = TRUE)  # Cox PH on all covariates
[Figure. Source: own elaboration]
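
Assuming the figure shows the Kaplan-Meier curve, it can be reproduced with the base plot method for survfit objects:

# Kaplan-Meier survival curve with its 95% confidence band
plot(fit3, xlab = "Time", ylab = "Survival probability", conf.int = TRUE)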

5. Machine Learning Models

5.1 MTLR Model

library(MTLR)

# Multi-task logistic regression over 9 discrete time intervals
fit5 <- mtlr(Surv(time, status)~., data = data.train, nintervals = 9)

5.2 Survival Random Forest

library(randomForestSRC)

# Random survival forest with default hyperparameters
fit6 <- rfsrc(Surv(time, status) ~ ., data.train)

5.3 DeepSurv Model

library(survivalmodels)  # deepsurv() wraps the Python pycox library via reticulate

fit7 <- deepsurv(data = data.train, frac = 0.5, activation = "relu",
                 num_nodes = c(4L, 8L, 4L, 2L), dropout = 0.3, early_stopping = TRUE,
                 batch_size = 32L, epochs = 100L)

5.4 Survival Kernel SVM

library(survivalsvm)

# Regression-type survival SVM; note the kernel used here is linear ("lin_kernel")
fit8 <- survivalsvm(Surv(time, status) ~ ., data = data.train, type = "regression",
                    gamma.mu = 1, opt.meth = "quadprog", kernel = "lin_kernel")

6. Models Ranking List

##     Cindex      Model
## 1 0.745106   DeepSurv
## 2 0.730064 RandForest
## 3 0.547318  KernelSVM
## 4 0.534930       MTLR
## 5 0.529489        Cox
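
The evaluation code is not shown in the article. A plausible sketch of how the test-set C-indexes could be computed, illustrated for the Cox model and the random survival forest (an assumption, not the author's original code), uses survival::concordance():

library(survival)

# Cox model: concordance() accepts the fitted model and new data directly
concordance(fit4, newdata = data.test)$concordance

# For models returning a risk score, rank predicted risks against the observed
# outcomes; reverse = TRUE because a higher risk implies shorter survival
risk <- predict(fit6, newdata = data.test)$predicted
concordance(surv_obj ~ risk, reverse = TRUE)$concordance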

7. Models Ranking Chart

[Figure: models ranked by C-index. Source: own elaboration]

8. Conclusion

Why do DeepSurv and Random Forest outperform Cox, MTLR, and KernelSVM, which, with C-indexes near 0.5, look like flipping a coin?

DeepSurv is a deep learning framework designed for survival analysis, and it has been shown to outperform traditional methods such as Kernel SVM and MTLR (Multi-Task Logistic Regression) in many cases; in this experiment it also edges out the Random Survival Forest.

The reason for DeepSurv's strong performance lies in its ability to capture complex nonlinear relationships between input features and survival outcomes. In contrast, the Cox model, the linear-kernel SVM, and MTLR assume an essentially linear or low-dimensional relationship between features and outcomes, which can be limiting in complex datasets.

Furthermore, DeepSurv is specifically designed for survival analysis, which means it can handle censored data (i.e., instances where the event is not observed within the study window, so the survival time is only partially known), whereas off-the-shelf machine learning methods may not handle such data effectively.

As for why Cox, Kernel SVM, and MTLR behave like a coin flip here (C-index near 0.5), the likely cause is their limited ability to capture the complex relationships present in this simulated survival data. When the relationship between features and outcomes is linear or low-dimensional, these methods can perform well; in more complex datasets, however, they may fail to capture the relevant structure, resulting in poor performance.

In summary, DeepSurv's ability to capture complex relationships and handle censored data makes it the strongest choice in this experiment, followed closely by the Random Survival Forest, while the Cox model, Kernel SVM, and MTLR struggle on this dataset.

9. Discussion

Suppose we now generate a data set with 80% of the observations censored (a sketch of this step is shown below). The models change their relative performance, as the following chart shows:
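
A minimal sketch of that regeneration step, reusing the simulation call from Section 2 (the object name df2 is illustrative):

# Same simulation settings as before, but with 80% of observations censored
df2 <- sim.survdata(N = 2000, T = 250, xvars = 10, censor = .8,
                    num.data.frames = 1)$data
df2 <- df2 %>% rename(time = y, status = failed) %>% mutate(status = as.integer(status))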

[Figure: models ranked by C-index with 80% censoring. Source: own elaboration]

What is the explanation?

When the proportion of censored data is high (such as 80%), traditional survival analysis models like Cox proportional hazards or accelerated failure time models may not perform well, as they assume that censoring is non-informative, i.e., that the censoring time is unrelated to the underlying survival time. This assumption is often violated when there is a high proportion of censored data.

Machine learning models like Random Forest and DeepSurv can handle censored data and do not require assumptions about the distribution of survival times. However, the models differ in their approach and strengths.

Random Forest is a non-parametric method that builds multiple decision trees and combines their predictions into a final prediction. It can handle missing data, capture complex interactions between predictors, and accommodate censored observations without requiring any assumptions about the distribution of survival times.

DeepSurv, on the other hand, is a neural-network approach specifically designed for survival analysis. It uses the Cox proportional hazards model as its foundation and replaces the linear risk function with a neural network to capture non-linear relationships. The network is trained by minimizing the negative Cox partial log-likelihood, and its performance is typically evaluated with the concordance index (C-index), a measure of how well the model ranks individuals' survival times.
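
For reference, the training objective in the original DeepSurv paper (Katzman et al., 2018) is the regularized average negative Cox partial log-likelihood, where $E_i$ is the event indicator, $\hat{h}_\theta(x_i)$ the network's risk score, $N_{E=1}$ the number of observed events, and $\mathcal{R}(T_i)$ the set of individuals still at risk at time $T_i$:

$$
l(\theta) = -\frac{1}{N_{E=1}} \sum_{i:\,E_i=1} \Big( \hat{h}_\theta(x_i) - \log \sum_{j \in \mathcal{R}(T_i)} e^{\hat{h}_\theta(x_j)} \Big) + \lambda \lVert \theta \rVert_2^2
$$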

While both Random Forest and DeepSurv can handle censored data, which model is better depends on the specific dataset and the research question. In general, Random Forest may be preferred when the dataset has many predictors, some with non-linear relationships to the outcome, and when the research question does not require interpreting individual predictor effects.

In the case of a high proportion of censored data (such as 80%), Random Forest may be better than DeepSurv as it does not require any assumptions about the distribution of survival times and can handle complex interactions between predictors.

其他会员也浏览了