
Survival Curves in KNIME with Python and Plotly

John D.

Published: May 26, 2023

BLUF: Here is a handy Python Lifelines and Plotly powered #knime component to make Kaplan-Meier plotting and analysis easy.

KNIME: Download it free here. The component in this article requires v4.7+

KM Curve Component: Download it free here. (I am finishing a version that will work with KNIME <4.7).

Further Reading:

Read: Chapters 3 & 6

4. Marketing Analytics

Read: Chapter 8

5. Survival Analysis Part I: Basic Concepts

6. Inspiration for the KNIME component here.

**Update** After publishing this article I saw fit to update the component. It now includes options to select the desired logrank test and the weighting. I will work Fleming-Harrington in as soon as I can.

Figure: Drop-downs for logrank and weighting choices

It’s been some time since I’ve posted and even longer since I’ve written anything about KNIME. Due to hosting issues, my website knime.tips is down as I evaluate options. In the meantime, I wanted to share a tool I’ve been working on for a little while in hopes that someone might find it useful.

Disclaimer up front: this article is not designed to provide a crash course in survival analysis. If you want to learn more about the topic, please see the resources I've put together above in the 'further reading' section. Those are fantastic primary sources.

Survival analysis is an analytic approach that addresses questions about “time until an event occurs” (Klein p. 4). Although it is most often used in the medical field, survival analysis can be applied to many kinds of problems, including those in Human Resources, Marketing and others. Outputs of survival analysis “...give the probability that a study subject survives past a specified time”, with a plotted Kaplan-Meier curve depicting this relationship in an easy-to-interpret fashion (Klein p. 63). There are many types of survival analysis, with Kaplan-Meier being the simplest and most straightforward of the options. Other options include Cox Regression, Random Survival Forests and many more.
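To make the "probability of surviving past a specified time" idea concrete, here is a minimal hand-rolled sketch of the Kaplan-Meier product-limit calculation on a tiny made-up dataset. It is for illustration only and is not the component's code; the function name `kaplan_meier` and the toy values are assumptions, and in practice Lifelines does this work for you.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    # Number of events (status == 1) observed at each time.
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    # Every subject leaves the risk set at its own time, event or censored.
    exits = Counter(durations)

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        if deaths[t]:
            # Product-limit step: multiply by the share of at-risk subjects that survive t.
            survival *= 1 - deaths[t] / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Toy data (made up): 1 = event occurred, 0 = censored.
durations = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]

km_curve = kaplan_meier(durations, events)
for time, probability in km_curve:
    print(f"t={time}: S(t)={probability:.2f}")
```

Censored subjects never trigger a drop in the curve, but they do shrink the risk set, which is why later events cause larger steps.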

KNIME already offers a Kaplan-Meier Estimator node that does a fine job and outputs survival curves. You can find it here. I wanted to take this a step further and leverage Python Lifelines and Plotly to generate more detailed outputs and interactivity, something that I have not seen anywhere on the web as of yet.

To get this working, you of course need the Python Lifelines and Plotly libraries installed and Python configured in KNIME. If you have not configured Python in KNIME, please see this detailed integration guide. Once your Python environment is ready, drag the component to your workflow, connect the Conda Environment Propagation node and connect some data to it.

Since you already have Lifelines installed you can use a Python Script node to pull in some test data like the Rossi arrest recidivism dataset. You can load the data with the code below:

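import knime.scripting.io as knio
from lifelines.datasets import load_rossi

# Load the Rossi recidivism dataset as a pandas DataFrame.
input_table = load_rossi()

# Send the DataFrame to the first output port of the Python Script node.
knio.output_tables[0] = knio.Table.from_pandas(input_table)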

If you are using your own data, ensure that your survival status column is encoded so that 1 indicates a failure occurred and 0 indicates no failure (the observation is censored).

Additionally, ensure that the time column is a continuous integer or double value.
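If your own data does not arrive in that shape, a small pandas step before the component can fix it. The sketch below is only an illustration: the column names (`status`, `tenure_months`, `department`) and the "quit"/"active" labels are made up, so substitute your own.

```python
import pandas as pd

# Hypothetical raw data: status stored as text, durations stored as strings.
raw = pd.DataFrame({
    "status": ["quit", "active", "quit", "active"],
    "tenure_months": ["12", "30", "7.5", "18"],
    "department": ["A", "A", "B", "B"],
})

prepared = raw.copy()

# 1 = the event (failure) occurred, 0 = no event observed (censored).
prepared["event"] = (prepared["status"] == "quit").astype(int)

# The time column must be numeric (integer or double).
prepared["tenure_months"] = pd.to_numeric(prepared["tenure_months"])

# Quick sanity checks before handing the table to the component.
assert set(prepared["event"].unique()) <= {0, 1}
assert pd.api.types.is_numeric_dtype(prepared["tenure_months"])

print(prepared[["department", "tenure_months", "event"]])
```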

Drag the component into your workflow, connect your data and double-click the component to bring up the options.

For Variable choose the column that has the classes you want to plot survival curves for.

Select your survival status column - this column has the binary flag to indicate failure/not failure, survival/not survival, quit/not quit etc.


Select your time column. Again, this column can be a double or integer and represents the timeframe you are working within (you should know if it's weeks, months, years etc.).

Enter the values you would like for the x-label, y-label and title of the output plots.

Provide an alpha value of your choice. This controls the width of the shaded confidence intervals on the output Kaplan-Meier plots: alpha is the significance level, so in Lifelines an alpha of 0.05 produces 95% confidence intervals. You may want to change this value based on your specific project and/or the domain you are operating within.

The plot resolution is a value that allows you to control the number of time points calculated in the curves. The value is passed to kmf.fit() via the timeline argument. At the time of this writing, if you pass 0 into this field, kmf.fit() will use its built-in default behavior for calculating the time points.

Finally, if you are interested in only seeing a subsection of the plotted curves you can enter your start and end choices here. If the checkbox for Max End Time For Timeline? is selected, the entire plot will generate. This only impacts the static plot generated from the Lifelines package.

As a note, Lifelines is an incredibly full-featured and robust library. There is a lot I did not include in this component for the sake of ease of use. It is fairly easy to add additional capabilities specific to your problem set. There are also many more static plot parameters that you can tweak that I did not provide access to, such as transparency choices, lines, legends, etc. I encourage you to explore the Lifelines documentation to learn more.

When you right-click the node and select Execute and Open Views, a dashboard is generated with two plots and a few tables of data.

The Plotly output is on the left. It is fully interactive and allows you to select and de-select the classes in the legend. I built it to mirror the static plot as closely as possible in look and feel. You are welcome to explore the code and adjust transparencies and lines as you see fit.

*Note* The big difference between my Plotly code and the code in the inspiration link at the top of this article is that the Plotly KM curves are generated directly in Plotly rather than by converting an existing matplotlib plot.

As a convenience, the component automatically creates a temporary folder on your computer and outputs the Plotly graph as an HTML file with the title of the plot as the filename. The dashboard also displays the path below the Plotly graph so you can easily find the file in the temp folder.

The dashboard provides some statistics on the compared classes and a case processing summary. The statistics table is generated as an output of the Lifelines logrank test you selected.

The case processing summary is a nice rollup of the class data with event and censoring stats. This component is designed for right censored data. If you would like to learn more about censoring, I suggest exploring the featured reading at the top of this article.
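A case processing summary of this kind can be reproduced with a pandas group-by. The sketch below uses a tiny made-up table and my own column names (`total`, `events`, `censored`, `pct_censored`); the component's actual table layout may differ.

```python
import pandas as pd

# Toy data (made up): 1 = event occurred, 0 = right-censored.
data = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B"],
    "event": [1, 0, 1, 0, 0],
})

# Count subjects and events per class, then derive the censoring statistics.
summary = data.groupby("group")["event"].agg(total="count", events="sum")
summary["censored"] = summary["total"] - summary["events"]
summary["pct_censored"] = (100 * summary["censored"] / summary["total"]).round(1)

print(summary)
```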

Finally, a survival table is provided that includes timeline and class columns. If data is available, you will see the % probability of survival to the time specified in the timeline column.

For additional ease of use, all of the tables are output from the component, along with an SVG version of the static Kaplan-Meier plot. Additionally, the temporary file location is passed out of the component as a flow variable.

What's next? Survival analysis is a rich and interesting area of analytics. There is a lot to learn and as the data problems become more complex, so does the modeling.

There is a lot I did not cover here, but this is a great starting point in your survival analysis journey. I am working on making the interactive Plotly Kaplan-Meier plot its own Python package to elevate the functionality further. In the meantime, I'm also working on a few other survival analysis specific components that are at various levels of refinement. I hope to write about these soon!

I hope you find some use in this component and if you find any bugs or issues please feel free to reach out and let me know!

Until next time!

-John

#plotly #python #knime #survival_analysis

