Survival Curves in KNIME with Python and Plotly
BLUF: Here is a handy Python Lifelines and Plotly powered #knime component to make Kaplan-Meier plotting and analysis easy.
KNIME: Download it free here. The component in this article requires v4.7+
KM Curve Component: Download it free here. (I am finishing a version that will work with KNIME <4.7).
Further Reading:
Read: Chapters 3 & 6
Read: Chapter 8
6. Inspiration for the KNIME component here.
**Update** After publishing this article I saw fit to update the component. It now includes options to select the desired logrank test and the weighting. I will work Fleming-Harrington in as soon as I can.
It’s been some time since I’ve posted and even longer since I’ve written anything about KNIME. Due to hosting issues, my website knime.tips is down as I evaluate options. In the meantime, I wanted to share a tool I’ve been working on for a little while in hopes that someone might find it useful.
Disclaimer upfront, this article is not designed to provide a crash course in survival analysis. If you want to learn more about the topic please see the resources I've put together above in the 'further reading' section. Those are fantastic primary sources.
Survival Analysis is an analytic approach that addresses questions around “time until an event occurs” (Klein p. 4). Generally used in the medical field, survival analysis can be applied to many kinds of problems, including those in Human Resources, Marketing and others. Outputs of survival analysis “...give the probability that a study subject survives past a specified time”, with a plotted Kaplan-Meier curve depicting this relationship in an easy-to-interpret fashion (Klein p. 63). There are many types of survival analysis, with Kaplan-Meier being the simplest and most straightforward of the options. Other options include Cox Regression, Random Survival Forests and many more.
KNIME already offers a Kaplan-Meier Estimator node that does a fine job and outputs survival curves. You can find it here. I wanted to take this a step further and leverage Python Lifelines and Plotly to generate more detailed outputs and interactivity, something that I have not seen anywhere on the web as of yet.
To get this working, you of course need the Python Lifelines and Plotly libraries installed and Python configured in KNIME. If you have not configured Python in KNIME, please see this detailed integration guide. Once your Python environment is ready, drag the component to your workflow, connect the Conda Environment Propagation node and connect some data to it.
Since you already have Lifelines installed you can use a Python Script node to pull in some test data like the Rossi arrest recidivism dataset. You can load the data with the code below:
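import knime.scripting.io as knio
from lifelines.datasets import load_rossi

# Load the Rossi recidivism dataset as a pandas DataFrame
input_table = load_rossi()
# Send it to the node's first output port as a KNIME table
knio.output_tables[0] = knio.Table.from_pandas(input_table)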
If you are using your own data, ensure that your survival status column is encoded so that 1 indicates a failure occurred and 0 indicates no failure.
Additionally, ensure that the time column is a numeric (integer or double) value.
Drag the component into your workflow, connect your data and double-click the component to bring up the options.
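As a rough illustration of the encoding described above, here is a minimal pandas sketch. The table, the column names (group, status, weeks) and the text labels are all hypothetical placeholders; inside KNIME you would start from your own input table instead.

```python
import pandas as pd

# Hypothetical example data; column names and labels are placeholders.
df = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "status": ["quit", "stayed", "quit", "stayed"],
    "weeks": ["12", "30", "7", "52"],
})

# Encode the event flag: 1 = failure occurred, 0 = no failure.
df["status"] = df["status"].map({"quit": 1, "stayed": 0}).astype("int64")

# Coerce the time column to a numeric type; unparseable values become NaN.
df["weeks"] = pd.to_numeric(df["weeks"], errors="coerce")

# Simple sanity checks before handing the table to the component.
assert set(df["status"].unique()) <= {0, 1}
assert pd.api.types.is_numeric_dtype(df["weeks"])
assert df["weeks"].notna().all() and (df["weeks"] >= 0).all()
```

If any of the checks fail, it is usually easier to fix the data upstream than to debug the plots afterwards.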
For Variable choose the column that has the classes you want to plot survival curves for.
Select your survival status column - this column has the binary flag to indicate failure/not failure, survival/not survival, quit/not quit etc.
Select your time column. Again, this column can be a double or integer and represents the timeframe you are working within (you should know if it's weeks, months, years etc.).
Enter the values you would like for the x-label, y-label and title of the output plots.
Provide an alpha value of your choice. Alpha is the significance level behind the confidence intervals shaded on the output Kaplan-Meier plots; for example, 0.05 produces 95% confidence intervals. You may want to change this value based on your specific project and/or the domain you are operating within.
The plot resolution controls the number of time points calculated in the curves. The value is passed to kmf.fit() through the timeline argument. At the time of this writing, if you enter 0 in this field, kmf.fit() uses its built-in default behavior for choosing the time points.
Finally, if you are interested in only seeing a subsection of the plotted curves you can enter your start and end choices here. If the checkbox for Max End Time For Timeline? is selected, the entire plot will generate. This only impacts the static plot generated from the Lifelines package.
As a note, Lifelines is an incredibly full-featured and robust library. There is a lot I did not include in this component for the sake of ease of use, and it is fairly easy to add capabilities specific to your problem set. There are also many more static plot parameters that I did not provide access to, such as transparency, line and legend options. I encourage you to explore the Lifelines documentation to learn more.
When you right click the node and select Execute and Open Views, a dashboard is generated with two plots, and a few tables of data.
The Plotly output is on the left. It is fully interactive and allows you to select and de-select the classes in the legend. I built it to mirror the static plot as closely as possible in look and feel. You are welcome to explore the code and adjust transparencies and lines as you see fit.
*Note* The big difference between my Plotly code and the code in the inspiration link at the top of this article is that the KM curves are generated natively in Plotly rather than converted from an existing matplotlib plot.
As a convenience, the component automatically creates a temporary folder on your computer and outputs the Plotly graph as an HTML file with the title of the plot as the filename. The dashboard also displays the path below the Plotly graph so you can easily find the file in the temp folder.
The dashboard provides some statistics on the compared classes and a case processing summary. The statistics table is generated as an output of the Lifelines logrank test you selected.
The case processing summary is a nice rollup of the class data with event and censoring stats. This component is designed for right censored data. If you would like to learn more about censoring, I suggest exploring the featured reading at the top of this article.
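A rollup like this can be approximated with a pandas groupby. The sketch below uses a tiny made-up table with hypothetical column names (group, status, weeks); the column layout of the component's own summary may differ.

```python
import pandas as pd

# Hypothetical right-censored data: status 1 = event, 0 = censored.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B"],
    "status": [1, 0, 1, 0, 0],
    "weeks": [5, 12, 7, 20, 15],
})

# Per class: total cases, number of events, number and percent censored.
summary = (
    df.groupby("group")["status"]
    .agg(total="count", events="sum")
    .assign(censored=lambda d: d["total"] - d["events"])
    .assign(pct_censored=lambda d: (100 * d["censored"] / d["total"]).round(1))
)

print(summary)
```

A class with a very high censoring percentage has few observed events, so its curve and any test involving it deserve extra caution.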
Finally, a survival table is provided that includes timeline and class columns. If data is available, you will see a % probability of survival to the time specified in the timeline column.
For additional ease of use, all of the tables are output from the component, along with an SVG version of the static Kaplan-Meier plot. Additionally, the temporary file location is passed out of the component as a flow variable.
What's next? Survival analysis is a rich and interesting area of analytics. There is a lot to learn and as the data problems become more complex, so does the modeling.
There is a lot I did not cover here, but this is a great starting point in your survival analysis journey. I am working on making the interactive Plotly Kaplan-Meier plot its own Python package to elevate the functionality further. In the meantime, I'm also working on a few other survival analysis specific components that are at various levels of refinement. I hope to write about these soon!
I hope you find some use in this component and if you find any bugs or issues please feel free to reach out and let me know!
Until next time!
-John