Checkpointing in Python

Ritu A.

Reducing time-to-solutions

Published: January 25, 2023

Certain research and exploratory work may require running software applications for several days or weeks. Even when such applications are exempted from the fair-use limits of the underlying shared computing platforms and allowed to run for longer than normal durations, it is not uncommon to see them interrupted for unplanned reasons. On certain occasions, we have seen long-running applications crash due to unforeseen hardware issues that were triggered by the applications themselves. On other occasions, we had to urgently apply firmware updates or security-related system patches, which meant announcing emergency maintenance and taking the HPC system offline, thereby terminating all the applications running on the system at that time. On yet another occasion, we saw some long-running jobs terminated due to the actions of other customers on the shared HPC platform: a single customer generating heavy, suboptimal I/O can bring down the filesystem for the entire customer base. For these reasons, it is all the more important to make your applications write checkpoints at an appropriate frequency, so that you do not lose all of your progress and can restart the application from the latest checkpoint at a later stage.

Some exploratory workflows/pipelines also require a human in the loop to repeat certain steps selectively and iteratively, which in turn requires the ability to checkpoint each step in the workflow/pipeline separately.

In a previous article, we reviewed the basics of checkpointing: What is checkpointing? | LinkedIn. A sample C++ code demonstrating the checkpointing and restart capability was also shared through the following GitHub repository: bsswfellowship/checkpointing at main · ritua2/bsswfellowship (github.com)

In this article, we will review the steps for adding checkpointing (save and restart) capabilities to Python code. Python supports object serialization and deserialization through its pickle module; detailed information is available at the following link: pickle — Python object serialization — Python 3.11.1 documentation. A simple Python code is shown below to demonstrate how pickle can be used to implement the checkpointing and restart (or save and restart) functionality. This code is also available for download through the following GitHub repository: bsswfellowship/chkpt.py at main · ritua2/bsswfellowship (github.com)

import os

# some useful code before this line

In line # 14 above, we use pickle.dump to convert the data - here the value of 'a' - into a byte stream, and this byte stream is written to a file named "ckptfile.pickle" that was opened for writing in line # 13. Note that we are doing binary I/O here.

In lines # 18-22 of the code above, we check whether a file named "ckptfile.pickle" already exists. If the file exists - that is, a checkpoint was written and the file containing it is still available - we initialize 'start' to the value stored in the file. However, if the file does not exist - because the checkpoint may not have been written or the checkpoint file may have been deleted - 'start' is set to 0 and the code executes from the beginning.
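The line numbers in the two paragraphs above refer to the author's chkpt.py listing. As a rough, independent sketch of the same save-and-restart pattern (not the author's code), the example below writes the last completed value to "ckptfile.pickle" with pickle.dump and reads it back with pickle.load on the next run. The function names save_checkpoint, load_checkpoint, and count, and the interrupt_at parameter that stands in for ctrl+c, are hypothetical and introduced only for illustration.

```python
import os
import pickle

CKPT_FILE = "ckptfile.pickle"  # file name used in the article


def save_checkpoint(value, path=CKPT_FILE):
    """Serialize `value` to `path` as a byte stream (binary I/O)."""
    with open(path, "wb") as f:
        pickle.dump(value, f)


def load_checkpoint(path=CKPT_FILE, default=0):
    """Return the checkpointed value, or `default` if no checkpoint file exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return default


def count(limit, path=CKPT_FILE, interrupt_at=None):
    """Print the numbers after the last checkpoint up to `limit`.

    `interrupt_at` is a hypothetical hook that simulates ctrl+c so the
    restart behaviour can be exercised without a keyboard.
    """
    start = load_checkpoint(path)  # 0 when there is no checkpoint
    printed = []
    try:
        for a in range(start + 1, limit + 1):
            if a == interrupt_at:
                raise KeyboardInterrupt
            print(a)
            printed.append(a)
            save_checkpoint(a, path)  # record the last completed value
    except KeyboardInterrupt:
        pass  # progress up to the interruption is already on disk
    return printed
```

Under these assumptions, calling count(5, interrupt_at=3) prints 1 and 2 and leaves 2 in the checkpoint file; a second call to count(5) then resumes at 3 instead of 1. Checkpointing after every iteration keeps the sketch short; a real application would usually write less often to limit I/O.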

The steps shown below demonstrate how to run the code, interrupt it to generate the checkpoint file, and proceed either from the checkpoint or normally.

# To run the Python code shown above

$ python chkpt.py

For testing, interrupt the running code by typing ctrl+c

^CWould you like to continue running the code? Type the letter y for yes.

# After running the code for the first time, you will see that ckptfile.pickle has been generated

$ ls

chkpt.py ckptfile.pickle

# Let us remove ckptfile.pickle

$ rm ckptfile.pickle

$ ls

chkpt.py

# Let us run the code again. It will start from the beginning because we removed the file
# named ckptfile.pickle. We will interrupt the code while it is running using ctrl+c,
# and then restart

$ python chkpt.py

For testing, interrupt the running code by typing ctrl+c

^CWould you like to continue running the code? Type the letter y for yes.y

^CWould you like to continue running the code? Type the letter y for yes.

# As you notice from the values displayed above, instead of starting to print from
# 1, the code starts printing from the next number in the series that was printed
# before interruption

$ ls

chkpt.py ckptfile.pickle

# Now let us run the code again. Because ckptfile.pickle is present, the series
# in the example below begins from 22 and not 1. Before the code was
# interrupted as shown above, the last number that was printed was 21

$ python chkpt.py

For testing, interrupt the running code by typing ctrl+c

^CWould you like to continue running the code? Type the letter y for yes.
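The same pickle-based approach can be extended to the exploratory workflows mentioned earlier, where each step of a pipeline is checkpointed separately so that selected steps can be repeated. The sketch below is one possible way to do this, not part of chkpt.py: the helper run_step, its force flag, and the one-file-per-step naming scheme are all assumptions made for illustration.

```python
import os
import pickle


def run_step(name, func, inputs, ckpt_dir, force=False):
    """Run one pipeline step, reusing its saved result unless `force` is set.

    Each step gets its own checkpoint file, <ckpt_dir>/<name>.pickle
    (a hypothetical naming scheme).
    """
    path = os.path.join(ckpt_dir, name + ".pickle")
    if os.path.exists(path) and not force:
        with open(path, "rb") as f:
            return pickle.load(f)  # step already done: load its result
    result = func(inputs)
    with open(path, "wb") as f:
        pickle.dump(result, f)  # checkpoint this step's output
    return result
```

A step whose checkpoint file exists is skipped and its saved result is returned; passing force=True repeats only that step. Note that this sketch does not track dependencies, so when a step is rerun, the steps that consume its output must be forced as well.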

References

pickle — Python object serialization — Python 3.11.1 documentation
pickle - How to "stop" and "resume" long time running Python script? - Stack Overflow
