What is checkpointing?

Ritu A.

Reducing time-to-solutions

Published: January 18, 2023

What is Checkpointing?

Checkpointing is the process of periodically saving (or writing) the execution state of an application so that, if the execution is interrupted, the saved state can be used to continue the execution at a later time. Typically, the execution state is written to a file. Resuming the execution of an application from a previously saved state or checkpoint (instead of starting it from scratch) is referred to as the Restart phase.

What are the advantages of checkpointing?

Checkpointing not only saves time by offering the capability to resume the execution of an application in case of a hardware failure in the underlying computing platform (e.g., network interconnect failure) or if the computing platform becomes unavailable due to emergency maintenance, but it also helps in overcoming the time-limits associated with the different job queues/partitions.

What are the different types of checkpointing?

The different types of checkpointing include system-level checkpointing, application-level checkpointing, and user-level or library-level checkpointing.

System-Level checkpointing involves taking core-dumps of the computational state of the machine or system on which the application is running.

Pros: It is convenient to use: no code changes are needed, and the user only specifies the checkpointing frequency.

Cons: The checkpoints have a large memory footprint because the entire execution state of the application and of the operating system processes is saved, and system-administrator privileges are needed to install the additional code.

Example: Berkeley Lab Checkpoint/Restart (BLCR)

Library-Level or User-Level Checkpointing involves the use of libraries for taking checkpoints while being agnostic to kernel-level information such as process IDs.

Pros: It is useful for checkpointing applications without requiring any changes to the source-code or the operating system kernel.

Cons: The users may need to load the checkpointing library before starting their applications, and then, would need to dynamically link the loaded library to their applications. The checkpoints can have a large memory-footprint.

Example: DMTCP

Application-Level Checkpointing involves implementing the checkpoint-and-restart mechanism within the application itself. An efficient implementation of application-level checkpointing would require saving and reading the state of only those variables or data that are necessary for recreating the state of the entire application. Such variables or data are referred to as critical variables/data. As an example, consider the C code below (definition of myFct function is not included below).

#include <math.h>

/* myFct and randomNumber are assumed to be defined elsewhere. */
int main() {
    int x = 4;
    int y = sqrt(x);
    int z = 0, i;   /* z is initialized so that the accumulation below is well defined */
    int j = x * y;
    for (i = 0; i < 100; i++) {
        z += j * myFct(randomNumber * i);
    }
    return 0;
}

In this code, "i" and "z" are critical variables as their values are updated and cannot be derived easily to recreate the execution state of the code once it is interrupted.
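As a rough illustration of how the loop above could be made restartable, the sketch below saves only the critical variables "i" and "z" and reads them back on startup. It is a minimal sketch under stated assumptions, not the implementation from the author's repository: the names save_checkpoint, load_checkpoint, run, and stop_at, the plain-text file layout, and the stand-in body of myFct are all hypothetical, and replacing a file with rename() is assumed to behave as it does on POSIX systems.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the article's myFct, whose definition is not shown. */
int myFct(int v) { return v % 7; }

/* Write the critical variables i and z to `path`. Returns 0 on success.
   The state goes to a temporary file first and is then renamed, so an
   interruption during the write does not destroy the previous checkpoint
   (assumes POSIX rename(), which replaces an existing file). */
int save_checkpoint(const char *path, int i, int z) {
    char tmp[256];
    assert(path != NULL);
    if (snprintf(tmp, sizeof tmp, "%s.tmp", path) >= (int)sizeof tmp) return -1;
    FILE *f = fopen(tmp, "w");
    if (!f) return -1;
    int ok = fprintf(f, "%d %d\n", i, z) > 0;
    ok = (fclose(f) == 0) && ok;
    if (!ok) { remove(tmp); return -1; }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Read i and z back. Returns 0 on success, -1 if no usable checkpoint exists. */
int load_checkpoint(const char *path, int *i, int *z) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    int n = fscanf(f, "%d %d", i, z);
    fclose(f);
    return n == 2 ? 0 : -1;
}

/* The article's loop with checkpoint-and-restart logic added.
   A checkpoint is written every `every` iterations. `stop_at` simulates an
   interruption after that many iterations; pass 100 to run to completion. */
int run(const char *path, int randomNumber, int every, int stop_at) {
    int x = 4;
    int y = 2;          /* sqrt(x) for x = 4; cheap to recompute, so not checkpointed */
    int j = x * y;      /* derived from x and y, so not checkpointed either */
    int i = 0, z = 0;

    /* Restart phase: resume from the saved state if there is one. */
    if (load_checkpoint(path, &i, &z) != 0) { i = 0; z = 0; }

    for (; i < stop_at && i < 100; i++) {
        /* Save the state *before* iteration i runs, so a restart redoes it. */
        if (every > 0 && i % every == 0) save_checkpoint(path, i, z);
        z += j * myFct(randomNumber * i);
    }
    return z;
}
```

Calling run("state.txt", 3, 10, 57) and then run("state.txt", 3, 10, 100) mimics a job that is interrupted partway and later restarted: the second call picks up at the last saved iteration instead of at zero. Only two integers are written per checkpoint, which is why this approach can keep checkpoint files small.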

Pros: Application-level checkpointing does not rely on the availability of any external libraries or tools, and hence, is useful for writing portable applications.

Cons: An efficient implementation of this technique generates checkpoints with a smaller memory footprint and incurs lower I/O overhead than the other types of checkpointing. However, the onus is on the user (or the developer) to implement it manually for each application, so users must understand the code of the applications they are checkpointing well enough to reengineer it and insert the checkpoint-restart logic.

In the case of distributed (message-passing or MPI) applications, a checkpoint can be written as a "central checkpoint" involving a single process (typically the root, master, or manager process in the MPI world) or as a distributed checkpoint (involving multiple processes and appropriate parallel I/O API calls and strategy).

What are the side-effects of checkpointing?

Writing and reading the application states or checkpoints introduces additional I/O overhead. Depending on the frequency of checkpointing and the size of the checkpoint files, this overhead can noticeably increase the run-time and storage needs of an application.
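To get a feel for how frequency and checkpoint size translate into cost, the sketch below does the arithmetic for a simple first-order model. The model itself is an assumption made for illustration (each write blocks the application for a fixed time, and only a fixed number of checkpoint files is kept), and the names ckpt_cost and estimate_cost are hypothetical; measured overheads on a real file system can differ.

```c
#include <assert.h>

/* Rough, first-order estimate (an assumption of this sketch, not a measured
   model): one checkpoint is written every `interval_s` seconds of compute,
   and each write blocks the application for `write_s` seconds. */
typedef struct {
    long   checkpoints;   /* number of checkpoints written */
    double added_time_s;  /* extra run time spent writing them */
    double storage_gb;    /* disk space used by the retained checkpoints */
} ckpt_cost;

ckpt_cost estimate_cost(double compute_s, double interval_s,
                        double write_s, double size_gb, long kept) {
    ckpt_cost c;
    assert(interval_s > 0.0 && kept >= 0);
    c.checkpoints  = (long)(compute_s / interval_s);
    c.added_time_s = c.checkpoints * write_s;
    /* Storage grows with the number of files retained, not the number written. */
    c.storage_gb   = (kept < c.checkpoints ? kept : c.checkpoints) * size_gb;
    return c;
}
```

For example, estimate_cost(86400, 3600, 30, 2.0, 2) describes a 24-hour job that checkpoints hourly, takes 30 seconds per write, and keeps the two most recent 2 GB files: it yields 24 checkpoints, 720 seconds of added run time (under 1% of the job), and 4 GB of storage. Halving the interval doubles the added time under this model, which is the trade-off to weigh against how much work would be lost on a failure.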

Do you have sample code?

Here is the link to the GitHub repository containing sample C++ code with checkpoint-and-restart features built in: bsswfellowship/checkpointing at main · ritua2/bsswfellowship (github.com)

References

Arora, R., Bangalore, P. & Mernik, M. A technique for non-invasive application-level checkpointing. J Supercomput 57, 227–255 (2011). https://doi.org/10.1007/s11227-010-0383-5
Ritu Arora, Trung Nguyen, "ITALC: Interactive Tool for Application-Level Checkpointing", HUST17 workshop at SC17, November 2017.
