Soft Errors in VLSI

Kailash Prasad

Design Engineer @ Arm | PMRF | IIRF | nanoDC Lab | IIT Gandhinagar | NIT Arunachal Pradesh | Gold Medalist

Published: December 26, 2023

Soft errors in VLSI are a major challenge to the reliability of modern electronic systems. Soft errors are transient faults caused by energetic particles, such as cosmic-ray neutrons or alpha particles, striking the sensitive regions of semiconductor devices. These particles can generate charge carriers in the silicon substrate, which may alter the logic state of a transistor or a memory cell. Unlike hard errors, which permanently damage the device, soft errors are temporary and can be corrected by rewriting the affected data or resetting the circuit.

Soft errors can have a significant impact on the functionality and performance of VLSI circuits, especially in safety-critical applications such as automotive or healthcare systems. For example, a single bit flip in a processor register or a memory element can cause an incorrect computation, data corruption, or a system crash. Therefore, it is essential to design VLSI circuits that are robust and resilient to soft errors.
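
To make the bit-flip example concrete, here is a minimal Python sketch of a single-event upset in a register. The 8-bit register value and the choice of bit 6 are hypothetical, picked only for illustration.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted, modelling a single-event upset."""
    return value ^ (1 << bit)


# Hypothetical 8-bit register holding a loop bound of 10.
register = 0b00001010
upset = flip_bit(register, 6)  # a particle strike inverts bit 6
print(register, "->", upset)   # the program now reads 74 instead of 10
```

One flipped bit turns a loop bound of 10 into 74, which is the kind of silent data corruption the mitigation techniques below try to prevent.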

There are various methodologies to mitigate soft errors in VLSI circuits, ranging from device-level to system-level techniques. Some of the common methods are:

Device-level techniques: These techniques aim to reduce the sensitivity of the device to particle strikes, by using different materials, doping profiles, layouts, or shielding methods. For example, using silicon-on-insulator (SOI) technology can reduce the parasitic capacitance and the charge collection area of the device, thus lowering the probability of soft errors. Some other examples of device-level techniques are:

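- Increasing the oxide thickness to reduce the charge collection efficiency.
- Using low-resistivity substrates to increase the charge dissipation rate.
- Adding guard rings or guard bands around the sensitive nodes to collect stray charge and limit charge sharing between neighboring nodes.
- Using metal or dielectric layers to shield the device from external radiation. This mainly helps against alpha particles from packaging materials, since high-energy neutrons pass through practical shields.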
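
To see why collection area and stored charge matter so much at the device level, here is a rough sketch based on a commonly cited empirical relation in which the soft error rate scales with the sensitive area times exp(-Qcrit / Qs), where Qcrit is the critical charge of the node and Qs is the charge collection efficiency. Particle flux and the fitting constant are dropped, so only ratios are meaningful. The function name and all numbers are hypothetical.

```python
import math


def relative_ser(area_ratio: float, qcrit_fc: float, qs_fc: float) -> float:
    """Relative soft error rate: area * exp(-Qcrit / Qs), constants dropped."""
    return area_ratio * math.exp(-qcrit_fc / qs_fc)


# Hypothetical cells: hardening halves the sensitive area and raises
# the critical charge from 2 fC to 3 fC at the same collection efficiency.
baseline = relative_ser(1.0, qcrit_fc=2.0, qs_fc=1.0)
hardened = relative_ser(0.5, qcrit_fc=3.0, qs_fc=1.0)
print(f"hardened / baseline SER = {hardened / baseline:.3f}")
```

Because the dependence on critical charge is exponential in this model, a modest increase in Qcrit can matter more than a large reduction in area.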

Circuit-level techniques: These techniques aim to increase the robustness of the circuit to transient pulses, by using different logic styles, gate sizing, logic restructuring, or redundancy methods. For example, dual modular redundancy (DMR) detects soft errors by duplicating the circuit and comparing the two outputs, while triple modular redundancy (TMR) can also correct them by taking a majority vote over three copies. Some other examples of circuit-level techniques are:

- Using differential logic styles, such as dual-rail logic, to increase the noise immunity and the critical charge of the circuit.
- Using selective gate sizing to raise the drive strength and node capacitance (and hence the critical charge) of vulnerable gates, balanced against the delay and power cost.
- Using logic restructuring to minimize the number of critical paths and the logic depth of the circuit.
- Using error detection codes, such as parity or Berger codes, to identify the erroneous outputs without duplication.
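
The redundancy and parity ideas above can be sketched behaviorally in Python. This is a software model, not a circuit: the function names are my own, and the 8-bit word and flipped bit position are hypothetical.

```python
def dmr_check(a: int, b: int) -> bool:
    """DMR: two copies can only tell that something went wrong."""
    return a == b


def tmr_vote(a: int, b: int, c: int) -> int:
    """TMR: bitwise majority vote over three copies masks a single faulty copy."""
    return (a & b) | (b & c) | (a & c)


def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of ones."""
    return bin(word).count("1") % 2


golden = 0b10110010
faulty = golden ^ (1 << 3)  # one copy suffers a single bit flip

print(dmr_check(golden, faulty))                   # mismatch is detected
print(tmr_vote(golden, faulty, golden) == golden)  # the vote restores the value
print(parity_bit(golden) != parity_bit(faulty))    # parity flags the flip
```

Note the trade-off: DMR and parity only detect the error, so they need a recovery mechanism behind them, while TMR corrects it at roughly three times the area and power.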

Architecture-level techniques: These techniques aim to improve the reliability of the system by using error detection and correction (EDAC) schemes. Parity, checksums, and cyclic redundancy checks (CRC) detect errors, while error correcting codes (ECC) can also correct them. For example, ECC can correct soft errors in memory modules by adding extra check bits to the data and using a decoder to recover the original data. Some other examples of architecture-level techniques are:

- Using redundant execution, such as lockstep, to compare the results of multiple processors or cores, combined with rollback-recovery to recover from faults.
- Using checkpointing and rollback, with shadow registers or history buffers, to save the state of the system periodically and restore it in case of errors.
- Using periodic or on-demand scrubbing or refreshing to detect and correct errors in memory cells before they accumulate or propagate.
- Using software or hardware fault injection or emulation to test the system under different fault scenarios and evaluate its reliability. This validates the mitigation rather than providing it.
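
As an illustration of how ECC recovers data, here is a minimal sketch of a Hamming(7,4) code, which protects 4 data bits with 3 check bits and corrects any single-bit error. Real memory ECC typically uses wider codes, so treat this as a toy model; the function names are my own.

```python
def hamming74_encode(nibble: int) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 is unused so indices match bit positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    return code[1:]


def hamming74_decode(bits: list[int]) -> tuple[int, int]:
    """Return (corrected data, syndrome); syndrome is the flipped position or 0."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions holding a 1
    if syndrome:
        code[syndrome] ^= 1  # flip the erroneous bit back
    d = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(d)), syndrome


word = hamming74_encode(0b1011)
word[4] ^= 1  # soft error at position 5 (list index 4)
data, syndrome = hamming74_decode(word)
print(data == 0b1011, syndrome)  # True 5
```

A scrubber is essentially this decode step run periodically over the whole memory, writing the corrected word back before a second upset lands in the same word and exceeds what the code can correct.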

Software-level techniques: These techniques aim to enhance the fault tolerance of the software by using different programming paradigms, such as exception handling, checkpointing, or recovery methods. For example, a retry loop can handle soft errors by repeating the execution of a code segment until a check, such as an assertion or a comparison of duplicated results, confirms the result. Some other examples of software-level techniques are:

- Using assertions or contracts to verify the correctness of the input, output, or intermediate values of the software.
- Using watchdog timers or heartbeats to monitor the status of the software and trigger a reset or a recovery action if needed.
- Using software diversity or redundancy, such as N-version programming or recovery blocks, to execute multiple versions of the software and select the output by voting or an acceptance test.
- Using software hardening or immunization, such as masking or voting, to protect the software from the effects of soft errors by adding extra code or data.
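
A retry loop needs some way to decide that a result is correct. One simple option, sketched below, is to run the computation twice and accept the result only when both runs agree. The helper name `run_with_retry` and the simulated fault are hypothetical, and the sketch assumes the computation is deterministic and free of side effects.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(compute: Callable[[], T], max_attempts: int = 3) -> T:
    """Run compute twice per attempt; accept the result only if both runs agree."""
    for _ in range(max_attempts):
        first = compute()
        second = compute()
        if first == second:
            return first
    raise RuntimeError("results kept disagreeing; escalate to higher-level recovery")


calls = {"n": 0}


def flaky_sum() -> int:
    """Sum 0..99, with a simulated bit flip on the first call only."""
    calls["n"] += 1
    result = sum(range(100))
    if calls["n"] == 1:
        result ^= 1 << 4  # simulated upset
    return result


print(run_with_retry(flaky_sum))  # 4950
```

The first attempt sees two different values and is discarded, and the second attempt agrees on 4950. If the runs never agree, the fault is probably not transient, so the sketch raises instead of looping forever.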

Soft errors in VLSI are a present and future problem that requires continuous research and innovation. I hope this post has given you some useful information and sparked your interest in this topic. If you want to learn more, you can check out some of the references below. Thank you for reading and feel free to share your thoughts and comments.

References:

- [Soft Error Reliability of VLSI Circuits: Analysis and Mitigation Techniques](https://link.springer.com/book/10.1007/978-3-030-51610-9)

- [Soft Errors in VLSI: Present and Future](https://ieeexplore.ieee.org/document/1135487)

- [A survey of circuit-level soft error mitigation methodologies](https://link.springer.com/article/10.1007/s10470-018-1300-8)

- [Soft Error Rate Estimation of VLSI Circuits](https://link.springer.com/chapter/10.1007/978-3-030-51610-9_2)

- [Introduction: Soft Error Modeling](https://link.springer.com/chapter/10.1007/978-3-030-51610-9_1)

MinJi Lee

Sales Director at IROC Technologies

1 year ago

We can mitigate soft errors even at the cell level by using TFIT from IROC Technologies. https://www.iroctech.com/tfit-best-in-class-cell-level-soft-error-detector-iroc/

Kumar Abhishek

Sr Engineering Director : SoC Verification

1 year ago

Architectural solutions are more crucial than technology solutions, because the technology changes needed for soft error resilience often work against the fundamental goal of lower power: when designers drop the supply voltage to get better power figures, noise tolerance drops sharply. ECC, triple voting on critical registers and their intelligent physical placement, and fault-handling mechanisms are therefore areas of much-needed innovation.

Ravi Teja Velpula

Staff Device Engineer at SK Hynix America

1 year ago

Barsha Jain

Tooba Arifeen

Asst. Prof /Researcher | Computer Engineering| IC Design| Telecommunication| Research Interests: Digital Design Optimization, Fault Tolerance, Approximate Computing, DNN Inference Acceleration

1 year ago

Some of my works in the domain of approximate computing for TMR to combat soft errors might be of interest: https://scholar.google.co.kr/citations?user=wWH5jasAAAAJ&hl=en

Varun M S

1 year ago

We used to have a joke in my last team that if you can't debug a problem, then it's a cosmic ray bit flip issue!
