Graphics processing units are available solutions for high-performance safety-critical applications, such as self-driving cars. In this application domain, functional-safety and reliability are major concerns. Thus, the adoption of fault tolerance techniques is mandatory to detect or correct faults, since these devices must work properly, even when faults are present. GPUs are designed and implemented with cutting-edge technologies, which makes them sensitive to faults caused by radiation interference, such as single event upsets. These effects can lead the system to a failure, which is unacceptable in safety-critical applications. Therefore, effective detection and mitigation strategies must be adopted to harden the GPU operation. In this paper, we analyze transient effects in the pipeline registers of a GPU architecture. We run four applications at three GPU configurations, considering the source of the fault, its effect on the GPU, and the use of software-based hardening techniques. The evaluation was performed using a general-purpose soft-core GPU based on the NVIDIA G80 architecture. Results can guide designers in building more resilient GPU architectures.
The different versions of the original document can be found in:
Published on 01/01/2020
Volume 2020, 2020
DOI: 10.1109/isvlsi49217.2020.00076
Licence: CC BY-NC-SA license
Are you one of the authors of this document?