Processor designers face the challenge of defect formation, leading to permanent faults, during fabrication and operation. Permanent or hard fault tolerance is an important problem in computing systems, solutions to which can help improve yield during fabrication and reduce the cost of transistor mortality during the service life of the processor. This paper presents PreFix, a method to handle hard errors to keep a faulty core running and correctly executing instructions. Instead of turning off faulty structures, PreFix predicts early on whether an instruction is likely to use faulty components, then refines this prediction later in the pipeline to actually detect when an error has occurred. Instructions marked as possibly- faulty in the front-end are queued for duplicate execution on a separate core. At commit, results from the original and duplicate instructions are compared. Upon a mismatch, the original instruction is patched up, the pipeline flushed and execution continues. Using PreFix, faulty components can continue performing useful work when their errors do not manifest in architecturally visible state changes. This enhances processor lifetime with minimal performance overhead.
The different versions of the original document can be found in:
DOIS: 10.17863/cam.21614 10.1109/dft.2017.8244459
Published on 01/01/2017
Volume 2017, 2017
DOI: 10.17863/cam.21614
Licence: CC BY-NC-SA license
Are you one of the authors of this document?