High performance fault tolerance through predictive instruction re-execution

Abstract

Processor designers must contend with defects that form during fabrication and operation and lead to permanent faults. Tolerating permanent, or hard, faults is an important problem in computing systems: solutions can improve yield during fabrication and reduce the cost of transistor mortality over the service life of the processor. This paper presents PreFix, a method for handling hard errors that keeps a faulty core running and executing instructions correctly. Instead of turning off faulty structures, PreFix predicts early in the pipeline whether an instruction is likely to use faulty components, then refines this prediction later in the pipeline to detect whether an error has actually occurred. Instructions marked as possibly faulty in the front-end are queued for duplicate execution on a separate core. At commit, the results of the original and duplicate instructions are compared. Upon a mismatch, the original instruction's result is patched, the pipeline is flushed, and execution continues. With PreFix, faulty components can continue to perform useful work whenever their errors do not manifest as architecturally visible state changes. This extends processor lifetime with minimal performance overhead.
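
The predict, duplicate, and compare-at-commit flow described above can be illustrated with a small functional sketch. It is a toy model under stated assumptions, not the PreFix hardware design: the stuck-at-0 fault on the multiplier output, the opcode-based predictor, and the names `Instr`, `faulty_exec`, `golden_exec`, `predict_faulty`, and `run` are all hypothetical. The sketch does not model the later refinement of the prediction or dependences between instructions, so a flush is only counted.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Instr:
    op: str  # "add" or "mul"
    a: int
    b: int


def golden_exec(instr):
    """Fault-free execution, standing in for the separate core."""
    if instr.op == "add":
        return instr.a + instr.b
    return instr.a * instr.b


def faulty_exec(instr):
    """Faulty core. Assumed fault: bit 0 of the multiplier output is stuck at 0."""
    result = golden_exec(instr)
    if instr.op == "mul":
        result &= ~1
    return result


def predict_faulty(instr):
    """Front-end prediction (assumed): flag anything that may use the multiplier."""
    return instr.op == "mul"


def run(program):
    # The original instructions execute on the faulty core.
    originals = [faulty_exec(instr) for instr in program]

    # Instructions flagged in the front-end are queued for duplicate execution.
    dup_queue = deque(
        (idx, instr) for idx, instr in enumerate(program) if predict_faulty(instr)
    )
    duplicates = {}
    while dup_queue:
        idx, instr = dup_queue.popleft()
        duplicates[idx] = golden_exec(instr)

    # At commit, compare original and duplicate; patch and count a flush on mismatch.
    committed, flushes = [], 0
    for idx, value in enumerate(originals):
        if idx in duplicates and duplicates[idx] != value:
            value = duplicates[idx]
            flushes += 1
        committed.append(value)
    return committed, flushes


program = [Instr("add", 2, 3), Instr("mul", 2, 3), Instr("mul", 3, 5)]
committed, flushes = run(program)
print(committed, flushes)
```

In this example both multiplications are flagged, but only the second produces a wrong value on the faulty unit (the product 6 already has bit 0 clear), so the faulty multiplier still does useful work for the first one and only a single flush is needed.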