REPAIR: Hard-Error Recovery via Re-Execution

Latest revision as of 16:30, 3 February 2021

Abstract

Processor reliability at upcoming technology nodes presents significant challenges to designers, arising from increased manufacturing variability, parametric variation and transistor wear-out leading to permanent faults. We present a design that tolerates these faults at the microarchitectural level: a chip with n cores together with one or more shared instruction re-execution units (IRUs). Instructions that use a faulty component are identified and re-executed on an IRU. This design incurs no slowdown in the absence of errors and allows all n cores to continue operating after multiple hard errors on one or all cores in the structures protected by our scheme. Experiments show that a single-core chip experiences only a 23% slowdown with 1 error, rising to 43% in the presence of 5 errors. In a 4-core scenario with 4 errors on every core and a shared IRU, REPAIR achieves 0.68× the performance of a fully functioning system.

This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) through grants EP/K026399/1 and EP/J016284/1. Experiments used the Darwin Supercomputer of the University of Cambridge High Performance Computing Service (http://www.hpc.cam.ac.uk/), funded by the Higher Education Funding Council for England and the Science and Technology Facilities Council. This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/DFT.2015.7315139
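The routing idea stated in the abstract (instructions that use a faulty component are identified and re-executed on an IRU) can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the mechanism described in the paper: the names `Core`, `Instruction`, `dispatch`, and the unit labels such as `alu0` are hypothetical, and the model ignores timing, pipelining, and how faults are detected.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    core_id: int
    # Hypothetical labels for components known to have a hard error.
    faulty_units: set = field(default_factory=set)


@dataclass
class Instruction:
    opcode: str
    unit: str  # Hypothetical: the component this instruction would use.


def dispatch(core, program):
    """Split a program into instructions that run on the core as usual
    and instructions that would be sent to a shared IRU instead."""
    local, reexecuted = [], []
    for instr in program:
        if instr.unit in core.faulty_units:
            # The instruction touches a faulty component: route it to the IRU.
            reexecuted.append(instr)
        else:
            local.append(instr)
    return local, reexecuted


# Example: one core with a single faulty unit.
core = Core(core_id=0, faulty_units={"alu1"})
program = [
    Instruction("add", "alu0"),
    Instruction("mul", "alu1"),
    Instruction("sub", "alu0"),
]
local, reexecuted = dispatch(core, program)
```

With no faulty units recorded, `dispatch` sends nothing to the IRU, which mirrors the abstract's claim that the design incurs no slowdown in the absence of errors.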

Original document

The different versions of the original document can be found in:

https://www.repository.cam.ac.uk/bitstream/1810/249256/1/Soman%20et%20al%202015%20Defect%20and%20Fault%20Tolerance%20in%20VLSI%20and%20Nanotechnology%20Systems%20Symposium.pdf

https://www.repository.cam.ac.uk/handle/1810/249256

http://xplorestaging.ieee.org/ielx7/7304347/7315124/07315139.pdf?arnumber=7315139

http://dx.doi.org/10.1109/dft.2015.7315139

https://dblp.uni-trier.de/db/conf/dft/dft2015.html#SomanMMJ15

https://ieeexplore.ieee.org/document/7315139

https://academic.microsoft.com/#/detail/1957991871
