
ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-02272025-225803


Thesis type
PhD thesis
Author
FERRANTE, NICOLA
URN
etd-02272025-225803
Title
Efficient and Effective SW-Based Fault Protection Techniques for modern CPUs
Scientific disciplinary sector
IINF-01/A - Electronics
Degree programme
INGEGNERIA DELL'INFORMAZIONE (Information Engineering)
Supervisors
tutor Prof. Fanucci, Luca
tutor Rossi, Francesco
Keywords
  • error detection
  • fault protection
  • safety-critical systems
Defence session start date
05/03/2025
Availability
Not available for consultation
Release date
05/03/2095
Abstract
The increasing complexity of the functions that must be performed by computing systems requires the use of the most advanced HW platforms and VLSI technologies. In domains like automotive, railway and industry, where Functional Safety (FuSa) must be achieved and demonstrated, this raises several challenges. Indeed, the required safety functions shall not have a negative impact on the high performance of such HW platforms. Furthermore, the failure modes considered when evaluating the effectiveness of such safety functions must be representative of the physical behaviour of transistor devices.
These challenges have been addressed by dividing them into three topics: 1) providing efficient detection of permanent random HW faults, 2) providing efficient detection of transient errors, and 3) adopting effective and realistic fault models.

The research in this thesis focuses on application-class CPU cores, which are a fundamental component of high-performance computing platforms thanks to their flexibility and computational power.

The research on topic 1) has led to the development of a HW support that reduces the overhead caused by interleaving functional and non-functional code, i.e., Self-Test Libraries (STLs). The HW support, named HUSTLE, reduces this overhead by providing a separate memory channel for hosting STL code and an efficient scheduling mechanism that exploits the CPU idle time caused by cache misses. The results of the experimental campaign confirm that HUSTLE can be used to execute a significant amount of code while the CPU is waiting for instructions to be fetched during a cache miss. Moreover, its use reduces the CPU load, enabling high-frequency scheduling of STLs in applications with demanding safety requirements, with minimal impact on area and power consumption.

Regarding topic 2), we evaluated the impact of state-of-the-art instruction duplication techniques on the execution time of out-of-order (OoO) CPU cores. First, we analysed the possible failure modes of an OoO CPU pipeline caused by transient errors, in order to identify the mitigations that instruction duplication schemes shall implement. Then, among the state-of-the-art instruction duplication schemes, we selected those that implement these mitigations and evaluated their impact, in terms of additional instructions executed and additional execution time required, on CPUs with increasing pipeline size. The results show that, as the pipeline size increases, a significant part of the overhead caused by the execution of additional instructions is absorbed by the CPU architecture; still, the measured overhead is non-negligible. Hence, a possible research direction is to identify the instructions that have the highest impact on the overall safety of the system, in order to provide ad-hoc solutions that allow for lower overhead without impairing the detection capabilities.
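
As an illustration only, the basic idea behind instruction duplication can be sketched in Python: each operation is performed twice on independent copies of the data, and the two results are compared before the value is used. This is a minimal sketch, not one of the schemes evaluated in the thesis; the names `duplicated_sum`, `TransientErrorDetected` and the `upset_at` fault-injection parameter are hypothetical, and real schemes operate on machine instructions and registers rather than on Python variables.

```python
class TransientErrorDetected(Exception):
    """Raised when the primary and shadow computations disagree (hypothetical name)."""


def duplicated_sum(values, upset_at=None):
    """Accumulate `values` twice and compare the two results before returning.

    `upset_at` is a hypothetical fault-injection hook: if it equals an element
    index, one bit of the primary accumulator is flipped right after that
    element is added, emulating a transient error.
    """
    acc = 0          # primary copy
    acc_shadow = 0   # duplicated (shadow) copy
    for index, value in enumerate(values):
        acc += value
        acc_shadow += value          # duplicated operation
        if index == upset_at:
            acc ^= 1 << 3            # emulated single-bit upset (assumption)
    # Check point: compare the copies before the result leaves the function.
    if acc != acc_shadow:
        raise TransientErrorDetected("primary and shadow results differ")
    return acc
```

In this sketch every addition is executed twice, which mirrors the additional instructions whose cost is measured above; the comparison is deferred to a single check point at the end rather than performed after every operation.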

The last topic addressed is the adoption of effective fault models for the design and evaluation of online testing mechanisms. Advanced technology nodes are more sensitive to aging mechanisms such as Bias Temperature Instability (BTI) and Hot Carrier Injection (HCI), which cause a shift in device parameters. This shift results in higher transition delays, which can add up along a sequence of logic gates until the timing constraints of the circuit are violated. The stuck-at fault model, which is currently used in the evaluation of online testing mechanisms, does not capture this behaviour. Among the state-of-the-art fault models, the Transition Path Delay Fault (TPDF) model is the one that captures it best. The adoption of the TPDF model, however, brings several problems, because of its greater complexity and its poor support in Electronic Design Automation (EDA) tools for functional testing. Hence, a methodology to handle this complexity has been formalized. Two problems have been addressed: selecting the paths that are most likely to cause a failure, the candidate failing paths (CFP), and performing fault injection simulation according to this model. Regarding the first, we developed a flow that exploits available EDA tools to extract the CFP set by applying aging functions. Then, we developed a procedure to enable functional fault injection campaigns on large designs, exploiting zero-delay gate-level simulations. The procedure relies on simple and portable test logic that can be easily integrated into testbenches to accurately model the effect of delay faults, and it allows TPDF simulation to be performed in a time comparable to that of stuck-at fault simulation. Future work will complete the methodology by providing guidelines for the development of test routines and for the computation of fault coverage.
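
The accumulation of aged gate delays along a path, and the resulting selection of candidate failing paths, can be illustrated with a small Python sketch. It does not reproduce the EDA-based flow of the thesis: the `Path` type, the function names, the delay values and the uniform `aging_factor` are all assumptions made for illustration, whereas a real aging function would depend on the device, its stress conditions and the technology node.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Path:
    """A timing path described by the fresh delay of each gate (hypothetical type)."""
    name: str
    gate_delays_ns: Tuple[float, ...]


def aged_delay(path: Path, aging_factor: float) -> float:
    """Sum the gate delays after scaling each one by (1 + aging_factor).

    A uniform scaling factor is an assumption of this sketch.
    """
    return sum(delay * (1.0 + aging_factor) for delay in path.gate_delays_ns)


def candidate_failing_paths(paths: List[Path],
                            clock_period_ns: float,
                            aging_factor: float) -> List[str]:
    """Return the names of paths whose aged delay exceeds the clock period."""
    return [p.name for p in paths
            if aged_delay(p, aging_factor) > clock_period_ns]


# Made-up example data: one long path close to the timing limit, one short path.
paths = [
    Path("p0", (0.2, 0.3, 0.4)),
    Path("p1", (0.1, 0.1)),
]
fresh = candidate_failing_paths(paths, clock_period_ns=1.0, aging_factor=0.0)
aged = candidate_failing_paths(paths, clock_period_ns=1.0, aging_factor=0.15)
```

With these made-up numbers no path violates timing when fresh, while the longer path `p0` exceeds the clock period once every gate delay is increased by 15%, which is the kind of behaviour a stuck-at model cannot represent.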

The research on the identified topics addresses the problems arising from the use of high-performance HW platforms in critical application domains, by means of improvements and innovative solutions that make it possible to meet high-integrity requirements without impairing performance.