Thesis etd-06032022-224653
Thesis type
Master's thesis
Author
LO GERFO, MATTEO
URN
etd-06032022-224653
Title
Design and Verification of a Hardware Accelerator for Homomorphic Encryption
Department
INFORMATION ENGINEERING
Degree programme
ELECTRONIC ENGINEERING
Supervisors
supervisor Prof. Saponara, Sergio
co-supervisor Di Matteo, Stefano
Keywords
- hardware accelerator
- Homomorphic Encryption
- IoT
- Seal Embedded
Exam session start date
20/06/2022
Availability
Not available for consultation
Release date
20/06/2092
Abstract
Storing data in encrypted form on a server or in the cloud can be considered secure, but the data must be decrypted in order to be processed, which opens a window of opportunity for attackers to steal it. Homomorphic Encryption (HE) was introduced to perform specific operations directly on encrypted data (the ciphertext) without the decryption step, producing a new ciphertext that is the encrypted result of the operation that would have been performed on the plaintext. HE is nowadays considered a strong privacy-preserving solution that lets users share data in the cloud or on any non-secure server while denying attackers, and even cloud owners, any chance to learn anything about it. In the last decade many open-source HE libraries have been developed, including lattice-based libraries such as Microsoft SEAL and PALISADE, whose security relies on the assumption that lattice problems are intractable for both quantum and classical computers. Their proven hardness and their resistance even to quantum-computer attacks make lattice-based cryptographic constructions a valid alternative to modern cryptosystems and the leading candidates in the post-quantum cryptography standardization project launched by NIST in 2016. However, although lattice-based cryptography and homomorphic encryption guarantee data safety, they demand substantial computational resources and memory, which limits their use on resource-constrained IoT devices. In 2021 an embedded-oriented spin-off of Microsoft SEAL was released: SEAL Embedded (SE), the first lattice-based HE library specifically designed for embedded devices. Building on the Ring Learning With Errors (RLWE) decision problem, which is recognized to be quantum resistant, this library implements an HE scheme called CKKS that allows encryption of floating-point numbers.
The RLWE algorithm is built on the arithmetic of polynomials in a cyclotomic ring R whose coefficients are reduced modulo Q; given a secret key s, the ciphertext is composed of two polynomials (a, −a · s + m + e), where a is sampled from a uniform distribution over R, e is a small error sampled from a discrete Gaussian distribution, and m is the plaintext. All the modular polynomial operations needed to evaluate the ciphertext cost time and resources. To multiply polynomials quickly, the SE library uses the Number Theoretic Transform (NTT), a variant of the Discrete Fourier Transform that works in modular integer arithmetic: the elements of an input vector of polynomial coefficients are multiplied by twiddle factors (powers of a primitive root of unity) and combined in butterfly operations. Despite SE's memory and performance optimizations, this operation is still expensive to compute and requires memory proportional to the polynomial degree to store all the roots. In this thesis we propose a hardware accelerator designed to overcome these performance and memory limitations during encryption. We carried out a benchmark campaign of the SE library on two RISC-V soft cores (the 32-bit RISCY and the 64-bit CVA6) implemented on a Xilinx ZCU106 board equipped with the Zynq UltraScale+ MPSoC. Consistent with the library paper, our benchmarks showed that the encryption bottleneck is the NTT: with the polynomial degree set to 4096, its execution takes up to 200 milliseconds on the RISCY, so that a single encryption of 8 KB takes 3 seconds. Furthermore, even though SE mainly targets resource-constrained devices, for polynomial degrees higher than 4096 it could not be executed successfully because of the small on-chip memory available. Our goal is therefore to reduce the time needed to perform the NTT and the modular operations, and to reduce the memory required to store all the roots the NTT needs.
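The NTT-based multiplication described above can be illustrated with a minimal Python sketch. It is not the SE implementation or the accelerator's datapath: the toy parameters (N = 8, Q = 17, PSI = 3) and all function names are assumptions chosen so that the result can be checked by hand against a schoolbook product.

```python
# Toy parameters (assumptions for illustration, far too small to be secure):
# ring Z_Q[x]/(x^N + 1) with N = 8, Q = 17, and PSI = 3, a primitive
# 2N-th root of unity modulo Q.
N, Q, PSI = 8, 17, 3


def ntt(values, omega, q):
    """Iterative radix-2 NTT: out[k] = sum(values[j] * omega**(j*k)) mod q."""
    a = list(values)
    n = len(a)
    # Bit-reversal permutation so the butterflies can work on adjacent blocks.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # log2(n) stages of butterflies.
    length = 2
    while length <= n:
        w_step = pow(omega, n // length, q)  # twiddle increment for this stage
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w % q
                a[start + k] = (u + v) % q         # butterfly: upper output
                a[start + k + half] = (u - v) % q  # butterfly: lower output
                w = w * w_step % q
    return a


def negacyclic_mul_ntt(f, g):
    """Multiply f and g modulo (x^N + 1, Q) using the NTT."""
    omega = PSI * PSI % Q
    # Pre-scale by powers of PSI so a cyclic NTT yields a negacyclic product.
    f_hat = ntt([c * pow(PSI, i, Q) % Q for i, c in enumerate(f)], omega, Q)
    g_hat = ntt([c * pow(PSI, i, Q) % Q for i, c in enumerate(g)], omega, Q)
    prod_hat = [x * y % Q for x, y in zip(f_hat, g_hat)]
    # Inverse transform: NTT with omega^-1, then scale by N^-1 and PSI^-i.
    prod = ntt(prod_hat, pow(omega, -1, Q), Q)
    n_inv, psi_inv = pow(N, -1, Q), pow(PSI, -1, Q)
    return [c * n_inv * pow(psi_inv, i, Q) % Q for i, c in enumerate(prod)]


def negacyclic_mul_schoolbook(f, g):
    """Reference O(N^2) product modulo (x^N + 1, Q)."""
    out = [0] * N
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            sign = -1 if i + j >= N else 1  # x^N wraps around to -1
            out[(i + j) % N] = (out[(i + j) % N] + sign * fi * gj) % Q
    return out
```

The transform replaces the quadratic schoolbook loop with two forward NTTs, a pointwise product and one inverse NTT. Every butterfly stage consumes powers of the root of unity, which is why the roots must either be stored or regenerated.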
The accelerator design is composed of three main modules:
• An Arithmetic Logic Unit (ALU), used to perform the RLWE algorithm
• A Root Generator that computes all the roots of unity
• An AXI Slave Interface to exchange data with the CPU
Our system is also equipped with two random access memories:
• A 64 KB dual-port RAM (DPRAM) accessed by both the CPU and the ALU, used to store all the polynomial coefficients needed for encryption and the resulting ciphertext
• A 64 KB single-port RAM accessed by both the Root Generator and the ALU, used to store the evaluated roots
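As a rough software analogue of the Root Generator, the sketch below produces the powers of a primitive root of unity with one modular multiplication per root, so that no precomputed table is needed. The function names, the toy modulus and the natural (non bit-reversed) output order are assumptions; the abstract does not describe the internal algorithm of the hardware module. The 32-bit word size and the 16384-entry table in the sizing helper are likewise illustrative assumptions, not figures stated in the abstract.

```python
def find_primitive_2n_root(n, q):
    """Smallest psi with psi**n == -1 (mod q). For n a power of two this
    means psi has multiplicative order exactly 2n."""
    for candidate in range(2, q):
        if pow(candidate, n, q) == q - 1:
            return candidate
    raise ValueError("no primitive 2n-th root of unity modulo q")


def generate_roots(psi, n, q):
    """Yield psi**0, psi**1, ..., psi**(n-1) mod q, using one modular
    multiplication per root instead of a stored table."""
    root = 1
    for _ in range(n):
        yield root
        root = root * psi % q


def table_bytes(n, word_bytes=4):
    """Memory needed to hold n roots of word_bytes bytes each
    (word size is an assumption for illustration)."""
    return n * word_bytes


# Toy example: n = 8, q = 17 (hypothetical parameters, not SE's).
psi = find_primitive_2n_root(8, 17)
roots = list(generate_roots(psi, 8, 17))
```

Under these assumptions, `table_bytes(16384)` gives 65536 bytes, which shows how a root table grows linearly with the polynomial degree.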
The system is intended to work with all the polynomial degrees available in SE, and the workflow runs as follows. The software evaluates the coefficient vectors of s, a and (m + e) and then configures the accelerator, which starts the root generation. The processor writes the vector s into the DPRAM and the accelerator performs the first NTT; the same is done for the vector (m + e). Finally, the processor writes the vector a, which is already in NTT form, into the DPRAM and the accelerator starts the RLWE encryption, storing the result in the DPRAM to make it accessible to the processor. The hardware accelerator was designed in SystemVerilog and tested using Questa Advanced Simulator. For the verification phase, a simulation environment was developed that includes an AXI4 Master emulator to generate the AXI4 stimuli and transactions. The design was synthesized with the Xilinx Vivado Design Suite for the target FPGA board (the Xilinx ZCU106 board equipped with the Zynq UltraScale+ MPSoC), reaching a clock frequency of 150 MHz. A new benchmark campaign was performed running the SE code on the RISCY-based system with the proposed hardware accelerator connected through a standard AXI4 interface. The achieved speed-up is around 20x for the whole encryption process, with memory savings of up to 832 KB. Since the AXI interface slows down data exchange because of its high latency, for the final tests our design was equipped with a Direct Memory Access module, providing a 95x encryption speed-up.
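The encryption workflow above (NTT of s, NTT of (m + e), then a pointwise combination with a already in NTT form) can be mimicked in a few lines of Python. This is a sketch under stated assumptions, not the accelerator's behaviour: the toy ring (N = 8, Q = 17, PSI = 3), the ternary secret, the naive quadratic transform, the function names, and the choice to leave the result in the NTT domain are all hypothetical.

```python
import random

# Toy ring Z_Q[x]/(x^N + 1); PSI is a primitive 2N-th root of unity mod Q.
# These values are assumptions for illustration and offer no security.
N, Q, PSI = 8, 17, 3


def ntt_negacyclic(poly):
    """Naive O(N^2) negacyclic NTT: evaluate poly at the odd powers of PSI."""
    return [sum(c * pow(PSI, (2 * k + 1) * j, Q) for j, c in enumerate(poly)) % Q
            for k in range(N)]


def intt_negacyclic(values):
    """Inverse of ntt_negacyclic."""
    psi_inv, n_inv = pow(PSI, -1, Q), pow(N, -1, Q)
    return [n_inv * sum(v * pow(psi_inv, (2 * k + 1) * j, Q)
                        for k, v in enumerate(values)) % Q
            for j in range(N)]


def accelerator_encrypt(s, m_plus_e, a_hat):
    """Mimic the workflow: transform s and (m + e), then combine pointwise
    with a_hat, which is assumed to be already in NTT form."""
    s_hat = ntt_negacyclic(s)          # first NTT
    me_hat = ntt_negacyclic(m_plus_e)  # second NTT
    # In the NTT domain the polynomial product a * s becomes pointwise.
    c1_hat = [(-a * sk + me) % Q for a, sk, me in zip(a_hat, s_hat, me_hat)]
    return a_hat, c1_hat


rng = random.Random(0)
s = [rng.choice([-1, 0, 1]) % Q for _ in range(N)]  # ternary secret (assumption)
m_plus_e = [rng.randrange(Q) for _ in range(N)]     # stand-in for m + e
a_hat = [rng.randrange(Q) for _ in range(N)]        # uniform, sampled in NTT form

c0_hat, c1_hat = accelerator_encrypt(s, m_plus_e, a_hat)

# Sanity check with the secret key: c1 + a * s should give back m + e.
s_hat = ntt_negacyclic(s)
recovered = intt_negacyclic(
    [(c1 + a * sk) % Q for c1, a, sk in zip(c1_hat, c0_hat, s_hat)]
)
```

Because a is supplied directly in NTT form, only two forward transforms are needed per encryption in this sketch, matching the two NTT steps in the workflow described above.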
File
File name | Size |
---|---|
Thesis not available for consultation. |