Thesis etd-04102019-140633
Thesis type
Master's thesis
Author
BALLERI, SARA
URN
etd-04102019-140633
Title
Exploring time vs. area trade-offs in the deployment of Deep Neural Networks on reconfigurable FPGAs
Department
INFORMATION ENGINEERING
Degree programme
EMBEDDED COMPUTING SYSTEMS
Supervisors
Supervisor: Prof. Buttazzo, Giorgio C.
Co-supervisor: Dr. Biondi, Alessandro
Co-supervisor: Pagani, Marco
Keywords
- apsoc
- binary neural networks
- deep neural networks
- dynamic partial reconfiguration
- FINN
- FPGA
- FRED
- hardware acceleration
- PYNQ
- Zynq
Defense session date
03/05/2019
Availability
Not available for consultation
Release date
03/05/2089
Abstract
Recent research on neural networks has shown significant advantages in machine learning over traditional algorithms based on handcrafted features and models. Instead of engineering algorithms by hand, the possibility of automatically training computing systems on huge amounts of data has led to performance improvements in important domains such as computer vision, robotics, web search, and speech recognition.
Nowadays, the most popular class of techniques used in these domains is deep learning. Deep Neural Networks (DNNs) are compute-intensive learning models with growing applicability in a wide range of fields. These models have proven effective at solving complex learning problems, but they require complex, memory-intensive computations that prevent general-purpose CPUs from achieving reasonable performance levels. For instance, most modern DNN architectures include convolutional layers whose very high computational requirements cannot be satisfied even by modern CPUs.
For this reason, specialized hardware is typically needed to accelerate the computations required by DNNs.
Hardware acceleration for deep learning is currently dominated by General-Purpose Graphics Processing Units (GPGPUs), thanks to the rich availability of software stacks that simplify the deployment of DNNs on such platforms.
When performing the computations required by DNNs, GPGPUs outperform CPUs by orders of magnitude thanks to their much higher capacity for parallel computation. Given the pervasive application and effectiveness of DNNs, there is growing research and industrial interest in applying deep learning in embedded systems. In this context, it is crucial to ensure time-predictable and energy-efficient execution of the computations required by DNNs. Unfortunately, GPGPUs generally fail to meet these requirements.
Field Programmable Gate Arrays (FPGAs) represent a very flexible alternative to GPGPUs for developing high-performance hardware accelerators. They generally provide better performance per watt than GPGPUs and are definitely more time-predictable; indeed, they are considered strong competitors of GPGPUs for deploying DNNs. On the other hand, programming FPGAs requires hardware-specific knowledge, and they have often been considered platforms for specialists. However, during the last decade, several tools offering software-like programming models for FPGAs (e.g., high-level synthesis) have matured considerably, making FPGAs a more attractive choice for deploying DNNs.
Nevertheless, even with these tools available, accelerating DNNs remains particularly challenging due to the very complex computations they require. For this reason, several frameworks have been developed to help deploy DNNs on FPGAs. However, the hardware accelerators generated by these frameworks are extremely demanding in terms of FPGA resources and are often incompatible with platforms featuring a small FPGA fabric. Moreover, in many practical cases, system designers may need to deploy other hardware modules on the FPGA alongside those required for DNN acceleration, or may need DNN acceleration on resource-constrained FPGAs (e.g., in small, low-cost embedded devices).
Contribution
This thesis addresses this issue by specifically targeting DNN accelerators for resource-constrained FPGAs. We leverage Dynamic Partial Reconfiguration (DPR), a prominent feature of modern FPGAs that allows a portion of the fabric to be reconfigured at run time while the modules deployed on the other portions continue to operate. To handle DNN accelerators under DPR, we propose a DNN partitioning technique that allows deploying DNN accelerators on FPGA fabrics in which they would not entirely fit: the DNN is decomposed into partitions that are cyclically programmed onto the FPGA, and each partition is also optimized to improve its timing performance. Specifically, this work performs an in-depth analysis of DNN run-time performance in the presence of partitioning and explores the trade-offs between FPGA area consumption and execution time when optimizing the DNN partitioning.
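To make the cyclic execution scheme concrete, the following is a minimal, hypothetical sketch (not the thesis implementation): the `Partition` class and the `load_partial_bitstream`/`run_partition` helpers are placeholders for the platform-specific DPR and accelerator-invocation primitives that would be used on an APSoC such as a Zynq device.

```python
# A minimal, hypothetical sketch of cyclic DNN execution under Dynamic
# Partial Reconfiguration (DPR). Partition, load_partial_bitstream and
# run_partition are placeholders for platform-specific primitives; they
# do not correspond to the actual implementation described in the thesis.

from dataclasses import dataclass
from typing import List

@dataclass
class Partition:
    name: str        # e.g., a group of consecutive layers
    bitstream: str   # path to the partial bitstream implementing this group

def load_partial_bitstream(path: str) -> None:
    # Placeholder: program the shared reconfigurable region with this partial bitstream.
    print(f"[DPR] reconfiguring region with {path}")

def run_partition(part: Partition, activations):
    # Placeholder: invoke this partition's accelerator on the current activations.
    print(f"[RUN] executing {part.name}")
    return activations  # in a real system, the accelerator's output feature maps

def run_partitioned_dnn(partitions: List[Partition], input_tensor):
    """Execute a DNN whose layers are split into partitions that are
    cyclically programmed onto the FPGA, one partition at a time."""
    activations = input_tensor
    for part in partitions:
        load_partial_bitstream(part.bitstream)          # reconfigure the shared region
        activations = run_partition(part, activations)  # forward intermediate results
    return activations

# Example: a DNN split into three partitions sharing one reconfigurable region.
parts = [Partition("conv1-conv3", "p0.bit"),
         Partition("conv4-conv6", "p1.bit"),
         Partition("fc1-fc2", "p2.bit")]
output = run_partitioned_dnn(parts, input_tensor=[0.0])
```

Fewer, larger partitions reduce the number of reconfigurations (and thus reconfiguration overhead) but require more FPGA area per partition, which is the time vs. area trade-off explored in this work.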
File
File name | Size |
---|---|
Thesis not available for consultation. |