ETD system

Electronic theses and dissertations repository


Thesis etd-04102019-140633

Thesis type
Master's degree thesis
Exploring time vs. area trade-offs in the deployment of Deep Neural Networks on reconfigurable FPGAs
Degree programme
Supervisor: Prof. Buttazzo, Giorgio C.
Co-supervisor: Dr. Biondi, Alessandro
Co-supervisor: Pagani, Marco
Keywords
  • deep neural networks
  • FRED
  • Zynq
  • FPGA
  • binary neural networks
  • apsoc
  • dynamic partial reconfiguration
  • hardware acceleration
  • PYNQ
  • FINN
Date of thesis defence
Sealed ex officio
Release date
Abstract
Recent research on neural networks has shown significant advantages in machine learning over traditional algorithms based on handcrafted features and models. Instead of engineering algorithms by hand, the possibility of automatically training computing systems on huge amounts of data has led to performance improvements in important domains such as computer vision, robotics, web search, and speech recognition.
Nowadays, the most popular class of techniques used in these domains is called deep learning. Deep Neural Networks (DNNs) are compute-intensive learning models with growing applicability in a wide range of fields. These computing models have demonstrated their ability and effectiveness in solving complex learning problems, but they require complex and memory-intensive computations that prevent general-purpose CPUs from achieving reasonable performance levels. For instance, most modern DNN architectures include convolutional layers whose very high computational requirements cannot be satisfied even by modern CPUs.
For this reason, specialized hardware is typically needed to accelerate the computations required by DNNs.
The current state of hardware acceleration for deep learning is largely dominated by General-Purpose Graphics Processing Units (GPGPUs), thanks to the rich availability of software stacks that simplify the deployment of DNNs on such platforms.
When performing the computations required by DNNs, GPGPUs outperform CPUs by orders of magnitude thanks to their superior ability to perform parallel computations. Due to the pervasive application and effectiveness of DNNs, there is increasing research and industrial interest in applying deep learning in the context of embedded systems. In this context, it is crucial to ensure time-predictable and energy-efficient execution of the computations required by DNNs. Unfortunately, GPGPUs generally fail to match these requirements.
Field Programmable Gate Arrays (FPGAs) represent a very flexible alternative to GPGPUs for developing high-performance hardware accelerators. They generally provide better performance per watt than GPGPUs and are definitely more time-predictable. Indeed, they are considered a strong competitor to GPGPUs for deploying DNNs. On the other hand, programming FPGAs requires hardware-specific knowledge, and they have often been considered platforms for specialists. However, during the last decade, several tools adopting software-like programming models for FPGAs (e.g., high-level synthesis) have matured considerably, making FPGAs a more attractive choice for deploying DNNs.
Nevertheless, even with these tools available, accelerating DNNs is particularly challenging due to the very complex computations they require. For this reason, several frameworks have been developed to help deploy DNNs on FPGAs. The hardware accelerators these frameworks generate are extremely demanding in terms of FPGA resources and are often incompatible with platforms with a small FPGA fabric. However, in many practical cases, system designers may need to deploy other hardware modules on the FPGA alongside those required for DNN acceleration, or may need to provide DNN acceleration on resource-constrained FPGAs (e.g., in small, low-cost embedded devices).

This thesis addresses this issue by specifically targeting DNN accelerators for resource-constrained FPGAs. We leverage Dynamic Partial Reconfiguration (DPR), a prominent feature of modern FPGAs that allows a portion of their area to be reconfigured at run-time while the modules deployed on the remaining portions continue to operate. To handle DNN accelerators under DPR, we propose a DNN partitioning technique that allows deploying DNN accelerators on FPGA fabrics in which they would not entirely fit. DNNs are decomposed into partitions that are cyclically programmed on the FPGA. The partitions are also optimized to improve their timing performance. Specifically, this work performs an in-depth analysis of DNN run-time performance in the presence of partitioning and explores the trade-offs between FPGA area consumption and execution time when optimizing the DNN partitioning.
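The partitioning idea above can be illustrated with a minimal sketch (this is not the thesis implementation): consecutive DNN layers are greedily packed into partitions that each fit an FPGA area budget, and the end-to-end latency is estimated as the layers' compute time plus one partial-reconfiguration cost per partition swap. All layer areas, execution times, and the reconfiguration cost below are hypothetical placeholder numbers.

```python
# Illustrative sketch of area-constrained DNN partitioning under DPR.
# Hypothetical model: each partition pays one reconfiguration cost before
# its layers execute; a smaller area budget means more partitions and
# therefore more reconfiguration overhead (the time vs. area trade-off).

def partition_layers(layer_areas, area_budget):
    """Greedily pack consecutive layers into partitions that fit the budget."""
    partitions, current, used = [], [], 0
    for i, area in enumerate(layer_areas):
        if area > area_budget:
            raise ValueError(f"layer {i} alone exceeds the area budget")
        if used + area > area_budget:
            partitions.append(current)
            current, used = [], 0
        current.append(i)
        used += area
    if current:
        partitions.append(current)
    return partitions

def estimated_latency(partitions, layer_times, reconf_time):
    """Total time: one reconfiguration per partition, then its layers run."""
    return sum(reconf_time + sum(layer_times[i] for i in part)
               for part in partitions)

# Hypothetical per-layer areas (arbitrary units) and execution times (ms).
areas = [30, 25, 40, 20, 35]
times = [1.0, 0.8, 1.5, 0.6, 1.2]

parts = partition_layers(areas, area_budget=60)
# -> [[0, 1], [2, 3], [4]]: three partitions cycled through the same region
total_ms = estimated_latency(parts, times, reconf_time=2.0)  # ~ 11.1 ms
```

Shrinking `area_budget` in this toy model increases the partition count, and with it the total reconfiguration overhead, mirroring the area/time trade-off the thesis explores.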