ETD system

Electronic theses and dissertations repository


Thesis etd-04052020-135026

Thesis type
Master's thesis
Performance Analysis of Stream Processing Systems on Multi-cores
Degree programme
Supervisor Mencagli, Gabriele
Keywords
  • data stream processing
  • parallel computing
  • performance analysis
  • Apache Storm
  • Apache Flink
  • WindFlow
Graduation session start date
Officially embargoed
Abstract
Real-time requirements are an increasingly common constraint for large-scale applications that must process large volumes of data in a timely manner. This has encouraged the development of Stream Processing Systems (SPSs): general-purpose frameworks that let application developers focus mainly on the business logic of their applications, while the provided abstractions hide low-level implementation tasks such as resource scheduling and data exchange. Many state-of-the-art SPSs handle high-throughput input streams by adopting a scale-out approach, i.e., by dividing the workload among several nodes of a distributed system. They also rely on the Java Virtual Machine (JVM), for its portability and for the popularity of the Java language. This distributed design fails to exploit the full potential of modern multi-core processors, since the processing bandwidth these systems achieve often falls well short of the memory bandwidth limit of the machine. This work selects two well-established distributed frameworks (Apache Flink and Apache Storm) and compares their performance and programming model with WindFlow, a C++17 stream processing library that explicitly targets shared-memory systems. The benchmarks are based on two data streaming applications commonly used in prior work to evaluate the performance of SPSs. In the single-node multi-core scenario, our results show a substantial improvement in both throughput and latency for WindFlow compared with the state-of-the-art frameworks. The main contribution of this thesis is to demonstrate that the gain obtained is large enough to justify investing resources in developing SPSs that target shared-memory systems, in addition to the distributed solutions available so far.