Thesis etd-05252015-105805
Thesis type
Master's thesis
Author
FERRANTI, ANDREA
URN
etd-05252015-105805
Title
Multi-objective evolutionary fuzzy systems for Big Data on Apache Spark
Department
INGEGNERIA DELL'INFORMAZIONE
Degree programme
COMPUTER ENGINEERING
Supervisors
supervisor Antonelli, Michela
supervisor Prof. Marcelloni, Francesco
supervisor Prof. Lazzerini, Beatrice
Keywords
- Apache Spark
- Big Data
- FRBS
- Multi-objective evolutionary algorithms
- Multi-objective optimization
Exam session start date
19/06/2015
Availability
Not available
Release date
19/06/2085
Abstract
Over the last few years, the generation of fuzzy rule-based systems (FRBSs) from data has been tackled with a multi-objective optimization approach, with accuracy and interpretability as the objectives to be optimized. Multi-objective evolutionary algorithms (MOEAs) have been used so often in this context that the FRBSs generated with them have been denoted as multi-objective evolutionary fuzzy systems (MOEFSs). In this thesis, we adopt an MOEA-based approach to learn concurrently the rule and data bases of fuzzy rule-based classifiers (FRBCs) and Mamdani fuzzy rule-based systems (MFRBSs). In particular, the rule bases are generated by a rule and condition selection (RCS) strategy which, during the evolutionary process, selects a reduced number of rules from a heuristically generated set of candidate rules and a reduced number of conditions for each selected rule. As for data base learning, the membership function parameters of each linguistic term used in the rules are learned concurrently with the application of RCS.
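As a rough illustration of the accuracy versus interpretability trade-off, the sketch below filters a set of candidate solutions down to its non-dominated (Pareto) subset. It is not the MOEA used in the thesis: the two objectives (error rate and a complexity count, both minimized), the function names, and the sample values are assumptions made only for this example.

```python
from typing import List, Tuple

# Each candidate is (error_rate, complexity); both objectives are minimized.
# The encoding and the numbers below are hypothetical, for illustration only.
Candidate = Tuple[float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and strictly better on at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


candidates = [(0.10, 40), (0.12, 15), (0.20, 5), (0.15, 30), (0.25, 8)]
front = pareto_front(candidates)
print(front)
```

Each surviving point is a different compromise: lowering the error further requires a more complex system, and vice versa.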
One of the most critical aspects limiting the use of MOEFSs is the computational effort needed to execute them. This effort is strongly affected by the computation of the fitness, especially when the dataset is large.
To address this limitation, we exploit Apache Spark, a fast and general cluster computing system for Big Data applications. Spark extends the popular MapReduce model to efficiently support more types of computations, including iterative programs. It revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.
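The idea of spreading the fitness computation over partitions of the data can be sketched in plain Python, without Spark. Everything here is an assumption for illustration: the list-of-lists partitioning merely stands in for an RDD, the error rate stands in for the accuracy objective, and `threshold_classifier` is a hypothetical toy model rather than a fuzzy rule-based classifier.

```python
from functools import reduce
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (features, class label)
Classifier = Callable[[Sequence[float]], int]


def split(data: List[Sample], n_parts: int) -> List[List[Sample]]:
    """Stand-in for RDD partitioning: deal the samples into n_parts chunks."""
    return [data[i::n_parts] for i in range(n_parts)]


def partial_counts(classify: Classifier, part: List[Sample]) -> Tuple[int, int]:
    """Map step: (misclassified, total) for a single partition."""
    errors = sum(1 for features, label in part if classify(features) != label)
    return errors, len(part)


def error_rate(classify: Classifier, parts: List[List[Sample]]) -> float:
    """Reduce step: sum the per-partition counts, then divide once."""
    errors, total = reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1]),
        (partial_counts(classify, part) for part in parts),
        (0, 0),
    )
    return errors / total if total else 0.0


def threshold_classifier(features: Sequence[float]) -> int:
    """Hypothetical toy classifier used only to exercise the sketch."""
    return 1 if features[0] > 0.5 else 0


data = [([0.1], 0), ([0.4], 0), ([0.6], 1), ([0.9], 1), ([0.7], 0), ([0.2], 1)]
rate = error_rate(threshold_classifier, split(data, 3))
print(rate)
```

Because each partition returns only a pair of counts, the result does not depend on how the data is split, which is what makes this kind of evaluation easy to distribute.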
In the experimental part, we first test our approach on twelve very large datasets, eight for classification and four for regression. We then compare the results obtained in classification and regression with those obtained by the well-known Decision Tree algorithm. Moreover, in classification, we compare our results with those obtained by the popular Random Forest ensemble method. The results show that our approach generates FRBCs and MFRBSs with accuracy comparable to, and sometimes better than, that of the other algorithms, but with significantly lower complexity.
Finally, we show the scalability of our approach by carrying out a number of experiments on a real-world big dataset. In particular, we evaluate the achievable speedup on a small computer cluster, highlighting the fact that the proposed approach allows handling big datasets even with modest hardware support.
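For reference, speedup is conventionally computed as the single-node run time divided by the run time on n nodes, and parallel efficiency as the speedup divided by n. The snippet below is a minimal sketch of that arithmetic; the timings are invented placeholders, not measurements from the thesis.

```python
def speedup(t_one: float, t_n: float) -> float:
    """Ratio of the single-node run time to the run time on n nodes."""
    return t_one / t_n


def efficiency(t_one: float, t_n: float, n_nodes: int) -> float:
    """Speedup divided by the number of nodes (1.0 would be ideal scaling)."""
    return speedup(t_one, t_n) / n_nodes


# Hypothetical run times in seconds, keyed by number of nodes.
timings = {1: 1200.0, 2: 640.0, 4: 350.0}
for n_nodes, seconds in timings.items():
    s = speedup(timings[1], seconds)
    e = efficiency(timings[1], seconds, n_nodes)
    print(f"{n_nodes} node(s): speedup {s:.2f}, efficiency {e:.2f}")
```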
File
File name | Size |
---|---|
Thesis not available. |