ETD system

Electronic theses and dissertations repository


Thesis etd-05252015-105805

Thesis type
Master's thesis
Multi-objective evolutionary fuzzy systems for Big Data on Apache Spark
Degree programme
supervisor Antonelli, Michela
supervisor Prof. Marcelloni, Francesco
supervisor Prof. Lazzerini, Beatrice
Keywords
  • Apache Spark
  • Big Data
  • FRBS
  • Multi-objective optimization
  • Multi-objective evolutionary algorithms
Exam session start date
Release date
Abstract
Over the last few years, the generation of fuzzy rule-based systems (FRBSs) from data has been tackled as a multi-objective optimization problem, with accuracy and interpretability as the objectives to be optimized. Multi-objective evolutionary algorithms (MOEAs) have been used so often in this context that the FRBSs they generate have come to be known as multi-objective evolutionary fuzzy systems (MOEFSs). In this thesis, we adopt an MOEA-based approach to learn concurrently the rule bases and data bases of fuzzy rule-based classifiers (FRBCs) and Mamdani fuzzy rule-based systems (MFRBSs). In particular, the rule bases are generated by a rule and condition selection (RCS) strategy which, during the evolutionary process, selects a reduced number of rules from a heuristically generated set of candidate rules and a reduced number of conditions for each selected rule. As regards data base learning, the membership function parameters of each linguistic term used in the rules are learned concurrently with the application of RCS.
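As an illustration of the multi-objective setting described above, the following minimal sketch filters a set of candidate systems down to its non-dominated (Pareto-optimal) members. It assumes two objectives to be minimized, classification error and a complexity count; the names `dominates` and `pareto_front` and the sample values are hypothetical and do not come from the thesis, which does not specify here which MOEA or complexity measure is used.

```python
from typing import List, Tuple

# Each candidate is (classification_error, complexity); both are minimized.
# Treating complexity as a count of rule conditions is an assumption.
Candidate = Tuple[float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Hypothetical (error, complexity) pairs, invented purely for illustration.
population = [(0.10, 40), (0.12, 25), (0.12, 30), (0.20, 10), (0.25, 12)]
front = pareto_front(population)
```

Each surviving candidate represents a different trade-off between accuracy and interpretability, which is why such an approach returns a set of systems rather than a single one.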
One of the most critical aspects limiting the use of MOEFSs is the computational effort required to run them. This effort is dominated by the fitness computation, especially when the dataset is large.
To address this limitation, we use Apache Spark, a fast and general cluster computing system for Big Data applications. Spark extends the popular MapReduce model to support more types of computation efficiently, including iterative programs. It revolves around the concept of a resilient distributed dataset (RDD), a fault-tolerant collection of elements that can be operated on in parallel.
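To make the idea of a partitioned fitness evaluation concrete, here is a minimal pure-Python sketch of the map-and-reduce pattern: each partition counts its own misclassifications, and the partial counts are then combined. It is not the thesis implementation. The helpers `split`, `partition_errors`, `error_rate`, and `toy_classifier` are hypothetical, and the data are invented. On Spark, the RDD operations `mapPartitions` and `reduce` could presumably play the roles of the map and reduce steps shown here.

```python
from functools import reduce
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (feature vector, class label)
Classifier = Callable[[Sequence[float]], int]


def split(data: List[Example], n_parts: int) -> List[List[Example]]:
    """Local stand-in for a partitioned dataset: round-robin split into chunks."""
    return [data[i::n_parts] for i in range(n_parts)]


def partition_errors(part: List[Example], classify: Classifier) -> Tuple[int, int]:
    """Map step: return (misclassified, total) for a single partition."""
    wrong = sum(1 for features, label in part if classify(features) != label)
    return wrong, len(part)


def error_rate(partitions: List[List[Example]], classify: Classifier) -> float:
    """Reduce step: sum the per-partition counts and return the global error rate."""
    partials = [partition_errors(p, classify) for p in partitions]
    wrong, total = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partials, (0, 0))
    return wrong / total if total else 0.0


def toy_classifier(features: Sequence[float]) -> int:
    """Hypothetical threshold rule standing in for a fuzzy rule-based classifier."""
    return 1 if features[0] >= 0.5 else 0


# Invented toy data: two of the six examples are misclassified by toy_classifier.
data = [([0.1], 0), ([0.4], 1), ([0.6], 1), ([0.9], 0), ([0.7], 1), ([0.2], 0)]
rate = error_rate(split(data, 3), toy_classifier)
```

Because only small count pairs travel between partitions, the result does not depend on how the data are split, which is what makes this kind of fitness evaluation straightforward to distribute.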
In the experimental part, we first test our approach on twelve very large datasets, eight for classification and four for regression. We then compare the results obtained in classification and regression with those of the well-known decision tree algorithm. For classification, we also compare our results with those of the popular random forest ensemble method. The results show that our approach generates FRBCs and MFRBSs with accuracy comparable to, and sometimes better than, that of the other algorithms, but with significantly lower complexity.
Finally, we show the scalability of our approach by carrying out a number of experiments on a real-world big dataset. In particular, we evaluate the achievable speedup on a small computer cluster, highlighting the fact that the proposed approach allows handling big datasets even with modest hardware support.
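For reference, speedup is conventionally defined as the ratio between the execution time on one computing unit and the time on n units, and parallel efficiency as the speedup divided by n. The short sketch below encodes these standard definitions; the function names are hypothetical, the exact measure used in the thesis is not stated here, and no timings from the thesis are reproduced.

```python
def speedup(time_one_unit: float, time_n_units: float) -> float:
    """Standard speedup: time on one computing unit divided by time on n units."""
    return time_one_unit / time_n_units


def efficiency(time_one_unit: float, time_n_units: float, n_units: int) -> float:
    """Parallel efficiency: speedup normalized by the number of computing units."""
    return speedup(time_one_unit, time_n_units) / n_units
```

An efficiency close to 1 indicates near-linear scaling, while lower values reflect communication and scheduling overhead.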