ETD

Digital archive of theses defended at the Università di Pisa

Thesis etd-07032018-091816


Thesis type
Master's thesis (tesi di laurea magistrale)
Author
BAGLINI, LORENZO
URN
etd-07032018-091816
Title
Development and analysis of Distributed Fuzzy Decision Tree on Spark
Department
INGEGNERIA DELL'INFORMAZIONE (Information Engineering)
Degree programme
COMPUTER ENGINEERING
Supervisors
supervisor Prof. Bechini, Alessio
Keywords
  • fuzzy partitioning
  • fuzzy entropy
  • fuzzy discretization
  • fuzzy decision tree
  • Big Data
  • Apache Spark
Defense session start date
20/07/2018
Availability
Full
Abstract
Decision trees (DTs) are widely used classifiers, employed in many different application domains such as security assessment, health systems and road traffic congestion. The popularity of decision trees is mainly due to the simplicity of their learning scheme and to their interpretability: they can explain how an output is inferred from the inputs. In a decision tree, each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The topmost node is called the root node.
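The structure just described can be illustrated with a minimal sketch. The class names, the attributes `speed` and `density`, the thresholds and the labels are all hypothetical and chosen only for illustration; they do not come from the thesis.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # class label held by the leaf


@dataclass
class Node:
    attribute: str   # attribute tested at this internal node
    threshold: float  # the test is: instance[attribute] <= threshold
    left: "Tree"     # branch followed when the test succeeds
    right: "Tree"    # branch followed when the test fails


Tree = Union[Leaf, Node]


def classify(tree: Tree, instance: Dict[str, float]) -> str:
    """Walk from the root to a leaf, following one branch per test."""
    while isinstance(tree, Node):
        if instance[tree.attribute] <= tree.threshold:
            tree = tree.left
        else:
            tree = tree.right
    return tree.label


# Hypothetical binary tree for a toy traffic example.
root = Node(
    "speed", 30.0,
    Leaf("congested"),
    Node("density", 50.0, Leaf("free"), Leaf("congested")),
)
```

Because every prediction follows a single root-to-leaf path, the sequence of tests on that path is the explanation of the output.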
In this thesis, we deal with fuzzy decision trees (FDTs), which integrate decision trees with fuzzy set theory, and we compare their performance with that of crisp decision trees (CDTs). Like classical decision trees, FDTs can be categorized into two main groups, binary split trees and multi-way split trees, depending on the splitting method. Binary split trees recursively partition the attribute space into two subspaces, while multi-way split trees partition it into a number of subspaces that can be greater than two.
FDT learning schemes require that a fuzzy partition has already been defined on each continuous attribute. For this reason, continuous attributes must be discretized, and this operation strongly affects the accuracy of the classifier.
In this thesis, we analyze the impact of a discretization method based on MDLP, using different fuzzy set shapes (triangular, trapezoidal and pseudo-Gaussian), on six small datasets and three big datasets.
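As a rough illustration of what a fuzzy partition on a continuous attribute looks like, the sketch below computes membership degrees for a strong triangular partition and for a single trapezoidal fuzzy set. The function names and the example peaks are assumptions made here; in the thesis the partition is produced by the MDLP-based discretizer, which this sketch does not reproduce, and the pseudo-Gaussian shape is not shown.

```python
def triangular_partition(x, cores):
    """Degrees of x in a strong triangular partition with peaks at `cores`.

    `cores` must be sorted and contain at least two values. Each value of x
    belongs to at most two adjacent fuzzy sets, and its degrees sum to 1.
    """
    degrees = [0.0] * len(cores)
    if x <= cores[0]:
        degrees[0] = 1.0
    elif x >= cores[-1]:
        degrees[-1] = 1.0
    else:
        for i in range(len(cores) - 1):
            lo, hi = cores[i], cores[i + 1]
            if lo <= x <= hi:
                w = (x - lo) / (hi - lo)  # how far x lies from lo towards hi
                degrees[i], degrees[i + 1] = 1.0 - w, w
                break
    return degrees


def trapezoidal(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], is 1 on [b, c], falls on [c, d].

    Assumes a < b <= c < d.
    """
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


# Hypothetical peaks for one attribute: three fuzzy sets centred at 0, 10 and 20.
example = triangular_partition(2.5, [0.0, 10.0, 20.0])
```

With these hypothetical peaks, the value 2.5 belongs to the first set with degree 0.75 and to the second with degree 0.25, unlike a crisp discretization, where it would fall into exactly one interval.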
To deal with Big Data, we chose Apache Spark, a well-known, fast in-memory computational engine that is well suited to tree learning and, more generally, to applications that need iterative computations.
We then analyze the impact of limiting the number of fuzzy sets in the discretization phase and of varying the minimum number of instances on a node in the classification phase. Finally, we evaluate experiments with a new approach to determining the output label given the activation degrees in the leaves.
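In a fuzzy decision tree an instance can reach several leaves at once, each with its own activation degree, so the degrees have to be combined into one output label. The sketch below shows two simple strategies, a maximum-matching rule and a weighted vote. It is not the new approach evaluated in the thesis, which the abstract does not describe; the function names, labels and degrees are hypothetical, and each leaf is assumed to hold a single class label.

```python
from collections import defaultdict


def max_matching(activations):
    """Return the label of the single leaf with the highest activation degree."""
    label, _ = max(activations, key=lambda pair: pair[1])
    return label


def weighted_vote(activations):
    """Sum the activation degrees per label and return the label with the largest total."""
    totals = defaultdict(float)
    for label, degree in activations:
        totals[label] += degree
    return max(totals, key=totals.get)


# Hypothetical (label, activation degree) pairs for the leaves reached by one instance.
reached = [("A", 0.5), ("B", 0.25), ("B", 0.375)]
```

On this example the two rules disagree: the most activated leaf is labelled A, while the leaves labelled B accumulate a larger total (0.625 against 0.5), so the choice of rule can change the predicted class.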
File