Thesis etd-07032018-091816

Thesis type

Master's thesis

Author

BAGLINI, LORENZO

URN

etd-07032018-091816

Title

Development and analysis of Distributed Fuzzy Decision Tree on Spark

Department

INGEGNERIA DELL'INFORMAZIONE

Degree programme

COMPUTER ENGINEERING

Supervisors

supervisor Prof. Bechini, Alessio

Keywords

Apache Spark
Big Data
fuzzy decision tree
fuzzy discretization
fuzzy entropy
fuzzy partitioning

Start date of the graduation session

20/07/2018

Availability

Full

Abstract

Decision trees (DTs) are widely used classifiers, employed in many application domains such as security assessment, health systems, and road traffic congestion. Their popularity is mainly due to the simplicity of the learning scheme and to their interpretability: they can explain how an output is inferred from the inputs. In a decision tree, each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The topmost node is called the root node.
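The node, branch, and leaf structure described above can be sketched in a few lines of Python. This is only an illustration of a crisp tree with binary threshold tests; the class names, the attributes `speed` and `density`, the thresholds, and the labels are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # class label held by the leaf


@dataclass
class Node:
    attribute: str                # attribute tested at this internal node
    threshold: float              # hypothetical test: value <= threshold
    left: Union["Node", Leaf]     # branch followed when the test is true
    right: Union["Node", Leaf]    # branch followed when the test is false


def classify(tree: Union[Node, Leaf], sample: Dict[str, float]) -> str:
    """Walk from the root to a leaf, following one branch per test."""
    node = tree
    while isinstance(node, Node):
        if sample[node.attribute] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label


# Hypothetical tree: the root tests "speed", one child tests "density".
root = Node(
    "speed", 30.0,
    Leaf("congested"),
    Node("density", 0.5, Leaf("free"), Leaf("congested")),
)
```

Because every prediction corresponds to a single root-to-leaf path, the sequence of tests along that path is itself the explanation of the output.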
In this thesis, we deal with fuzzy decision trees (FDTs), which integrate decision trees with fuzzy set theory, and we compare their performance with that of crisp decision trees (CDTs). Like classical decision trees, FDTs can be categorized into two main groups, binary split trees and multi-way split trees, depending on the splitting method. Binary split trees recursively partition the attribute space into two subspaces, while multi-way split trees partition the space into a number of subspaces that can be greater than two.
FDT learning schemes require that a fuzzy partition has already been defined on each continuous attribute. For this reason, continuous attributes must be discretized, and this operation drastically affects the accuracy of the classifier.
In this thesis, we analyze the impact of a discretization method based on the minimum description length principle (MDLP), using different fuzzy set shapes (triangular, trapezoidal, and pseudo-Gaussian), on six small datasets and three big datasets.
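To make the notion of a fuzzy partition concrete, the sketch below builds a triangular partition over one continuous attribute in Python. It assumes that the core points are already available (here they are placeholder values, not the output of an MDLP-based discretizer) and that each triangle's feet coincide with the neighbouring cores, so that the membership degrees of any value inside the range sum to 1. The function names are hypothetical, and the thesis may construct its partitions differently.

```python
def triangular(x, a, b, c):
    """Membership of x in a triangular fuzzy set with feet a, c and peak b."""
    if x == b:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if b < x < c:
        return (c - x) / (c - b)
    return 0.0


def strong_partition(cores):
    """Return one (a, b, c) triangle per core point of a sorted list.

    Each triangle peaks at its own core and reaches zero at the
    neighbouring cores; the first and last triangles are one-sided.
    """
    sets = []
    for i, b in enumerate(cores):
        a = cores[i - 1] if i > 0 else b
        c = cores[i + 1] if i < len(cores) - 1 else b
        sets.append((a, b, c))
    return sets


def memberships(x, sets):
    """Membership degree of x in every fuzzy set of the partition."""
    return [triangular(x, a, b, c) for a, b, c in sets]


# Placeholder core points for a single attribute.
partition = strong_partition([0.0, 5.0, 10.0])
degrees = memberships(2.5, partition)
```

With these placeholder cores, the value 2.5 belongs half to the first set and half to the second. Trapezoidal or pseudo-Gaussian shapes would replace only the `triangular` function.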
To deal with Big Data, we chose Apache Spark, a well-known, fast in-memory computational engine that is well suited to tree learning and, in general, to applications that need iterative computations.
We then analyze the impact of limiting the number of fuzzy sets in the discretization phase and the impact of varying the minimum number of instances per node in the classification phase. Finally, we evaluate experiments with a new approach to determining the output label given the activation degrees in the leaves.
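The abstract does not describe the new labelling approach, so the Python sketch below does not attempt to reproduce it. It only illustrates two strategies commonly found in the FDT literature, maximum matching and weighted vote, to show why the choice matters: in a fuzzy tree an instance can activate several leaves at once. The product T-norm, the simplification that each leaf carries a single label, and all names and numbers are assumptions made for illustration.

```python
from collections import defaultdict
from math import prod


def activation(path_memberships):
    """Activation degree of a leaf: product of the membership degrees
    met along the path from the root (product T-norm, an assumption)."""
    return prod(path_memberships)


def max_matching(leaves):
    """Return the label of the single most activated leaf.

    `leaves` is a list of (label, activation_degree) pairs.
    """
    return max(leaves, key=lambda leaf: leaf[1])[0]


def weighted_vote(leaves):
    """Sum the activation degrees per label and return the top label."""
    totals = defaultdict(float)
    for label, degree in leaves:
        totals[label] += degree
    return max(totals, key=totals.get)


# Hypothetical instance activating three leaves of a fuzzy tree.
leaves = [("A", 0.45), ("B", 0.30), ("B", 0.25)]
by_max = max_matching(leaves)
by_vote = weighted_vote(leaves)
```

On this hypothetical input the two strategies disagree: the most activated leaf votes for `A`, while the combined activation of the two `B` leaves is larger.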

File

File name	Size
Thesis.pdf	2.29 MB
ETD

Digital archive of the theses defended at the University of Pisa