
ETD

Digital archive of theses defended at the Università di Pisa

Thesis etd-09142017-230811


Thesis type
Master's thesis
Author
BAGHERI AGHABABA, AMIR
URN
etd-09142017-230811
Title
On discretization of continuous attributes in Big Data mining
Department
INGEGNERIA DELL'INFORMAZIONE
Degree programme
COMPUTER ENGINEERING
Supervisors
Supervisor Prof. Marcelloni, Francesco
Supervisor Prof. Bechini, Alessio
Keywords
  • Fuzzy Partitioning
  • Discretization
  • Big Data
  • Apache Spark
  • MapReduce
Defense session start date
03/10/2017
Availability
Not available for consultation
Release date
03/10/2087
Abstract
Coping with continuous features in datasets is a common issue across the many algorithms and methods of data mining. Discretization is the process of converting these continuous attributes into discrete intervals. Most data mining algorithms expect attributes to be categorical or discrete, and those that can handle continuous attributes tend to achieve lower accuracy than those that work with discrete and categorical attributes. Hence, discretization is an important issue to address. Discretization has also been described as a technique for data and noise reduction. Several discretization methods have been proposed, but most of them are designed to work with small datasets. In this thesis, we have implemented and compared two distributed fuzzy discretizers, namely fuzzy MDLP and fuzzy ur-CAIM, using the MapReduce programming paradigm and the Apache Spark framework. We have analyzed the behavior of these discretizers using a distributed fuzzy decision tree on nine well-known big datasets. These distributed discretizers can be more efficient in handling big datasets. We have also compared the two discretizers using different fuzzy membership functions. The thesis analyzes the results of the discretizers and discusses the reasons behind their behavior.
File