ETD system

Electronic theses and dissertations repository

Thesis etd-09142017-230811


Thesis type
Master's thesis
Author
BAGHERI AGHABABA, AMIR
URN
etd-09142017-230811
Title
On discretization of continuous attributes in Big Data mining
Department
INGEGNERIA DELL'INFORMAZIONE
Degree programme
COMPUTER ENGINEERING
Committee
supervisor Prof. Marcelloni, Francesco
supervisor Prof. Bechini, Alessio
Keywords
  • Apache Spark
  • Discretization
  • Fuzzy Partitioning
  • Big Data
  • MapReduce
Defence session start date
03/10/2017
Availability
partial
Release date
03/10/2020
Abstract
Coping with continuous features in data sets is a common issue across the many algorithms and methods of data mining. Discretization is the process of converting continuous attributes into discrete intervals. Most data mining algorithms expect attributes to be categorical or discrete, and those that can handle continuous attributes typically achieve lower accuracy than those that work with discrete and categorical attributes. Hence, discretization is an important issue to address. Discretization has also been described as a technique for data and noise reduction. Several discretization methods have been proposed, but most of them are designed to work with small data sets. In this thesis, we have implemented and compared two distributed fuzzy discretizers, namely fuzzy MDLP and fuzzy ur-CAIM, using the MapReduce programming paradigm and the Apache Spark framework. We have analyzed the behavior of these discretizers using a distributed fuzzy decision tree on nine well-known big data sets. These distributed discretizers can be more efficient in handling big data sets. We have also compared the two discretizers using different fuzzy membership functions. The thesis analyzes the results of the discretizers and discusses the reasons behind their behavior.
File