ETD

Digital archive of theses defended at the Università di Pisa

Thesis etd-09182018-155646


Thesis type
Master's degree thesis
Author
WANG, QIONGGE
URN
etd-09182018-155646
Title
Using Natural Language Processing and Data Mining Techniques for Amazon Reviews Data Analytics: A study
Department
INFORMATICA
Degree programme
INFORMATICA
Supervisors
supervisor Nanni, Mirco
co-supervisor Prof. Attardi, Giuseppe
Keywords
  • Visualization
  • Analysis
  • Natural Language Processing
  • Deep Learning
  • Algorithms
  • Data Mining
Exam session start date
05/10/2018
Availability
Not available for consultation
Release date
05/10/2088
Abstract
This thesis studies data mining and machine learning algorithms applied to non-text and
text datasets extracted from Amazon, including the construction of artificial neural
networks. Deep learning algorithms are now used in a wide variety of domains, and here
they are applied to the analysis of Amazon review comments. For the non-text data, the
thesis covers popular and useful algorithms for clustering (unsupervised learning) and
classification (supervised learning). Earlier chapters explain the theory behind the
specific algorithms used and describe the performance measures for the training and
validation datasets. The selected metrics include accuracy, classification error,
precision, recall, F-measure, false positives, false negatives, true positives, true
negatives, sensitivity, specificity, positive predictive value, and negative predictive
value.
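
As an illustration of how the listed metrics relate to one another, the sketch below
derives them from the four confusion-matrix counts of a binary classifier. It is not taken
from the thesis: the function name `binary_metrics` and the example counts are
hypothetical, and the F-measure is assumed to be the balanced F1 score.

```python
def _safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, fn, tn):
    """Derive common classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = _safe_div(tp + tn, total)
    precision = _safe_div(tp, tp + fp)    # same as positive predictive value
    recall = _safe_div(tp, tp + fn)       # same as sensitivity
    specificity = _safe_div(tn, tn + fp)
    npv = _safe_div(tn, tn + fn)          # negative predictive value
    f1 = _safe_div(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "classification_error": 1.0 - accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "negative_predictive_value": npv,
        "f1": f1,
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision and positive predictive value are the same quantity, as are recall and
sensitivity, so the list above contains fewer independent measures than names.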

For text classification, the thesis gives a theoretical description of four architectures:
convolutional neural networks (CNN) and the recurrent models long short-term memory
(LSTM), GRU, and Bi-LSTM. These networks are explained in terms of their structure, their
building blocks (artificial neurons), and two learning algorithms: backpropagation and
backpropagation through time. For the four architectures (CNN, GRU, LSTM, and Bi-LSTM),
many experiments were compiled, trained, and tested under different parameter settings.
The experiments used the TensorFlow/Keras frameworks, and a trained network can easily be
connected to any Python module.

To find the best word embedding method and the best RNN model, it is necessary to select
a suitable architecture and to optimize the network's hyperparameters. An experimental
procedure was therefore proposed and implemented in the previous chapter of this thesis to
compare the architectures in terms of their ability to learn, the effectiveness of the
training process, and classification performance. The procedure also includes automatic
optimization of the neural network's hyperparameters using scikit-learn's grid search and
random search functions.
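
The difference between the two search strategies can be sketched in plain Python. This is
not the thesis code and does not call scikit-learn, whose `GridSearchCV` and
`RandomizedSearchCV` classes provide this kind of search with cross-validation. The search
space, the parameter names, and `toy_score` (a stand-in for "train the network and return
its validation accuracy") are all assumptions made for illustration.

```python
import itertools
import random

# Hypothetical search space; the abstract does not list the actual hyperparameters.
SPACE = {
    "units": [32, 64, 128],
    "dropout": [0.2, 0.5],
    "learning_rate": [0.001, 0.01],
}


def toy_score(params):
    """Stand-in for training a network and returning its validation accuracy."""
    # By construction the score peaks at units=64, dropout=0.2, learning_rate=0.001.
    return (
        1.0
        - abs(params["units"] - 64) / 256
        - abs(params["dropout"] - 0.2)
        - abs(params["learning_rate"] - 0.001) * 10
    )


def grid_search(space, score):
    """Evaluate every combination in the space and return the best one."""
    keys = list(space)
    candidates = [
        dict(zip(keys, values))
        for values in itertools.product(*(space[k] for k in keys))
    ]
    return max(candidates, key=score)


def random_search(space, score, n_iter, seed=0):
    """Evaluate n_iter randomly sampled combinations and return the best one."""
    rng = random.Random(seed)
    candidates = [
        {key: rng.choice(values) for key, values in space.items()}
        for _ in range(n_iter)
    ]
    return max(candidates, key=score)


best_grid = grid_search(SPACE, toy_score)
best_random = random_search(SPACE, toy_score, n_iter=5)
print("grid search:  ", best_grid)
print("random search:", best_random)
```

Grid search evaluates all 12 combinations here, while random search evaluates only the 5
it samples, so it is cheaper but may miss the best combination.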

A good understanding of the quality of the data was achieved by applying different
data mining and natural language processing techniques. In addition, multiple
visualizations such as tables and graphs were created to support an intuitive and
subjective understanding of each model.
File