
ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-07052020-085158


Thesis type
Master's thesis
Author
CORCUERA BARCENA, JOSE LUIS
URN
etd-07052020-085158
Title
Fake News Detection through density-based data stream clustering
Department
INGEGNERIA DELL'INFORMAZIONE
Degree programme
COMPUTER ENGINEERING
Supervisors
supervisor Marcelloni, Francesco
supervisor Bechini, Alessio
supervisor Bondielli, Alessandro
Keywords
  • Temporal Clustering
  • Outliers
  • Fake News
  • Streaming
  • DBSCAN
  • Clustering
Exam session start date
20/07/2020
Availability
Not available for consultation
Release date
20/07/2090
Abstract
As the 2016 US presidential election showed, the spread of fake news can strongly influence public opinion and poses a serious danger today. At the same time, fake news is very hard to recognise, especially when it first appears. Moreover, people are the natural means of spreading fake news, since they often share news without evaluating its reliability.
For this reason, the scientific community has devoted considerable effort to automating fake news detection. One popular approach is to check a blacklist of unreliable sources and authors. Although this solution is very simple, it is ineffective in most cases, since fake news can be published by sources that are not currently considered unreliable. More complex solutions therefore have to be investigated, exploiting state-of-the-art approaches in text mining and machine learning.
In this thesis, we propose to employ sentence embedding representations and clustering algorithms for detecting fake news. In particular, we represent news items as real-valued vectors and exploit cosine similarity to identify similar news. We use a clustering algorithm to generate clusters of news considered reliable because they come from highly reliable sources. Outliers, that is, news items far from every cluster, are considered fake news. As the clustering algorithm, we adopt Temporal Streaming Fuzzy DBSCAN, which has been improved by introducing auto-estimation of its parameter values.
The approach is validated on datasets of tweets collected during specific events.
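The outlier idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation and it is not Temporal Streaming Fuzzy DBSCAN: it is a simplified, static density check that flags a news vector as an outlier when too few reliable vectors lie within a cosine-distance radius. The function names, the `eps` and `min_pts` values, and the tiny three-dimensional vectors standing in for sentence embeddings are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length real vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def is_outlier(candidate, reliable, eps=0.2, min_pts=2):
    """Return True if fewer than min_pts reliable vectors lie within
    cosine distance eps of the candidate (hypothetical thresholds)."""
    neighbours = sum(
        1 for r in reliable
        if 1.0 - cosine_similarity(candidate, r) <= eps
    )
    return neighbours < min_pts


# Toy vectors standing in for embeddings of news from reliable sources.
reliable = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [1.0, 0.0, 0.1],
]

close_item = [0.95, 0.1, 0.05]  # points in the same direction as the reliable group
far_item = [0.0, 0.1, 1.0]      # points in a very different direction

print(is_outlier(close_item, reliable))
print(is_outlier(far_item, reliable))
```

In this toy setting the first item sits inside the dense reliable region and the second does not. A streaming, fuzzy, time-aware algorithm such as the one adopted in the thesis would additionally update clusters as news arrives and estimate its parameters automatically, which this sketch does not attempt.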