ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-04132020-142514


Thesis type
Master's thesis
Author
PEDROTTI, ANDREA
URN
etd-04132020-142514
Title
Heterogeneous Document Embeddings for Multi-Lingual Text Classification
Department
FILOLOGIA, LETTERATURA E LINGUISTICA
Degree programme
INFORMATICA UMANISTICA
Supervisors
supervisor Moreo Fernández, Alejandro
supervisor Sebastiani, Fabrizio
Keywords
  • Multi-lingual Text Classification
  • Text Classification
  • Word Embeddings
  • Natural Language Processing
Defence date
27/04/2020
Availability
Not available for consultation
Release date
27/04/2090
Abstract
Supervised Text Classification (TC) is an NLP task in which, given a set of training documents labelled according to a finite number of classes, a classifier is trained to map unlabelled documents to the class or classes they are assumed to belong to, based on the documents' content.
For a classifier to be trained, documents must first be turned into vector representations.
While this has traditionally been achieved with the BOW ("bag of words") approach, the current research trend is to learn dense, continuous representations of documents, called embeddings.
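
As a minimal sketch of the BOW step (assuming scikit-learn; not an implementation prescribed by the thesis), the following vectorizes a toy corpus and trains a linear classifier on it:

    # Minimal BOW pipeline: sparse tf-idf vectors + a linear classifier.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    train_docs = ["the market rallied today", "the team won the match"]
    train_labels = ["finance", "sports"]  # toy codeframe

    vectorizer = TfidfVectorizer()                  # text -> sparse BOW vectors
    X_train = vectorizer.fit_transform(train_docs)

    clf = LinearSVC().fit(X_train, train_labels)
    print(clf.predict(vectorizer.transform(["stocks fell sharply"])))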

Multi-lingual Text Classification (MLTC) is a specific setting of TC. In MLTC, each document is written in one of a finite set of languages, and unlabelled documents must be classified according to a common codeframe (or "classification scheme").

We approach MLTC by means of funnelling, an algorithm originally proposed by Esuli et al. (2019). Funnelling is a two-tier ensemble-learning method: the first tier trains language-dependent classifiers that represent each document by its posterior probabilities for the classes in the codeframe, and the second tier trains a meta-classifier on all of these (language-independent) probabilistic representations.
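
A rough sketch of this two-tier scheme follows (a simplified illustration assuming scikit-learn, not the authors' implementation; in particular, calibration of the posterior probabilities is omitted):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy corpus: two languages, labels from a shared codeframe
    # (0 = finance, 1 = sports).
    docs = {"en": ["stocks fell sharply", "the team won the match",
                   "shares rose", "a great goal"],
            "it": ["la borsa crolla", "la squadra vince la partita",
                   "le azioni salgono", "un gran gol"]}
    labels = {"en": [0, 1, 0, 1], "it": [0, 1, 0, 1]}

    # First tier: one language-dependent classifier per language; its
    # posterior probabilities form a language-independent representation.
    posteriors, meta_y = [], []
    for lang, texts in docs.items():
        vec = TfidfVectorizer().fit(texts)
        clf = LogisticRegression().fit(vec.transform(texts), labels[lang])
        posteriors.append(clf.predict_proba(vec.transform(texts)))
        meta_y.extend(labels[lang])

    # Second tier: one meta-classifier trained on all posterior vectors.
    meta = LogisticRegression().fit(np.vstack(posteriors), meta_y)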

In this thesis we redesign funnelling by generalizing this procedure; we call the resulting framework Generalized Funnelling (gFun). In doing so, we enable gFun's meta-classifier to capitalize on different language-independent views of the document that go beyond the document-class correlations captured by the posterior probabilities used in "standard" funnelling.
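
Concretely, where standard funnelling feeds the meta-classifier posterior probabilities alone, gFun can concatenate several language-independent views per document, along these lines (a sketch with hypothetical placeholder arrays):

    import numpy as np

    n_docs, n_classes = 4, 2
    post_view = np.random.rand(n_docs, n_classes)  # posterior probabilities
    extra_view = np.random.rand(n_docs, 300)       # another language-independent view

    # The meta-classifier is trained on the horizontal concatenation of views.
    gfun_repr = np.hstack([post_view, extra_view])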

To exemplify such views, we experiment with embeddings derived from word-word correlations (MUSE embeddings; Conneau et al., 2018) and embeddings derived from word-class correlations (WCE embeddings; Moreo et al., 2019) aligned across languages.
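
As an illustration of how such a view can be computed at the document level (a toy sketch; real MUSE embeddings are 300-dimensional vectors aligned across languages, and the vectors below are hypothetical stand-ins):

    import numpy as np

    # Toy stand-ins for cross-lingually aligned (MUSE-style) word embeddings:
    # words from different languages live in one shared space.
    aligned = {"stock": np.array([0.90, 0.10]),
               "borsa": np.array([0.88, 0.12])}  # Italian, close to "stock"

    def doc_embedding(tokens, table, dim=2):
        """Average the embeddings of the document's in-vocabulary words."""
        vecs = [table[t] for t in tokens if t in table]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    print(doc_embedding(["la", "borsa"], aligned))  # -> [0.88 0.12]

A WCE-style view is built analogously, except that each word vector has one dimension per class (reflecting how strongly the word correlates with each class in the training data), so the resulting document embeddings directly encode word-class correlations.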

The extensive empirical evaluation we have carried out seems indeed to confirm the hypothesis that multiple language-independent views, capturing different types of correlations, are beneficial for MLTC.