ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-04132020-142514


Thesis type
Master's thesis
Author
PEDROTTI, ANDREA
URN
etd-04132020-142514
Title
Heterogeneous Document Embeddings for Multi-Lingual Text Classification
Department
FILOLOGIA, LETTERATURA E LINGUISTICA
Degree programme
INFORMATICA UMANISTICA
Supervisors
supervisor Moreo Fernández, Alejandro
supervisor Sebastiani, Fabrizio
Keywords
  • Multi-lingual Text Classification
  • Text Classification
  • Word Embeddings
  • Natural Language Processing
Defence date
27/04/2020
Availability
Not available for consultation
Release date
27/04/2090
Abstract
Supervised Text Classification (TC) is an NLP task in which, given a set of training documents labelled according to a finite number of classes, a classifier is trained to map unlabelled documents to the class or classes they are assumed to belong to, based on the documents' content.
For a classifier to be trained, documents must first be turned into vector representations.
While this has traditionally been achieved with the BOW ("bag of words") approach, the current research trend is to learn dense, continuous representations of documents, called embeddings.
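
As a minimal sketch of the BOW step (assuming scikit-learn; not an implementation prescribed by the thesis), the following vectorizes a toy corpus and trains a linear classifier on it:

    # Minimal BOW pipeline: sparse tf-idf vectors + a linear classifier.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    train_docs = ["the market rallied today", "the team won the match"]
    train_labels = ["finance", "sports"]  # toy codeframe

    vectorizer = TfidfVectorizer()                  # text -> sparse BOW vectors
    X_train = vectorizer.fit_transform(train_docs)

    clf = LinearSVC().fit(X_train, train_labels)
    print(clf.predict(vectorizer.transform(["stocks fell sharply"])))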

Multi-lingual Text Classification (MLTC) is a specific setting of TC. In MLTC, each document is written in one of a finite set of languages, and unlabelled documents must be classified according to a common codeframe (or "classification scheme").

We approach MLTC by means of funnelling, an algorithm originally proposed by Esuli et al. (2019). Funnelling is a two-tier ensemble-learning method: the first tier trains language-dependent classifiers that represent each document by its posterior probabilities for the classes in the codeframe, and the second tier trains a meta-classifier on all of these (language-independent) probabilistic representations.
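
A rough sketch of this two-tier scheme follows (a simplified illustration assuming scikit-learn, not the authors' implementation; in particular, calibration of the posterior probabilities is omitted):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy corpus: two languages, labels from a shared codeframe
    # (0 = finance, 1 = sports).
    docs = {"en": ["stocks fell sharply", "the team won the match",
                   "shares rose", "a great goal"],
            "it": ["la borsa crolla", "la squadra vince la partita",
                   "le azioni salgono", "un gran gol"]}
    labels = {"en": [0, 1, 0, 1], "it": [0, 1, 0, 1]}

    # First tier: one language-dependent classifier per language; its
    # posterior probabilities form a language-independent representation.
    posteriors, meta_y = [], []
    for lang, texts in docs.items():
        vec = TfidfVectorizer().fit(texts)
        clf = LogisticRegression().fit(vec.transform(texts), labels[lang])
        posteriors.append(clf.predict_proba(vec.transform(texts)))
        meta_y.extend(labels[lang])

    # Second tier: one meta-classifier trained on all posterior vectors.
    meta = LogisticRegression().fit(np.vstack(posteriors), meta_y)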

In this thesis we redesign funnelling by generalizing this procedure; we call the resulting framework Generalized Funnelling (gFun). In doing so, we enable gFun's meta-classifier to capitalize on different language-independent views of the document that go beyond the document-class correlations captured by the posterior probabilities used in "standard" funnelling.
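
Concretely, where standard funnelling feeds the meta-classifier posterior probabilities alone, gFun can concatenate several language-independent views per document, along these lines (a sketch with hypothetical placeholder arrays):

    import numpy as np

    n_docs, n_classes = 4, 2
    post_view = np.random.rand(n_docs, n_classes)  # posterior probabilities
    extra_view = np.random.rand(n_docs, 300)       # another language-independent view

    # The meta-classifier is trained on the horizontal concatenation of views.
    gfun_repr = np.hstack([post_view, extra_view])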

To exemplify such views, we experiment with embeddings derived from word-word correlations (MUSE embeddings; Conneau et al., 2018) and embeddings derived from word-class correlations (WCE embeddings; Moreo et al., 2019) aligned across languages.
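
As an illustration of how such a view can be computed at the document level (a toy sketch; real MUSE embeddings are 300-dimensional vectors aligned across languages, and the vectors below are hypothetical stand-ins):

    import numpy as np

    # Toy stand-ins for cross-lingually aligned (MUSE-style) word embeddings:
    # words from different languages live in one shared space.
    aligned = {"stock": np.array([0.90, 0.10]),
               "borsa": np.array([0.88, 0.12])}  # Italian, close to "stock"

    def doc_embedding(tokens, table, dim=2):
        """Average the embeddings of the document's in-vocabulary words."""
        vecs = [table[t] for t in tokens if t in table]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    print(doc_embedding(["la", "borsa"], aligned))  # -> [0.88 0.12]

A WCE-style view is built analogously, except that each word vector has one dimension per class (reflecting how strongly the word correlates with each class in the training data), so the resulting document embeddings directly encode word-class correlations.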

The extensive empirical evaluation we have carried out seems indeed to confirm the hypothesis that multiple language-independent views, capturing different types of correlations, are beneficial for MLTC.