
ETD

Digital archive of the theses defended at the University of Pisa

Thesis etd-08292022-105130


Thesis type
Master's degree thesis
Author
CANNARELLA, ROBERTO
URN
etd-08292022-105130
Title
MulTweEmo: A New Resource and Experiments on CLIP-based Multimodal Emotion Recognition
Department
FILOLOGIA, LETTERATURA E LINGUISTICA
Degree programme
INFORMATICA UMANISTICA
Supervisors
Supervisor: Lenci, Alessandro
Co-supervisor: Passaro, Lucia C.
Keywords
  • multimodal semantics
  • grounded cognition
  • emotion recognition
  • multimodal sentiment analysis
  • CLIP language model
  • language resources
Exam session start date
26/09/2022
Availability
Thesis not available for consultation
Abstract
This work focuses on the image-text emotion recognition (ITER) task, which consists of training NLP models that, by combining visual and textual information, can predict the emotion associated with a multimodal document.

Since research on ITER is still scarce, the first aim of this work is to frame it within a solid theoretical framework. To this end, Chapter 1 presents contributions from cognitive linguistics and communication studies arguing that language use is inherently multimodal and grounded in perceptual experience. The chapter also reviews the literature on how extra-linguistic information can be integrated into standard word embeddings, yielding so-called visual-semantic embeddings. As a task, ITER is closely connected with several others: textual emotion recognition (ER), object recognition, and image emotion recognition (IER). For each of them, previous research, methods, and theoretical considerations are described in Chapter 2. Chapter 3 then focuses on image-polarity classification, another task closely related to ITER, and finally surveys the little existing research on ITER itself.

The rest of the work reports the creation of a new multimodal resource, MulTweEmo, and discusses the related experiments. Chapter 4 describes the creation process, including how the emotion labels were defined, both automatically and, for a portion of the data, through crowdsourcing. Chapter 5 presents several experiments carried out on the dataset: since the resource pairs texts with pictures, it was used to feed classifiers with both unimodal (textual or visual) and multimodal embeddings, all of which were created with the CLIP model. The experiments show that multimodal classifiers outperform the unimodal ones across all experimental setups, suggesting that multimodality can be advantageous in tasks involving the analysis of the affective value of documents.
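
As an illustration of the kind of pipeline described above, the sketch below shows one possible way to obtain unimodal and multimodal CLIP embeddings for a text-image pair and feed them to an emotion classifier. It is not the thesis code: the Hugging Face transformers implementation of CLIP, the checkpoint name, the input file names, the concatenation strategy, and the number of emotion classes are all illustrative assumptions.

# Illustrative sketch (not the thesis code): unimodal and multimodal CLIP
# embeddings for a text-image pair. The library, checkpoint, file names, and
# the 8-class emotion inventory are assumptions made for demonstration purposes.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

text = "What a beautiful day at the beach!"   # hypothetical tweet text
image = Image.open("tweet_image.jpg")         # hypothetical attached picture

inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)

with torch.no_grad():
    # Unimodal embeddings: one from the text encoder, one from the image encoder.
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# A simple multimodal representation: concatenate the two unimodal embeddings.
multimodal_emb = torch.cat([text_emb, image_emb], dim=-1)

# Illustrative downstream classifier: a linear layer over the concatenated
# embedding, predicting one of NUM_EMOTIONS classes (the actual label set
# used in the thesis is defined in Chapter 4).
NUM_EMOTIONS = 8
classifier = torch.nn.Linear(multimodal_emb.shape[-1], NUM_EMOTIONS)
logits = classifier(multimodal_emb)
print(logits.shape)  # torch.Size([1, 8]) for a single text-image pair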