
ETD

Digital archive of the theses defended at the University of Pisa

Thesis etd-08292022-105130


Thesis type
Master's degree thesis
Author
CANNARELLA, ROBERTO
URN
etd-08292022-105130
Title
MulTweEmo: A New Resource and Experiments on CLIP-based Multimodal Emotion Recognition
Department
FILOLOGIA, LETTERATURA E LINGUISTICA
Degree programme
INFORMATICA UMANISTICA
Supervisors
Supervisor: Lenci, Alessandro
Co-supervisor: Passaro, Lucia C.
Keywords
  • multimodal semantics
  • grounded cognition
  • emotion recognition
  • multimodal sentiment analysis
  • CLIP language model
  • language resources
Exam session start date
26/09/2022
Availability
Thesis not available for consultation
Abstract
This work focuses on the image-text emotion recognition (ITER) task, which consists of training NLP models that, by combining visual and textual information, can predict the emotion associated with a multimodal document.

Since research on ITER is still scarce, the first aim of this work is to frame it within a solid theoretical framework. To this end, Chapter 1 presents contributions from cognitive linguistics and communication studies arguing that language use is inherently multimodal and grounded in perceptual experience. The chapter also reviews the literature on how extra-linguistic information can be integrated into standard word embeddings, yielding so-called visual-semantic embeddings. As a task, ITER is closely connected with several others: textual emotion recognition (ER), object recognition, and image emotion recognition (IER). For each of them, previous research, methods, and theoretical considerations are described in Chapter 2. Chapter 3 then focuses on image-polarity classification, another task closely related to ITER, and finally surveys the little existing research on ITER itself.

The rest of the work reports the creation of a new multimodal resource, MulTweEmo, and discusses the related experiments. Chapter 4 describes the creation process, including how the emotion labels were defined, both automatically and, for a portion of the data, through crowdsourcing. Chapter 5 presents several experiments carried out on the dataset: since the resource pairs texts with pictures, it was used to feed classifiers with both unimodal (textual or visual) and multimodal embeddings, all of which were created with the CLIP model. The experiments show that multimodal classifiers outperform the unimodal ones across all experimental setups, suggesting that multimodality can be advantageous in tasks involving the analysis of the affective value of documents.
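
As an illustration of the kind of pipeline described above, the sketch below shows one possible way to obtain unimodal and multimodal CLIP embeddings for a text-image pair and feed them to an emotion classifier. It is not the thesis code: the Hugging Face transformers implementation of CLIP, the checkpoint name, the input file names, the concatenation strategy, and the number of emotion classes are all illustrative assumptions.

# Illustrative sketch (not the thesis code): unimodal and multimodal CLIP
# embeddings for a text-image pair. The library, checkpoint, file names, and
# the 8-class emotion inventory are assumptions made for demonstration purposes.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

text = "What a beautiful day at the beach!"   # hypothetical tweet text
image = Image.open("tweet_image.jpg")         # hypothetical attached picture

inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)

with torch.no_grad():
    # Unimodal embeddings: one from the text encoder, one from the image encoder.
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# A simple multimodal representation: concatenate the two unimodal embeddings.
multimodal_emb = torch.cat([text_emb, image_emb], dim=-1)

# Illustrative downstream classifier: a linear layer over the concatenated
# embedding, predicting one of NUM_EMOTIONS classes (the actual label set
# used in the thesis is defined in Chapter 4).
NUM_EMOTIONS = 8
classifier = torch.nn.Linear(multimodal_emb.shape[-1], NUM_EMOTIONS)
logits = classifier(multimodal_emb)
print(logits.shape)  # torch.Size([1, 8]) for a single text-image pair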