
ETD

Digital archive of the theses discussed at the Università di Pisa

Thesis etd-01272023-171329


Thesis type
Master's thesis
Author
BIANCHI, LORENZO
URN
etd-01272023-171329
Title
Design and development of cross-modal retrieval techniques based on transformer architectures
Department
INGEGNERIA DELL'INFORMAZIONE
Degree programme
ARTIFICIAL INTELLIGENCE AND DATA ENGINEERING
Supervisors
Supervisor Prof. Cimino, Mario Giovanni Cosimo Antonio
Supervisor Prof. Gennaro, Claudio
Supervisor Prof. Falchi, Fabrizio
Supervisor Dott. Messina, Nicola
Keywords
  • aladin
  • computer vision
  • cross-modality
  • deep learning
  • image-text retrieval
  • multi-modality
  • nlp
  • pytorch
  • transformer
Defence session start date
17/02/2023
Availability
Full
Abstract
Human beings experience the world in a multi-modal manner. We form thoughts by combining pieces of information about the objects we see, the sounds we hear, the tactile sensations we feel, the odors we smell, and so on. In recent years, progress in deep learning has made machines more capable of understanding the meaning of texts, images, audio, and video. By understanding the hidden semantic connections between these different types of unstructured data, we can process this information jointly to approach multi-modal problems, resembling what humans do in everyday life. This thesis focuses on the joint processing of images and natural language sentences. In particular, we study the technologies behind cross-modal retrieval models between these two types of information. We exploit new combinations of technologies and techniques to improve the results obtained by ALADIN, a cross-modal image-text retrieval model that achieves performance close to that of its competitors, the large Vision-Language Transformers, while being 90 times faster.
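The core idea of cross-modal retrieval described above can be illustrated with a minimal sketch (this is not the ALADIN architecture; the embeddings and helper names below are hypothetical): images and sentences are mapped into a shared embedding space, and retrieval ranks the gallery items by their cosine similarity to the query embedding.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_emb, gallery_embs):
    """Return gallery indices sorted by decreasing similarity to the query."""
    sims = [cosine(query_emb, g) for g in gallery_embs]
    return sorted(range(len(gallery_embs)), key=lambda i: -sims[i])

# Toy text-to-image search: the (made-up) sentence embedding is closest
# to the second image embedding, so index 1 is ranked first.
text_emb = [0.9, 0.1]
image_embs = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
print(retrieve(text_emb, image_embs))  # -> [1, 2, 0]
```

The same machinery runs in both directions (text-to-image and image-to-text) simply by swapping which modality plays the role of the query.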

By introducing some modifications to the visual pipeline in the backbone of the architecture, we were able to improve the model's performance. In particular, we improved the results reported in the original paper on the recall@k metric for the alignment head, the head of the model that aligns the image and text representations in a fine-grained manner. On the MS COCO dataset, we improved the rsum, the sum of the recall@k values for the chosen k (1, 5, and 10), by 0.8 points on the 1K test set and by 3.5 points on the 5K test set.
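The rsum metric mentioned above can be sketched as follows (a minimal illustration with made-up rankings, not the thesis evaluation code): recall@k is the fraction of queries whose correct match appears among the top-k retrieved items, and rsum sums the recall@k percentages over k = 1, 5, and 10.

```python
def recall_at_k(ranked_ids, true_id, k):
    """1.0 if the correct item is within the top-k results, else 0.0."""
    return 1.0 if true_id in ranked_ids[:k] else 0.0

def rsum(rankings, ks=(1, 5, 10)):
    """rankings: list of (ranked_ids, true_id) pairs, one per query.
    Returns the sum over k of the mean recall@k, in percentage points."""
    n = len(rankings)
    return sum(
        100.0 * sum(recall_at_k(r, t, k) for r, t in rankings) / n
        for k in ks
    )

# Toy example with two queries: the first query's match is ranked 1st,
# the second query's match is ranked 3rd.
queries = [([0, 1, 2], 0), ([5, 6, 7], 7)]
print(rsum(queries))  # recall@1=50, recall@5=100, recall@10=100 -> 250.0
```

In image-text retrieval, rsum is typically reported as the total over both retrieval directions, so an improvement of a few points aggregates small gains across all six recall@k values.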

The code to reproduce our results is available at https://github.com/lorebianchi98/ALADIN-2.0.