Thesis etd-01272023-171329

Thesis type

Master's thesis

Author

BIANCHI, LORENZO

URN

etd-01272023-171329

Title

Design and development of cross-modal retrieval techniques based on transformer architectures

Department

INGEGNERIA DELL'INFORMAZIONE

Degree programme

ARTIFICIAL INTELLIGENCE AND DATA ENGINEERING

Supervisors

supervisor Prof. Cimino, Mario Giovanni Cosimo Antonio
supervisor Prof. Gennaro, Claudio
supervisor Prof. Falchi, Fabrizio
supervisor Dr. Messina, Nicola

Keywords

aladin
computer vision
cross-modality
deep learning
image-text retrieval
multi-modality
nlp
pytorch
transformer

Defense session start date

17/02/2023

Availability

Full

Abstract

Human beings experience the world in a multi-modal manner. We form thoughts by combining pieces of information about the objects we see, the sounds we hear, the tactile sensations we feel, the odors we smell, and so on. In recent years, progress in deep learning techniques has made machines more capable of understanding the meaning of texts, images, audio, and videos. By understanding the hidden semantic connections between these different types of unstructured data, we can process this information jointly to approach multi-modal problems, resembling what humans do in everyday life. This thesis focuses on the joint processing of images and natural language sentences. In particular, we study the technologies behind cross-modal retrieval models that operate between these two types of information. We explore new combinations of technologies and techniques to improve the results obtained by ALADIN, a cross-modal image-text retrieval model that achieves performance close to that of its competitors, the large Vision-Language Transformers, while being 90 times faster.

By introducing some modifications to the visual pipeline in the backbone of the architecture, we were able to improve the model's performance. In particular, we improved the results presented in the original paper on the recall@k metric for the alignment head, the head of the model that aligns image and text representations in a fine-grained manner. On the MS COCO dataset, we improved the rsum, the sum of the recall@k values for the chosen values of k (1, 5, and 10), by 0.8 points on the 1K test set and by 3.5 points on the 5K test set.
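
As an illustration of the metrics mentioned above, the following is a minimal sketch of how recall@k and rsum could be computed from ranked retrieval results. It is not taken from the thesis code. The function names, the toy rankings, and the choice to sum over both retrieval directions (image-to-text and text-to-image) are assumptions made for this example; the abstract itself only states that rsum is the sum of recall@k for k = 1, 5, and 10.

```python
from typing import List, Sequence, Set, Tuple

# One retrieval task: for each query, a ranked list of candidate ids
# and the set of ids considered relevant to that query.
Task = Tuple[Sequence[Sequence[int]], Sequence[Set[int]]]


def recall_at_k(ranked: Sequence[Sequence[int]],
                relevant: Sequence[Set[int]],
                k: int) -> float:
    """Percentage of queries with at least one relevant item in the top k."""
    hits = sum(
        1
        for candidates, rel in zip(ranked, relevant)
        if any(item in rel for item in candidates[:k])
    )
    return 100.0 * hits / len(ranked)


def rsum(tasks: List[Task], ks: Sequence[int] = (1, 5, 10)) -> float:
    """Sum of recall@k over the given tasks and k values.

    Assumption: the tasks are the two retrieval directions; the abstract
    does not spell this out.
    """
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in tasks for k in ks)


# Toy data (hypothetical): 2 images, 4 captions, 2 captions per image.
i2t_ranked = [[0, 2, 1, 3], [0, 2, 3, 1]]      # captions ranked per image
i2t_relevant = [{0, 1}, {2, 3}]
t2i_ranked = [[0, 1], [1, 0], [1, 0], [1, 0]]  # images ranked per caption
t2i_relevant = [{0}, {0}, {1}, {1}]

toy_rsum = rsum([(i2t_ranked, i2t_relevant), (t2i_ranked, t2i_relevant)])
print(toy_rsum)  # 525.0
```

With two directions and three k values, rsum under this assumption has a maximum of 600, so a gain of a few points corresponds to small but consistent improvements across the individual recall@k values.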

The code to reproduce our results is available at https://github.com/lorebianchi98/ALADIN-2.0.

File

File name	Size
BIANCHI_LORENZO.pdf	4.02 Mb

ETD

Digital archive of theses defended at the Università di Pisa
