
ETD

Digital archive of theses discussed at the University of Pisa


Thesis etd-01272023-171329


Thesis type
Master's degree thesis (tesi di laurea magistrale)
Author
BIANCHI, LORENZO
URN
etd-01272023-171329
Thesis title
Design and development of cross-modal retrieval techniques based on transformer architectures
Department
INGEGNERIA DELL'INFORMAZIONE
Course of study
ARTIFICIAL INTELLIGENCE AND DATA ENGINEERING
Supervisors
Supervisor: Prof. Cimino, Mario Giovanni Cosimo Antonio
Supervisor: Prof. Gennaro, Claudio
Supervisor: Prof. Falchi, Fabrizio
Supervisor: Dr. Messina, Nicola
Keywords
  • aladin
  • computer vision
  • cross-modality
  • deep learning
  • image-text retrieval
  • multi-modality
  • nlp
  • pytorch
  • transformer
Graduation session start date
17/02/2023
Availability
Full
Summary
Human beings experience the world in a multi-modal manner. We form thoughts by combining pieces of information about the objects we see, the sounds we hear, the tactile sensations we feel, the odors we smell, and so on. In recent years, progress in deep learning has made machines far more capable of understanding the meaning of texts, images, audio, and videos. By uncovering hidden semantic connections between these different types of unstructured data, we can process this information jointly to approach multi-modal problems, resembling what humans do in everyday life. This thesis focuses on the joint processing of images and natural-language sentences. In particular, we study the technologies behind cross-modal retrieval models for these two types of information. We exploit new combinations of technologies and techniques to improve the results obtained by ALADIN, a cross-modal image-text retrieval model whose performance approaches that of its competitors, the large Vision-Language Transformers, while being 90 times faster.

By introducing some modifications to the visual pipeline in the backbone of the architecture, we were able to improve the model's performance. In particular, we improved the results reported in the original paper on the recall@k metric for the alignment head, the head of the model that aligns the image and text representations in a fine-grained manner. On the MS COCO dataset, we improved the rsum, the sum of the recall@k values for the chosen values of k (1, 5, and 10), by 0.8 points on the 1K test set and by 3.5 points on the 5K test set.
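For reference, the evaluation metrics mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's evaluation code: it assumes the common convention that query i's ground-truth match is gallery item i (the diagonal of the similarity matrix), and that rsum sums recall@{1,5,10} over both retrieval directions (image-to-text and text-to-image); the function names are our own.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Percentage of queries whose ground-truth item (assumed to be the
    diagonal pairing: query i <-> gallery item i) ranks in the top k."""
    n = sim.shape[0]
    ranks = np.argsort(-sim, axis=1)          # gallery indices, best first
    hits = (ranks[:, :k] == np.arange(n)[:, None]).any(axis=1)
    return 100.0 * hits.mean()

def rsum(sim: np.ndarray, ks=(1, 5, 10)) -> float:
    """Sum of recall@k over both directions: rows as queries (e.g.
    image-to-text) and columns as queries (text-to-image)."""
    return (sum(recall_at_k(sim, k) for k in ks)
            + sum(recall_at_k(sim.T, k) for k in ks))
```

With three perfectly matched pairs (each diagonal entry dominates its row and column), every recall@k is 100, so rsum reaches its maximum of 600 for three k values over two directions.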

The code to reproduce our results is available at https://github.com/lorebianchi98/ALADIN-2.0.