Digital archive of theses discussed at the University of Pisa


Thesis etd-04262022-181135

Thesis type
PhD thesis
Thesis title
Relational Learning in Computer Vision
Academic discipline
Course of study
tutor Dr. Falchi, Fabrizio
tutor Dr. Amato, Giuseppe
tutor Prof. Avvenuti, Marco
Keywords
  • computer vision
  • deep learning
  • information retrieval
  • neural networks
  • vision and language
Graduation session start date
The increasing interest in social networks, smart cities, and Industry 4.0 is encouraging the development of techniques for processing, understanding, and organizing vast amounts of data. Recent advances in Artificial Intelligence gave rise to Deep Learning, a subfield of Machine Learning that learns common patterns directly from raw data, without relying on manual feature selection. This framework has transformed many areas of computer science, such as Computer Vision and Natural Language Processing, with remarkable results. Nevertheless, many challenges remain open. Although deep neural networks achieve impressive results on many tasks, they cannot perform non-local processing by explicitly relating potentially interconnected visual or textual entities. This relational aspect is fundamental for capturing high-level semantic interconnections in multimedia data and for understanding the relationships between spatially distant objects in an image.

This thesis tackles the problem of relational understanding in deep neural networks, considering three different yet related tasks. First, we introduce a challenging variant of the Content-Based Image Retrieval (CBIR) task, called Relational CBIR (R-CBIR). In R-CBIR, we aim to retrieve images that also exhibit similar relationships among the multiple objects they contain. We define architectures able to extract relationship-aware visual descriptors, and we extend the CLEVR synthetic dataset to obtain a suitable ground truth for evaluating R-CBIR. Then, we move a step further, considering real-world images and focusing on cross-modal visual-textual retrieval. We use the Transformer Encoder, a recently introduced module based on self-attention, to relate different sentence words and image regions, with large-scale retrieval as the main goal. We show that the resulting features carry very high-level semantics and outperform current image descriptors on the challenging Semantic CBIR task. We then propose solutions for scaling the search to possibly millions of images or texts, and we deploy the developed networks in VISIONE, a large-scale interactive video retrieval system developed in our laboratory. Remaining within the multi-modal Transformer framework, we tackle another critical task on the modern Internet: detecting persuasion techniques in memes spread on social networks during disinformation campaigns. Finally, we probe current state-of-the-art CNNs on challenging visual reasoning benchmarks that require non-local spatial comparisons. After analyzing the drawbacks of CNNs on these tasks, we propose a hybrid CNN-Transformer architecture that constrains model complexity and achieves higher data efficiency.