ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-04262022-181135


Thesis type
PhD thesis
Author
MESSINA, NICOLA
URN
etd-04262022-181135
Title
Relational Learning in Computer Vision
Scientific-disciplinary sector
ING-INF/05
Course of study
INFORMATION ENGINEERING
Supervisors
tutor Dr. Falchi, Fabrizio
tutor Dr. Amato, Giuseppe
tutor Prof. Avvenuti, Marco
Keywords
  • deep learning
  • computer vision
  • neural networks
  • information retrieval
  • vision and language
Date of defense
03/05/2022
Availability
Full
Abstract
The increasing interest in social networks, smart cities, and Industry 4.0 is encouraging the development of techniques for processing, understanding, and organizing vast amounts of data. Recent important advances in Artificial Intelligence brought to life a subfield of Machine Learning called Deep Learning, which can automatically learn common patterns directly from raw data, without relying on manual feature selection. This framework revolutionized many computer science fields, such as Computer Vision and Natural Language Processing, achieving astonishing results. Nevertheless, many challenges are still open. Although deep neural networks have obtained impressive results on many tasks, they cannot perform non-local processing by explicitly relating potentially interconnected visual or textual entities. This relational aspect is fundamental for capturing high-level semantic interconnections in multimedia data or for understanding the relationships between spatially distant objects in an image.

This thesis tackles the relational understanding problem in Deep Neural Networks, considering three different yet related tasks. First, we introduce a challenging variant of the Content-Based Image Retrieval (CBIR) task, called Relational CBIR (R-CBIR). In R-CBIR, we aim to retrieve images that also exhibit similar relationships among the objects they contain. We define architectures able to extract relationship-aware visual descriptors, and we extend the CLEVR synthetic dataset to obtain a suitable ground truth for evaluating R-CBIR. We then move a step further, considering real-world images and focusing on cross-modal visual-textual retrieval. We use the Transformer Encoder, a recently introduced module based on self-attention, to relate sentence words and image regions, with large-scale retrieval as the main goal. We show that the obtained features carry very high-level semantics and outperform current image descriptors on the challenging Semantic CBIR task. We then propose solutions for scaling the search to potentially millions of images or texts, and we deploy the developed networks in VISIONE, a large-scale interactive video retrieval system developed in our laboratory. Sticking to the multi-modal Transformer framework, we tackle another critical task on the modern Internet: detecting persuasion techniques in memes spread on social networks during disinformation campaigns. Finally, we probe current state-of-the-art CNNs on challenging visual reasoning benchmarks that require non-local spatial comparisons. After analyzing the drawbacks of CNNs on these tasks, we propose a hybrid CNN-Transformer architecture that constrains model complexity and achieves higher data efficiency.
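To make the cross-modal retrieval idea concrete, the following is a minimal PyTorch sketch of the general pattern the abstract describes: Transformer Encoders relate word tokens and image regions within each modality, the outputs are pooled into a shared embedding space, and retrieval reduces to cosine similarity. This is an illustrative sketch only, not the thesis architecture; all names, dimensions, and the mean-pooling choice are assumptions made for clarity.

```python
# Illustrative sketch: self-attention over regions/words, pooled into a
# shared space for similarity-based retrieval. Not the thesis model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Projects input features and relates them with self-attention."""
    def __init__(self, in_dim: int, d_model: int = 256, n_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.proj(x))   # (batch, seq, d_model)
        h = h.mean(dim=1)                # pool regions/tokens (assumption)
        return F.normalize(h, dim=-1)    # unit norm for cosine similarity

# Hypothetical feature sizes: 2048-d region features (e.g. from an object
# detector) and 300-d word embeddings.
image_enc = ModalityEncoder(in_dim=2048)
text_enc = ModalityEncoder(in_dim=300)

regions = torch.randn(8, 36, 2048)  # 8 images, 36 regions each
words = torch.randn(8, 20, 300)     # 8 captions, 20 tokens each

img_emb = image_enc(regions)
txt_emb = text_enc(words)

# Entry (i, j) scores image i against caption j.
scores = img_emb @ txt_emb.t()      # (8, 8) cosine similarities
```

Because each modality is encoded independently here, image embeddings can be pre-computed and indexed offline, which is what makes this family of models amenable to the large-scale search scenarios the abstract mentions.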