
ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-07042023-172612


Thesis type
Master's thesis
Author
AMARANTE, TOMMASO
URN
etd-07042023-172612
Title
Design and Development of a System for Counting-Related Visual Question Answering
Department
INGEGNERIA DELL'INFORMAZIONE
Degree programme
ARTIFICIAL INTELLIGENCE AND DATA ENGINEERING
Supervisors
Supervisor Prof. Cimino, Mario Giovanni Cosimo Antonio
Supervisor Prof. Falchi, Fabrizio
Supervisor Dott. Messina, Nicola
Supervisor Dott. Ciampi, Luca
Keywords
  • visual question answering
  • benchmark
  • dataset
  • countingVQA
Defense date
21/07/2023
Availability
Full
Abstract
Riassunto
The challenging task of Visual Question Answering (VQA) requires a thorough comprehension of both visual content and natural language. Open-ended counting is a special case of VQA in which the goal is to answer specific, and possibly complex, questions about the number of objects present in images. However, even though counting is essential in many real-world applications, the development and assessment of counting algorithms within the VQA domain are limited by the scarcity of dedicated annotations for counting-related questions in existing VQA datasets. To fill this gap, in this dissertation we present ObjectCountingVQA, a new dataset focused on the CountingVQA task. This benchmark comprises more than 2000 images with more than 5500 associated question-answer pairs. A distinctive feature of ObjectCountingVQA is that, compared to other benchmarks, it contains more complex questions involving adjectives and spatial relationships, which provides a challenging setup for current VQA algorithms. We build these question-answer pairs automatically using ChatGPT, a popular artificial intelligence chatbot developed by OpenAI, starting from the structured scene graphs of Visual Genome, a popular dataset for object detection and visual understanding. Then, we use GroundingDINO, a powerful open-set object detector, to automatically validate the pairs and perform a first selection of good candidates. Finally, to ensure that all the questions and answers are accurate enough to serve as a reliable benchmark, we manually check the generated annotations.
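The core idea of the generation step can be sketched in miniature as follows. This is an illustrative toy example, not the thesis code: the actual questions are produced by ChatGPT rather than a template, the GroundingDINO validation stage is omitted, and the scene-graph fields used here (`objects`, `name`, `attributes`) are simplified assumptions rather than the exact Visual Genome schema.

```python
# Toy sketch: derive a counting question-answer pair from a Visual
# Genome-style scene graph by counting objects that match a target
# name and, optionally, an attribute (e.g. a color adjective).

def make_counting_qa(scene_graph, name, attribute=None):
    """Return a (question, answer) pair counting objects in `scene_graph`
    whose name matches `name` and, if given, carry `attribute`."""
    matches = [
        obj for obj in scene_graph["objects"]
        if obj["name"] == name
        and (attribute is None or attribute in obj.get("attributes", []))
    ]
    # Naive pluralization ("car" -> "cars"); a real pipeline would
    # handle irregular plurals.
    subject = f"{attribute} {name}s" if attribute else f"{name}s"
    return f"How many {subject} are in the image?", len(matches)

# Hypothetical scene graph with two cars (one red, one blue) and a person.
graph = {
    "objects": [
        {"name": "car", "attributes": ["red"]},
        {"name": "car", "attributes": ["blue"]},
        {"name": "person", "attributes": []},
    ]
}
print(make_counting_qa(graph, "car", "red"))
# → ("How many red cars are in the image?", 1)
```

Attribute-qualified queries like the one above are what make such questions harder than plain counting: the model must ground both the category and the adjective before it can count.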
To demonstrate the potential of our ObjectCountingVQA dataset, we conduct an experimental evaluation using a state-of-the-art VQA model, MOVIE+MCAN. Our findings show that this newly introduced benchmark poses fresh challenges for current VQA models, underlining the need for dedicated counting methods and more demanding benchmarks.