Thesis etd-01272025-115255
Thesis type
PhD thesis
Author
BALDRATI, ALBERTO
URN
etd-01272025-115255
Title
From Retrieval to Generation: Multimodal Models for Vision and Language Tasks
Scientific disciplinary sector
INF/01 - Computer Science
Degree program
National PhD in Artificial Intelligence
Supervisors
Tutor: Prof. Bertini, Marco
Supervisor: Prof. Bagdanov, Andrew David
Keywords
- artificial intelligence
- computer vision
- fashion image generation
- image retrieval
- virtual try-on
- vision-language models
Defense date
19/02/2025
Availability
Full
Abstract
This thesis explores the integration of visual and textual data across various tasks, from image retrieval to image generation. To address discriminative and generative tasks, we examine two types of multimodal models: Vision-Language Models (VLMs) for retrieval and classification, and text-to-image diffusion models for generative problems.
We begin by tackling the task of supervised Composed Image Retrieval (CIR), where the goal is to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. In this context, we propose a two-stage approach that adapts the CLIP model to CIR, achieving state-of-the-art results on the benchmark datasets FashionIQ and CIRR. Building on this, we then introduce a new task called Zero-Shot CIR (ZS-CIR), which aims to address CIR without requiring a labeled training dataset. The proposed method, iSEARLE (improved zeroShot composEd imAge Retrieval with textuaL invErsion), sets a new standard for ZS-CIR. Additionally, we support further research in this area by introducing the CIRCO dataset, the first CIR dataset to include multiple ground-truth images for each query, enabling a more comprehensive evaluation.
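The high-level recipe behind CLIP-based CIR can be pictured in a few lines. Below is a minimal PyTorch sketch of the late-fusion idea: a small combiner network fuses the reference-image and relative-caption features into a single query vector, which is matched against the gallery by cosine similarity. The combiner architecture and all feature tensors here are illustrative stand-ins, not the thesis's exact two-stage design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Combiner(nn.Module):
    """Fuses a CLIP reference-image feature with a relative-caption
    feature into one query vector for composed image retrieval.
    Illustrative architecture, not the one proposed in the thesis."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        query = self.fuse(torch.cat([img_feat, txt_feat], dim=-1))
        return F.normalize(query, dim=-1)

# Toy usage with random stand-ins for frozen CLIP features.
combiner = Combiner(dim=512)
ref_img = F.normalize(torch.randn(1, 512), dim=-1)    # CLIP(reference image)
rel_cap = F.normalize(torch.randn(1, 512), dim=-1)    # CLIP("is red instead of blue")
gallery = F.normalize(torch.randn(100, 512), dim=-1)  # CLIP features of candidate images

query = combiner(ref_img, rel_cap)
ranking = (query @ gallery.T).argsort(dim=-1, descending=True)  # cosine-similarity ranking
print(ranking[0, :5])  # indices of the top-5 retrieved images
```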
Following this, we explore the few-shot adaptation of VLMs by introducing KDPL (Knowledge Distillation Prompt Learning), a parameter-efficient, unsupervised prompt learning method based on knowledge distillation. KDPL can be seamlessly integrated into existing prompt learning techniques, with experiments demonstrating significant improvements in zero-shot generalization across more than ten benchmark datasets.
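To make the distillation idea concrete, here is a minimal, self-contained sketch of unsupervised prompt learning by knowledge distillation: only a small set of soft prompt tokens is trained, with a KL-divergence loss aligning the student's class distribution to a teacher's on unlabeled images. The encoders are replaced by random stand-ins, and all names and hyperparameters are assumptions rather than KDPL's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy distillation loop: only the soft prompt tokens are optimized so the
# student's image-to-class logits match a stronger teacher's. Encoder
# outputs are random stand-ins for frozen CLIP-style features.

dim, n_cls, n_ctx = 512, 10, 4
ctx = nn.Parameter(0.02 * torch.randn(n_ctx, dim))           # learnable prompt context
cls_embed = torch.randn(n_cls, dim)                          # frozen class-name embeddings
teacher_text = F.normalize(torch.randn(n_cls, dim), dim=-1)  # frozen teacher text features

def student_text_features():
    # Stand-in for the frozen text encoder applied to [ctx + class name].
    return F.normalize(ctx.mean(0) + cls_embed, dim=-1)

opt = torch.optim.AdamW([ctx], lr=2e-3)
for step in range(100):
    img = F.normalize(torch.randn(32, dim), dim=-1)  # frozen image features, unlabeled batch
    student_logits = 100.0 * img @ student_text_features().T
    with torch.no_grad():                            # a real teacher uses its own encoders
        teacher_logits = 100.0 * img @ teacher_text.T
    # KL between softened class distributions: no ground-truth labels needed.
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1), reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
```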
Next, the focus shifts to generative tasks in the fashion domain. We propose a latent diffusion model for multimodal fashion image editing, capable of generating realistic, human-centric fashion images from diverse inputs such as text descriptions, body poses, sketches, and fabric textures. To address the lack of appropriate datasets, we extend the Dress Code and VITON-HD datasets with multimodal annotations.
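A common way to feed such heterogeneous inputs to a latent diffusion denoiser is to concatenate spatial conditions (e.g., pose maps and sketches) channel-wise with the noisy latent, while text and texture embeddings enter through cross-attention. The toy denoiser below sketches only that wiring, under those assumptions; it is not the architecture proposed in the thesis.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Schematic multimodal conditioning: spatial conditions are stacked
    with the noisy latent; text/texture tokens condition via cross-attention."""

    def __init__(self, z_ch=4, cond_ch=4, txt_dim=512):
        super().__init__()
        self.conv_in = nn.Conv2d(z_ch + cond_ch, 64, 3, padding=1)
        self.attn = nn.MultiheadAttention(64, num_heads=4, kdim=txt_dim,
                                          vdim=txt_dim, batch_first=True)
        self.conv_out = nn.Conv2d(64, z_ch, 3, padding=1)

    def forward(self, z_t, spatial_cond, txt_tokens):
        h = self.conv_in(torch.cat([z_t, spatial_cond], dim=1))  # channel-wise fusion
        b, c, hh, ww = h.shape
        seq = h.flatten(2).transpose(1, 2)                # (B, H*W, C)
        seq, _ = self.attn(seq, txt_tokens, txt_tokens)   # cross-attend to text tokens
        h = h + seq.transpose(1, 2).reshape(b, c, hh, ww)
        return self.conv_out(h)                           # predicted noise

z_t = torch.randn(1, 4, 32, 32)              # noisy image latent
pose_and_sketch = torch.randn(1, 4, 32, 32)  # encoded pose map + garment sketch
txt = torch.randn(1, 77, 512)                # text / texture embeddings
eps_hat = ToyDenoiser()(z_t, pose_and_sketch, txt)
```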
Finally, we further explore generative models within the fashion domain by tackling the virtual try-on task. We introduce LaDI-VTON, the first diffusion-based model for this task, which employs textual inversion to preserve garment texture details during image generation, offering significant improvements over existing methods.
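The textual-inversion component can be pictured as optimizing a pseudo-word embedding until the frozen text encoder's output aligns with the garment's visual features, so the garment can be referenced inside the text conditioning of the diffusion model. The sketch below substitutes a random projection for the frozen encoders; every name in it is hypothetical.

```python
import torch
import torch.nn.functional as F

# Core idea of textual inversion: learn a pseudo-word embedding V* whose
# encoded text representation lands close to the garment's visual features.
# The frozen encoders are replaced by random stand-ins for illustration.

dim = 512
garment_feat = F.normalize(torch.randn(1, dim), dim=-1)  # frozen image-encoder output
pseudo_token = torch.randn(1, dim, requires_grad=True)   # learnable V* embedding
text_proj = torch.randn(dim, dim) / dim ** 0.5           # stand-in frozen text encoder

opt = torch.optim.AdamW([pseudo_token], lr=1e-2)
for step in range(200):
    txt_feat = F.normalize(pseudo_token @ text_proj, dim=-1)
    loss = 1.0 - (txt_feat * garment_feat).sum()         # cosine distance
    opt.zero_grad(); loss.backward(); opt.step()

# The optimized pseudo_token can then be spliced into the prompt embeddings
# that condition the try-on diffusion model, helping preserve texture details.
```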
File
File name | Size
---|---
AlbertoB...lPdfa.pdf | 113.56 MB