ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-06302025-122029


Thesis type
Master's thesis
Author
PELLEGRINO, CLETO
URN
etd-06302025-122029
Title
Developing Multimodal Deep Learning Models for Technical Documents Representation and Retrieval
Department
INGEGNERIA DELL'INFORMAZIONE
Degree programme
ARTIFICIAL INTELLIGENCE AND DATA ENGINEERING
Supervisors
supervisor Prof. Cimino, Mario Giovanni Cosimo Antonio
supervisor Prof. Galatolo, Federico Andrea
supervisor Consoloni, Marco
Keywords
  • CLIP
  • embeddings
  • information retrieval
  • multimodal retrieval
  • technical documents
  • technical drawings
Graduation session start date
23/07/2025
Availability
Not available for consultation
Release date
23/07/2095
Abstract
Existing approaches to information retrieval on technical documents have overlooked the potential of analyzing technical drawings together with textual data to improve the accuracy of prior art searches. Integrating visual and textual information to bridge the gap between visual perception and language understanding can significantly support the Engineering Design (ED) process. The ED process is the series of steps that engineers follow to find a solution to a problem; these steps include problem-solving activities such as determining objectives and constraints, prototyping, testing, and evaluation. Images provide a concise representation of a design artifact or process, and they are the primary mode of communication among innovators, engineers, and designers throughout the ED phases. Previous works did not perform tasks such as image rotation or segmentation, nor did they specialize in a specific domain, which increases the complexity of the input given to the retrieval system and makes problems harder to diagnose when they arise. The proposed solution comprises a full dataset creation pipeline that cleans the input and creates descriptions aligning images with sentences in the document text. A model architecture specialized for the given problem is also defined. The results show that performing retrieval with embeddings generated by a CLIP-like model may not be enough, and that more complex information retrieval pipelines are needed to handle technical documents.
File