Thesis etd-11092024-135954
Thesis type
Master's thesis
Author
ASANTE, MICHAEL
URN
etd-11092024-135954
Title
Multimodal Speech Recognition for Improved Transcription Accuracy
Department
INGEGNERIA DELL'INFORMAZIONE
Degree programme
ARTIFICIAL INTELLIGENCE AND DATA ENGINEERING
Supervisors
supervisor Prof. Cimino, Mario Giovanni Cosimo Antonio
supervisor Prof. Galatolo, Federico Andrea
supervisor Prof. Cominelli, Lorenzo
Keywords
- CNN
- humanoid robots
- lipreading
- RHRs
- speech articulation
Graduation session start date
26/11/2024
Availability
Not available for consultation
Release date
26/11/2027
Abstract
This thesis presents a novel multi-modal approach that combines audio-based speech recognition with visual lip reading to improve speech understanding in challenging acoustic environments. By leveraging key techniques in deep learning and computer vision, I demonstrate how visual information from lip movements can complement traditional audio-based speech recognition systems. The approach uses a three-stage pipeline: (1) audio processing with noise-robust speech-to-text conversion, (2) visual lip reading through a trained Convolutional Neural Network (CNN), and (3) an intelligent fusion of both modalities using large language models to produce accurate transcriptions.
Experimental results on a dataset of 10 diverse speech samples show that this multi-modal approach achieves higher recognition accuracy than either modality alone, particularly in environments with significant background noise. This thesis thus contributes to the broader goal of creating more robust and adaptable communication interfaces for humanoid robots, ultimately improving their ability to interact naturally with humans in real-world settings.
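The three-stage pipeline named in the abstract can be pictured roughly as follows. This is only a minimal sketch, not the thesis implementation, which is not available for consultation. The names `Hypothesis`, `build_fusion_prompt`, and `transcribe`, the prompt wording, and the three callable interfaces are all assumptions made for illustration. Each stage is passed in as a plain function, so any speech-to-text model, lip-reading CNN, or language model could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    """One modality's transcript guess (hypothetical structure)."""
    source: str  # "audio" or "lips"
    text: str


def build_fusion_prompt(audio: Hypothesis, lips: Hypothesis) -> str:
    """Build the stage 3 input: ask a language model to reconcile both guesses.

    The prompt wording is an assumption, not taken from the thesis.
    """
    return (
        "Two noisy transcripts of the same utterance follow.\n"
        f"[{audio.source}] {audio.text}\n"
        f"[{lips.source}] {lips.text}\n"
        "Reply with the single most plausible transcript."
    )


def transcribe(
    clip_path: str,
    speech_to_text: Callable[[str], str],  # stage 1: audio model (assumed interface)
    read_lips: Callable[[str], str],       # stage 2: lip-reading CNN (assumed interface)
    complete: Callable[[str], str],        # stage 3: language model call (assumed interface)
) -> str:
    """Run both modalities on one clip and let the language model fuse them."""
    audio = Hypothesis("audio", speech_to_text(clip_path))
    lips = Hypothesis("lips", read_lips(clip_path))
    return complete(build_fusion_prompt(audio, lips)).strip()


# Usage with stand-in functions; real models would replace the lambdas.
fused = transcribe(
    "clip.mp4",
    speech_to_text=lambda path: "hello word",
    read_lips=lambda path: "hello world",
    complete=lambda prompt: " hello world\n",
)
```

Passing the stages in as callables keeps the sketch independent of any particular model or library, which matters here because the abstract does not name the ones actually used.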
File
File name | Size |
---|---|
The thesis is not available for consultation. |