
ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-02112025-171657


Thesis type
Master's thesis
Author
AIELLO, ANNACHIARA
URN
etd-02112025-171657
Title
Evaluating Automatic Speech Recognition Performance on Non-Native Italian Speech
Department
COMPUTER SCIENCE
Degree programme
COMPUTER SCIENCE
Supervisors
supervisor Cucchiarini, Catia
supervisor Strik, Helmer
supervisor Bacciu, Davide
Keywords
  • automatic speech recognition
  • non-native Italian
  • word error rate
Exam session start date
28/02/2025
Availability
Thesis not available for consultation
Abstract
The present study evaluates automatic speech recognition (ASR) performance on a group of non-native Italian corpora whose speech types are as closely related as possible, and compares the results with native benchmarks that have similar characteristics. Exploring the factors that affect ASR performance on non-native speech ultimately aims to identify potential avenues for improving it. Based on the available non-native corpora, the study addresses the following Research Questions (RQs):
RQ1: How does the end-to-end ASR model WhisperX handle semi-spontaneous native Italian in terms of word error rate (WER)?
RQ2: How does this model perform on semi-spontaneous non-native Italian in terms of WER?
RQ3: How does WER vary according to the speakers' native language families and proficiency levels?
The experiments involve three WhisperX models (large-v1, large-v2, and large-v3), tested zero-shot on individual corpora and on corpora aggregated by the available proficiency levels. The results reveal that even a robust model like WhisperX faces more challenges with non-native Italian speech than with native Italian speech. These difficulties generally increase as proficiency levels decrease, and they are more pronounced for speakers whose native languages belong to the Germanic family than for those whose native languages belong to the Romance family.
On the one hand, evaluating ASR performance on combinations of diverse corpora may be seen as a limitation; on the other hand, this approach is more ecologically valid, as it reflects the complex real-world scenarios in which non-native speech is used.
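
For readers unfamiliar with the metric named in the research questions, WER is conventionally computed as (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the reference transcript into the ASR output, and N is the number of reference words. The following is a minimal sketch of that standard definition, not the thesis's actual evaluation pipeline. The function names, the lowercase-and-split normalization, the pooling of errors across utterances, and the grouping labels are all assumptions made for illustration.

```python
from collections import defaultdict


def word_edit_distance(reference, hypothesis):
    """Minimum number of word substitutions, deletions, and insertions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def wer(pairs):
    """Pooled WER: total word errors divided by total reference words."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        # Assumed normalization: lowercase and split on whitespace only.
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors += word_edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("reference contains no words")
    return errors / ref_words


def wer_by_group(utterances):
    """Score (label, reference, hypothesis) triples separately per label."""
    groups = defaultdict(list)
    for label, reference, hypothesis in utterances:
        groups[label].append((reference, hypothesis))
    return {label: wer(pairs) for label, pairs in groups.items()}
```

Under these assumptions, `wer_by_group` could be called with labels such as a proficiency level or a native language family to obtain one pooled WER per group. Pooling weights every reference word equally, whereas averaging per-utterance WER would weight every utterance equally; the abstract does not say which convention the thesis uses.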