ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-01232024-160931


Thesis type
Master's thesis
Author
PATIMO, ANTONIO
URN
etd-01232024-160931
Title
Development of a robust turn-taking management system for conversational agents
Department
INGEGNERIA DELL'INFORMAZIONE
Degree programme
ARTIFICIAL INTELLIGENCE AND DATA ENGINEERING
Supervisors
supervisor Prof. Cimino, Mario Giovanni Cosimo Antonio
supervisor Prof. Galatolo, Federico Andrea
supervisor Prof. Cominelli, Lorenzo
supervisor Prof. Greco, Alberto
Keywords
  • turn-taking
  • human-robot interaction
  • voice activity detection
  • end of turn detection
Defence session start date
13/02/2024
Availability
Not available for consultation
Release date
13/02/2094
Abstract
In this thesis we presented the design and implementation of a remote, real-time, noise-robust turn-taking management system. Turn-taking protocols are fundamental to many social interactions and activities. These protocols define who is speaking or performing a certain action in a given period of time, while another person or agent waits for that turn to complete before starting their own. Humans are generally very good at this type of coordination: they fluently understand who is speaking, when the next speaker should start, when to stop, and whose turn comes next, all with very small gaps and little overlap. By contrast, non-human conversational agents such as voice assistants or social robots have trouble managing these protocols, which leads to frequent misunderstandings, interruptions and long response delays. What we take for granted is a very difficult task for this type of agent, which needs cues to understand when it is its turn to speak or who the next speaker is. One of the most important cues that turn-taking protocols consider for the end of a speaker's turn is silence, detected in particular through voice activity detection (VAD). The goal of this thesis is to implement a remote, real-time, noise-robust end-of-turn detection module based on a neural voice activity detector, for use inside the Abel android, a hyperrealistic humanoid robot used as a research platform for various AI applications at the E. Piaggio research lab of the University of Pisa. Before this system, Abel used a turn-taking module that reacted only to silence rather than to the absence of speech, and so performed poorly in noisy scenarios. The idea was to develop a turn-taking module that is robust to noise and can be deployed remotely and integrated with the rest of the android's software architecture.

First we presented a review of the current literature on the main methods and models used for end-of-turn detection, with their pros and cons. Then we presented the implementation of our silence-based solution. Lastly, we validated our solution by creating a novel dataset that met our test requirements, and we compared the performance of the system against a variant using a different VAD module, the one integrated in WebRTC.

Our solution uses state-of-the-art WebRTC technology to reliably transmit audio in real time from the speaker to the remote server. The Python server, built on the aiortc library, processes the incoming audio frames, which are analyzed by the Silero VAD model to detect the presence of voice. A turn-based system then manages the state of the current turn and publishes state updates in real time over WebSockets to the external environment, in particular to the speech elaboration module of the Abel android.
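
The turn-management step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the class name `TurnTracker`, the three states, the 0.5 threshold, the 32 ms frame length and the 640 ms end-of-turn window are all assumptions chosen for illustration. The sketch only assumes that some VAD (for example Silero VAD) yields one speech probability per audio frame, and it leaves out the WebRTC transport and the WebSocket notification.

```python
from dataclasses import dataclass
from enum import Enum


class TurnState(Enum):
    IDLE = "idle"          # no speech heard yet
    SPEAKING = "speaking"  # speech detected, turn in progress
    ENDED = "ended"        # enough non-speech followed the speech


@dataclass
class TurnTracker:
    # All defaults are illustrative assumptions, not the thesis settings.
    speech_threshold: float = 0.5  # probability at or above which a frame is speech
    frame_ms: int = 32             # duration of one audio frame
    end_of_turn_ms: int = 640      # non-speech required after speech to end the turn
    state: TurnState = TurnState.IDLE
    _quiet_ms: int = 0             # non-speech accumulated since the last speech frame

    def update(self, speech_prob: float) -> TurnState:
        """Consume the speech probability of one frame and return the new state."""
        if speech_prob >= self.speech_threshold:
            # Any speech frame (re)starts a turn and resets the quiet counter.
            self.state = TurnState.SPEAKING
            self._quiet_ms = 0
        elif self.state is TurnState.SPEAKING:
            self._quiet_ms += self.frame_ms
            if self._quiet_ms >= self.end_of_turn_ms:
                self.state = TurnState.ENDED
        return self.state


# Two speech frames followed by 20 non-speech frames (20 * 32 ms = 640 ms).
tracker = TurnTracker()
states = [tracker.update(p) for p in [0.9, 0.8] + [0.05] * 20]
print(states[-1])  # TurnState.ENDED
```

Because the decision is driven by speech probability rather than by signal energy, background noise that the VAD scores as non-speech still counts towards the end-of-turn window, which is the behaviour the abstract contrasts with the earlier silence-only module.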

In the Experiments and Results section we analyzed the performance of the system both in terms of latency (inference and end-of-turn detection) and in terms of accuracy, in particular that of the Silero VAD model, which was compared with the WebRTC VAD. The first analysis showed that the system has a detection latency compatible with the intended use case, while the second showed that the system has good detection capabilities even in noisy scenarios and that the Silero VAD model outperforms the WebRTC VAD, performing better as the signal-to-noise ratio (SNR) decreases, that is, in noisier scenarios.
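
The abstract does not say how the noisy test conditions were produced, so the following sketch only illustrates what a given SNR means in practice. It uses the standard definition, SNR in dB = 10 * log10(speech power / noise power), and scales a noise signal so that the mixture reaches a requested SNR. The function names and the synthetic sine and Gaussian signals are hypothetical stand-ins, not material from the thesis dataset.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches `snr_db`, then add it."""
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixed):
    """Recover the SNR of a mixture when the clean speech is known."""
    residual = [m - s for s, m in zip(speech, mixed)]
    return 10 * math.log10(power(speech) / power(residual))


# Synthetic stand-ins: a 200 Hz tone as "speech", Gaussian noise as "background".
random.seed(0)
speech = [math.sin(2 * math.pi * 200 * t / 16000) for t in range(16000)]
noise = [random.gauss(0.0, 1.0) for _ in range(16000)]

mixed = mix_at_snr(speech, noise, snr_db=5.0)
print(round(measured_snr_db(speech, mixed), 3))  # 5.0
```

Lower `snr_db` values give louder noise relative to the speech, which is the regime where the abstract reports the largest difference between the two VADs.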
File