Thesis etd-03252024-111027
Thesis type
Master's thesis
Author
BELLIZZI, LEONARDO
URN
etd-03252024-111027
Title
Real-time heart rate estimation with visions transformers
Department
INGEGNERIA DELL'INFORMAZIONE
Degree programme
ARTIFICIAL INTELLIGENCE AND DATA ENGINEERING
Supervisors
supervisor Prof. Tonellotto, Nicola
supervisor Prof. Ducange, Pietro
supervisor Prof. Vallati, Carlo
Keywords
- heart rate prediction
- neural networks
- PyTorch
- vision transformers
- web application
Defence session start date
17/04/2024
Availability
Not available for consultation
Release date
17/04/2094
Abstract
This thesis belongs to the field of healthcare. Its main objective is to predict heart rate (HR) from facial images, after extracting features from them using neural networks.
Specifically, the first objective is to find a neural network architecture that performs this regression task at a level comparable to that of the baseline models. The second is to make this model the core of a web application that predicts HR in real time from the PC webcam video stream and also lets the user run predictions on videos.
The first step was a review of state-of-the-art methods to identify the benchmark architectures in this domain. CNN-based architectures, known for their effectiveness in computer vision tasks, dominate this field as well. The EVMCNN and DeepPhys architectures stood out for their superior performance compared with their counterparts and were therefore chosen as baseline models.
Taking the current state of the art into account, an approach based on the Vision Transformer (ViT) model was developed, with modifications to suit the regression task.
The ViT implementation follows the Transformer architecture used in Natural Language Processing tasks.
The ECG-Fitness Database was used as the dataset: it contains videos of 17 subjects performing six distinct physical activities, with each video linked to a corresponding CSV file containing heart rate data. A preprocessing phase was needed to align the video frames with the ECG CSV files and ensure temporal coherence. Face detection was performed every ten frames, as a trade-off between data quantity and retention of temporal information. The next step was feature extraction from the cheek regions of the detected faces: an algorithm derives, from the raw frames, spectrograms that represent temporal pixel variations and retain the information relevant to heart rate frequencies.
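The frame sampling and frame-to-ECG alignment described above might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the alignment is done, so nearest-timestamp matching, the frame rate, the function names, and the shape of the ECG data (a sorted list of timestamps in seconds with one heart rate value each) are all hypothetical.

```python
from bisect import bisect_left


def sample_frame_indices(n_frames, step=10):
    """Indices of the frames on which face detection would run (every `step`-th frame)."""
    return list(range(0, n_frames, step))


def nearest_hr(frame_time, ecg_times, ecg_hr):
    """Return the HR value whose timestamp is closest to `frame_time`.

    Assumption: `ecg_times` is sorted in ascending order and has the same
    length as `ecg_hr`. Nearest-timestamp matching is a hypothetical choice.
    """
    i = bisect_left(ecg_times, frame_time)
    if i == 0:
        return ecg_hr[0]
    if i == len(ecg_times):
        return ecg_hr[-1]
    before, after = ecg_times[i - 1], ecg_times[i]
    # Pick whichever neighbouring ECG sample is closer in time (ties go to the earlier one).
    return ecg_hr[i] if after - frame_time < frame_time - before else ecg_hr[i - 1]


def align(n_frames, fps, ecg_times, ecg_hr, step=10):
    """Pair each sampled frame index with the HR value nearest to its timestamp."""
    return [
        (idx, nearest_hr(idx / fps, ecg_times, ecg_hr))
        for idx in sample_frame_indices(n_frames, step)
    ]


if __name__ == "__main__":
    # Toy data: 35 frames at an assumed 30 fps, three ECG readings.
    print(align(35, 30, [0.0, 0.5, 1.0], [70, 72, 75]))
```

A larger `step` yields fewer, less correlated samples, while a smaller one keeps more temporal detail at the cost of more face-detection calls.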
Two datasets were set up: one allows feature images from the same person to appear in both the training and test sets, while the other is split by individual, so that each person appears in either the training set or the test set, but not both.
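The difference between the two splits can be illustrated with a minimal sketch. The sample representation (a `(subject_id, feature_id)` tuple), the test fraction, the held-out subject IDs, and the function names are assumptions made for the example, not details taken from the thesis.

```python
import random


def split_mixed(samples, test_fraction=0.2, seed=0):
    """Random split: samples from the same subject may land in both sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_by_subject(samples, test_subjects):
    """Subject-wise split: every subject appears in exactly one of the two sets."""
    train = [s for s in samples if s[0] not in test_subjects]
    test = [s for s in samples if s[0] in test_subjects]
    return train, test


if __name__ == "__main__":
    # Toy data: 17 subjects with 4 hypothetical feature images each.
    samples = [(subject, k) for subject in range(17) for k in range(4)]

    train, test = split_by_subject(samples, test_subjects={15, 16})
    shared = {s[0] for s in train} & {s[0] for s in test}
    print(len(train), len(test), shared)
```

The subject-wise split is the stricter evaluation, since the model is tested only on people it has never seen during training.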
The models were trained and tested on both datasets. Although the models suffered performance degradation on the individual-separated dataset, as also happened to the DeepPhys researchers, ViT consistently outperformed the other architectures in both scenarios, achieving the first objective.
Given its performance, ViT was chosen as the most suitable model for real-time heart rate estimation in the web application, achieving the second objective.
File
File name | Size |
---|---|
The thesis is not available for consultation. |