Thesis etd-10042020-220325
Thesis type
Master's thesis
Author
BOMBARI, SIMONE
URN
etd-10042020-220325
Title
The dynamics of Stochastic Gradient Descent in the loss landscape of Deep Neural Networks
Department
PHYSICS
Degree programme
PHYSICS
Supervisors
Supervisor: Prof. Soatto, Stefano
Supervisor: Prof. Cataldo, Enrico
Keywords
- computer vision
- deep learning
- loss landscape
- machine learning
- optimization
- stochastic gradient descent
Date of thesis defence
26/10/2020
Availability
Thesis not available for consultation
Abstract
The deep learning optimization community has observed that a neural network's generalization ability is strongly related to the flatness of the loss landscape at the point to which the optimization algorithm converges. Experiments show that SGD is more likely to converge to flat minima than its deterministic counterpart, GD. In this work we build a mathematical model that aims to clarify this phenomenon, using a variation of the Eyring-Kramers law, a formula used in physics to describe the mean transition time of a Brownian particle between local minima of a potential landscape.
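For reference, the classical multidimensional Eyring-Kramers formula, of which the thesis uses a variation, gives the mean transition time from a local minimum x through a saddle z for the overdamped diffusion with noise intensity ε; this is the standard statement, not the thesis's specific variant:

```latex
% Diffusion: dX_t = -\nabla V(X_t)\, dt + \sqrt{2\varepsilon}\, dW_t.
% Mean transition time from minimum x through saddle z:
\mathbb{E}[\tau_{x \to z}] \simeq
  \frac{2\pi}{\lvert \lambda_-(z) \rvert}
  \sqrt{\frac{\lvert \det \nabla^2 V(z) \rvert}{\det \nabla^2 V(x)}}
  \, \exp\!\left( \frac{V(z) - V(x)}{\varepsilon} \right)
% \lambda_-(z): the unique negative eigenvalue of the Hessian \nabla^2 V(z).
```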
We then discuss the validity of the continuous approach for these purposes, showing that the SGD dynamics does not meet the requirements of our framework, since it is an inherently, strongly discrete process. This result casts doubt on the validity of the continuous-time approximations commonly used to analyze SGD dynamics through the theory of stochastic differential equations.
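The continuous-time approximation in question identifies one SGD step with an Euler-Maruyama step of a stochastic differential equation; a minimal sketch of this standard correspondence (η is the learning rate, Σ the minibatch gradient covariance; the Gaussian, small-step assumptions it relies on are exactly what the argument above questions):

```latex
% Discrete SGD update with minibatch gradient g_B:
\theta_{k+1} = \theta_k - \eta \, g_B(\theta_k),
\qquad \mathbb{E}[g_B] = \nabla L, \quad \operatorname{Cov}[g_B] = \Sigma(\theta)
% Continuous-time surrogate; its Euler-Maruyama discretization
% with step size \eta matches the SGD update in law (to first order):
d\theta_t = -\nabla L(\theta_t)\, dt + \sqrt{\eta\, \Sigma(\theta_t)}\, dW_t
```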
Finally, with empirical experiments, we investigate more closely the loss landscape and the SGD trajectory of a real training process on a real neural network. We thereby obtain an overview of the topology of the loss landscape, which we argue is analogous to a tower of colanders. In particular, we find a natural constraint between the loss and the largest eigenvalue of its Hessian: low values of the loss function cannot be reached without entering narrow regions of the landscape.
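Empirically, constraints between the loss and the sharpness are typically tracked by estimating the largest Hessian eigenvalue along the SGD trajectory with power iteration on Hessian-vector products, which avoids forming the Hessian explicitly. A minimal PyTorch sketch of that standard technique follows; the abstract does not specify the thesis's tooling, so the names and defaults here are illustrative:

```python
import torch


def top_hessian_eigenvalue(loss, params, iters=50, tol=1e-4):
    """Estimate the dominant Hessian eigenvalue of `loss` w.r.t. `params`
    by power iteration on Hessian-vector products (no explicit Hessian).

    Assumes the dominant eigenvalue is positive, as is typical for the
    training-loss Hessians studied here.
    """
    # First-order gradients, kept in the graph so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Random unit starting vector, one block per parameter tensor.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]
    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product: Hv = d(grad . v)/dtheta.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v (v has unit norm).
        new_eig = sum((h * x).sum() for h, x in zip(hv, v)).item()
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]
        if abs(new_eig - eig) <= tol * max(abs(new_eig), 1.0):
            break
        eig = new_eig
    return eig


# Example usage (hypothetical model and batch):
# loss = loss_fn(model(x_batch), y_batch)
# lam_max = top_hessian_eigenvalue(loss, list(model.parameters()))
```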
File
Filename | Size |
---|---|
Thesis not available for consultation. |