Thesis etd-02012024-174031
Thesis type
Master's thesis
Author
ANGILE', EMANUELE
URN
etd-02012024-174031
Title
A Study of Implicit Bias in the Training Algorithms of Diagonal Linear Networks
Department
MATHEMATICS
Degree programme
MATHEMATICS
Supervisors
supervisor Agazzi, Andrea
Keywords
- implicit bias
- machine learning
- stochastic differential equation
- stochastic gradient descent
Session start date
23/02/2024
Availability
Thesis not available for consultation
Abstract
Understanding the performance of neural networks is one of the most intriguing challenges in the current machine learning landscape. One question that arises is: why do they generalize well, even without explicit regularization? In particular, in overparameterized models where the training objective has many global minima, optimizing with a specific algorithm (typically gradient-based) implicitly biases the solutions toward certain special global minima.
Our work centers on the analysis of Diagonal Linear Networks (DLNs), a reparametrization of linear predictors. Despite its simplicity, this framework provides insights into how stochasticity contributes to generalization, leading to sparser solutions. We compare the dynamics of Gradient Descent (GD) and Stochastic Gradient Descent (SGD) by examining the continuous processes that model them. In the deterministic case, this is the ODE associated with the Gradient Flow. For the stochastic counterpart, we construct the corresponding SDE, with particular emphasis on the diffusion term. Motivated by the experimental results, we further investigate Momentum SGD (MSGD), as it exhibits superior generalization performance. We demonstrate that the SDE related to MSGD introduces an effective parameter that decreases further than its counterpart for SGD.
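As an illustration of the setting described above, here is a minimal sketch of a diagonal linear network trained with GD or mini-batch SGD. The parametrization `w = u*u - v*v` is one common choice in the literature and is an assumption here, since the record does not state which parametrization the thesis uses. The names `dln_predictor` and `train_dln`, the squared loss, and all hyperparameter values are likewise hypothetical.

```python
import numpy as np


def dln_predictor(u, v):
    """Effective linear predictor of the network (assumed form: w = u*u - v*v)."""
    return u * u - v * v


def train_dln(X, y, alpha=0.1, lr=0.005, steps=5000, batch_size=None, seed=0):
    """Train a diagonal linear network on the squared loss.

    batch_size=None runs full-batch GD; an integer runs mini-batch SGD.
    alpha is the initialization scale (u = v = alpha, so w starts at 0).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    u = np.full(d, alpha)
    v = np.full(d, alpha)
    for _ in range(steps):
        if batch_size is None:
            idx = np.arange(n)  # deterministic case: use every sample
        else:
            idx = rng.choice(n, size=batch_size, replace=False)  # stochastic case
        Xb, yb = X[idx], y[idx]
        residual = Xb @ dln_predictor(u, v) - yb
        # Gradient of 0.5 * mean(residual**2) with respect to w.
        grad_w = Xb.T @ residual / len(idx)
        # Chain rule through the parametrization: dw/du = 2u, dw/dv = -2v.
        u, v = u - lr * 2.0 * u * grad_w, v + lr * 2.0 * v * grad_w
    return dln_predictor(u, v)
```

Calling `train_dln(X, y)` and `train_dln(X, y, batch_size=5)` on the same overparameterized problem (more features than samples) and comparing the l1 norms of the two returned predictors is one simple way to look at the sparsity effect that the abstract attributes to stochasticity.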
As our work centers on the analysis of continuous processes, we conclude by establishing the connection between the discrete process (coinciding with the steps of the algorithm) and the continuous one. We present a classical result on the weak convergence of the discretization of an SDE to its solution as the step size approaches 0.
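For orientation, the textbook form of such a weak convergence statement for the Euler-Maruyama scheme can be written as follows. This is a generic sketch and not necessarily the exact theorem presented in the thesis: the drift $b$, diffusion coefficient $\sigma$, test function $f$, horizon $T$, and constant $C_{f,T}$ are assumed notation, with $b$, $\sigma$, and $f$ taken to be sufficiently smooth with bounded derivatives.

```latex
% Assumed generic setting (not taken from the thesis): h is the step size,
% X^h_0 = X_0, and C_{f,T} does not depend on h.
\begin{align*}
  dX_t &= b(X_t)\,dt + \sigma(X_t)\,dW_t, \\
  X^h_{k+1} &= X^h_k + h\,b(X^h_k) + \sqrt{h}\,\sigma(X^h_k)\,\xi_k,
  \qquad \xi_k \sim \mathcal{N}(0, I) \ \text{i.i.d.}, \\
  \bigl|\mathbb{E}\, f\bigl(X^h_{\lfloor T/h \rfloor}\bigr) - \mathbb{E}\, f(X_T)\bigr|
  &\le C_{f,T}\, h .
\end{align*}
```

The bound concerns expectations of test functions rather than individual trajectories, which is why it is a weak rather than a strong convergence result.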
File
File name | Size |
---|---|
Thesis not available for consultation. |