Thesis etd-09072025-171516
  
Thesis type
Master's thesis

Author
MAZZUCCO, LUCA

URN
etd-09072025-171516

Title
Large deviations for deep transformer models

Department
MATHEMATICS

Degree programme
MATHEMATICS

Supervisor
Prof. Agazzi, Andrea

Keywords
- Bayesian neural networks
- large deviation principle
- transformers

Defense session date
26/09/2025

Availability
Full

Abstract

This thesis investigates the Large Deviation Principle (LDP) for Transformer models, a central architecture in modern deep learning. Large deviation theory provides a rigorous framework for quantifying the probability of rare events; applied to neural networks, it captures fluctuations around their deterministic Gaussian process limits. These rare fluctuations are key to understanding the stability of learning dynamics and the universality of large-scale models.
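For reference, a minimal sketch of what a large deviation principle asserts (the standard textbook form, not a statement from the thesis); the sequence (X_n), the rate function I, and the event A are generic placeholders.

```latex
% Heuristic form of a large deviation principle (LDP) at speed n:
% the probability of a rare event decays exponentially, with the cost
% governed by a lower semicontinuous rate function I >= 0.
\[
  \mathbb{P}\bigl(X_n \in A\bigr) \;\approx\; \exp\Bigl(-\,n \inf_{x \in A} I(x)\Bigr),
  \qquad n \to \infty,
\]
% made rigorous by an upper bound with the infimum of I over the closure of A
% and a lower bound with the infimum of I over the interior of A.
```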
We focus on a simplified transformer-like architecture, where query and key weights are assumed fixed or pre-trained, so that the model takes the form of a deep linear network. We build on recent results in the literature: it is known that in a Bayesian setting with Gaussian priors and Gaussian noise, and in the double large-scale limit (neurons, samples, and input dimension diverging at fixed ratios), the posterior covariance kernel can be expressed through the minimizer of an action functional.
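The reduction to a deep linear network can be pictured as follows; this is a hypothetical sketch of one way to freeze the attention mechanism (the notation A^(l), W_V^(l), H^(l) and the exact architecture are assumptions for illustration, not taken from the thesis).

```latex
% One attention layer on T tokens of width d:
%   Attn(X) = softmax( X W_Q (X W_K)^T / sqrt(d) ) X W_V  =:  A \, X \, W_V .
% If the query/key weights W_Q, W_K are fixed or pre-trained and the resulting
% attention matrices A^{(\ell)} are treated as frozen, each layer is linear in
% its value weights, and the stack
\[
  H^{(\ell)} \;=\; A^{(\ell)} H^{(\ell-1)} W_V^{(\ell)},
  \qquad H^{(0)} = X, \qquad \ell = 1, \dots, L,
\]
% composes to  H^{(L)} = (A^{(L)} \cdots A^{(1)}) \, X \, W_V^{(1)} \cdots W_V^{(L)},
% i.e. a deep linear network in the value weights W_V^{(1)}, ..., W_V^{(L)}.
```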
The main contribution of this thesis is to reinterpret such results within the large deviations framework, providing a partial unification of perspectives developed in distinct contexts. In particular, when the layer width diverges while the input dimension and dataset size remain finite, we identify the action as a rate function, thereby connecting this line of work with the broader literature on large deviations for covariance processes.
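As the simplest illustration of this wide-layer regime (a standard one-layer Gaussian computation under assumed notation w_i, Sigma, K_n, d, not a result stated in the thesis), the empirical covariance of n i.i.d. Gaussian features satisfies an LDP whose rate function is a Gaussian relative entropy — the prototype of an action playing the role of a rate function for a covariance kernel.

```latex
% Width-n layer with i.i.d. features w_1, ..., w_n ~ N(0, Sigma) in R^d,
% where d (input dimension / number of data points) stays fixed.
% Empirical covariance kernel:
\[
  K_n \;=\; \frac{1}{n} \sum_{i=1}^{n} w_i w_i^{\top} .
\]
% By Cramér's theorem, (K_n) satisfies an LDP at speed n with rate function
\[
  I(K) \;=\; \tfrac{1}{2} \Bigl( \operatorname{tr}\bigl(\Sigma^{-1} K\bigr) - d
            - \log\det\bigl(\Sigma^{-1} K\bigr) \Bigr), \qquad K \succ 0,
\]
% which equals the relative entropy D( N(0,K) \,\|\, N(0,\Sigma) ).  Its unique
% minimizer K = Sigma recovers the law-of-large-numbers (Gaussian process) limit,
% while I quantifies the exponential cost of fluctuations of the kernel around it.
```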
Files
  
| File name | Size |
|---|---|
| Tesi_Mazzucco.pdf | 636.87 KB |