ETD

Digital archive of theses discussed at the University of Pisa

 

Thesis etd-07042025-104732


Thesis type
Doctoral thesis (PhD)
URN
etd-07042025-104732
Thesis title
Enhancing Public Administration with Computational Linguistics: a Language Model for Italian Bureaucratic Language
Academic discipline
GLOT-01/A - Historical and General Linguistics
Course of study
DISCIPLINE LINGUISTICHE E LETTERATURE STRANIERE
Keywords
  • administrative data
  • BureauBERTo
  • encoder
  • fine-tuning
  • further pre-training
  • Italian bureaucratic language
  • language model
  • prompting
  • public administration
  • specialized model
Graduation session start date
11/07/2025
Availability
Withheld
Release date
11/07/2028
Abstract (English)
This thesis addresses the automatic analysis of texts written in bureaucratic Italian through the development of resources and the identification of computational linguistics and NLP approaches applicable to data from the Italian Public Administration (PA), with the goal of supporting its digital transformation. The research focuses on two main areas of intervention: streamlining the processing of administrative documents and improving the readability of PA texts. Sector-specific languages, such as bureaucratic Italian, often pose challenges for general-purpose language models, which lack the linguistic knowledge required to accurately perform domain-specific tasks. To address this issue, the thesis describes the stages leading to the development of BureauBERTo, an encoder-based language model and the first to be specialized in the Italian bureaucratic domain. BureauBERTo’s performance was tested and compared to other models using supervised, unsupervised, and prompt-based learning approaches, demonstrating the effectiveness of specialized models in domain-specific tasks, even with limited annotated data. The research also showed that specialized encoders offer an efficient and more sustainable solution for discriminative tasks compared to current large language models, while ensuring internal data governance for public institutions and fostering AI applications that are accessible even to smaller entities within the public sector.