ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-09182023-153912


Thesis type
Master's thesis
Author
LARI, FILIPPO
URN
etd-09182023-153912
Title
A Search Engine For Source Code
Department
COMPUTER SCIENCE
Degree programme
COMPUTER SCIENCE
Supervisors
supervisor Prof. Ferragina, Paolo
Keywords
  • locality-sensitive hashing
  • minhash
  • clone search
  • code search
Exam session start date
06/10/2023
Availability
Not available for consultation
Release date
06/10/2026
Abstract
Software plays a central role in our era, and source code is a particular kind of information produced in extremely large amounts. Because so much source code already exists, most code a developer is about to write has either already been written elsewhere or is at least similar to some existing code. Software Heritage, an ambitious initiative launched in 2015 by INRIA and supported by prestigious sponsors such as Google, Microsoft, GitHub, and the universities of Bologna and Pisa, collects all publicly available software in order to preserve it as part of our cultural heritage. At the time of writing, Software Heritage is the world’s largest archive of source code, with more than 16 billion source files and over 3 billion commits from more than 250 million projects. Although Software Heritage stores source code very effectively, it lacks a method for searching its enormous collection. This challenge motivated this thesis, in which we propose a novel method for efficiently indexing large repositories of Java code and effectively answering queries over them. The proposed solution has been tested on a well-known benchmark, achieving results comparable with the state of the art while maintaining fast query times and low memory consumption.
File