logo SBA

ETD

Archivio digitale delle tesi discusse presso l’Università di Pisa

Tesi etd-07082025-181954


Tipo di tesi
Tesi di dottorato di ricerca
Autore
MWANIKI, MOSES NJAGI
URN
etd-07082025-181954
Titolo
Matching to and between Pangenomes
Settore scientifico disciplinare
INF/01 - INFORMATICA
Corso di studi
INFORMATICA
Relatori
tutor Prof.ssa Pisanti, Nadia
Parole chiave
  • algorithms
  • alignment
  • D-strings
  • ED-strings
  • graphs
  • pangenomes
  • sequence
Data inizio appello
08/04/2025
Consultabilità
Completa
Riassunto
Chapter 1 lays the foundation for understanding computational pangenomics by highlighting its broader relevance, introducing its core concepts, and reviewing the latest developments in the field. It also situates the results of this thesis within the context of ongoing research, helping readers appreciate both the theoretical underpinnings and the practical implications of this work.

Chapter 2 then provides a high-level overview of the key data structures used to represent pangenomes. Starting with the fundamentals of pairwise sequence alignment, it builds up to the construction of the most straightforward pangenome representation---the multiple sequence alignment (MSA). The chapter then explores how to transform the MSA into various pangenome representations such as the Degenerate String (D-string), Elastic Degenerate String (ED-string), and variation graph. The chapter then concludes with a discussion of the strengths, limitations, and current advancements of various pangenomic representations.

Following these introductory chapters, the remainder of the thesis describes my original contributions to pangenomics.

Chapter 2, associated with the list of publications one through four, describes my contribution to the first draft of the human pangenome, which is the achievement of a large multidisciplinary consortium. After briefly describing the different steps and fields of study involved in constructing the pangenome, the focus switches to cutting-edge methods towards achieving chromosome-scale pairwise sequence alignment. The latter is a fundamental step in constructing real-life complex pangenomes such as the human pangenome. In particular, the chapter describes a strategy to improve a preprocessing step to pairwise alignment, which provides a possible better estimate of finding regions of similarity within segments of a pair of genomes.

Chapter 4, associated with conference publication two, details wavefront alignment, which will have been introduced in Chapter~\ref{chpt: reps}. The chapter details an extension of wavefront alignment to the alignment of a linear string to a D-string, detailing both the theoretical foundations of the method and the experimental validation of the tool developed through our implementation.

In Chapter 5, associated with journal publication seven and conference publication one, the wavefront alignment paradigm is further extended to the case of the global alignment of a linear string with an ED-string.
More specifically, it describes how to extend the commonly used set of edit operations that shape the \emph{edit transcript}-the sequence of operations that details the alignment-by introducing a new edit operation: the \emph{skip} operation. This somewhat fills a gap in the literature in the field of the \emph{progressive} multiple sequence alignment, that is, when an MSA is built with an iterative procedure by progressively aligning a new string onto an existing multiple alignment. This was usually done disregarding existing gaps, whereas the explicit management of skips forces the multiple sequence alignment to account for them.

Chapter 6, associated with journal publication six and conference publication three as well as preprint one, transitions from string-to-pangenome comparison onto pangenome-to-pangenome comparison. Focusing on ED-strings through the lens of automata, it describes how to outdo the naive automata-based approach that would require quadratic time complexity to quickly find whether two pangenomes share a genome; this is termed as the \emph{Elastic Degenerate String Intersection} (EDSI) problem. Furthermore, we design a data structure, which we shall call the intersection graph, to enable efficient responses to a broader range of real-world queries beyond the EDSI. Using the intersection graph, we show how to quickly compute two possible notions of local \emph{matching statistics} between pangenomes that we devised, as well as many consequent measures of similarity and distance metrics between pangenomes. Finally, we shall observe a validation of the method by applying the consequent tool to a SARS-CoV-2 dataset.

Finally, 7 concludes the thesis by summarizing the work presented and outlining promising directions for future research.
File