Tesi etd-07082025-181954

Tipo di tesi

Tesi di dottorato di ricerca

Autore

MWANIKI, MOSES NJAGI

URN

etd-07082025-181954

Titolo

Matching to and between Pangenomes

Settore scientifico disciplinare

INF/01 - INFORMATICA

Corso di studi

INFORMATICA

Relatori

tutor Prof.ssa Pisanti, Nadia

Parole chiave

algorithms
alignment
D-strings
ED-strings
graphs
pangenomes
sequence

Data inizio appello

08/04/2025

Consultabilità

Completa

Riassunto

Chapter 1 lays the foundation for understanding computational pangenomics by highlighting its broader relevance, introducing its core concepts, and reviewing the latest developments in the field. It also situates the results of this thesis within the context of ongoing research, helping readers appreciate both the theoretical underpinnings and the practical implications of this work.

Chapter 2 then provides a high-level overview of the key data structures used to represent pangenomes. Starting with the fundamentals of pairwise sequence alignment, it builds up to the construction of the most straightforward pangenome representation---the multiple sequence alignment (MSA). The chapter then explores how to transform the MSA into various pangenome representations such as the Degenerate String (D-string), Elastic Degenerate String (ED-string), and variation graph. The chapter then concludes with a discussion of the strengths, limitations, and current advancements of various pangenomic representations.

Following these introductory chapters, the remainder of the thesis describes my original contributions to pangenomics.

Chapter 2, associated with the list of publications one through four, describes my contribution to the first draft of the human pangenome, which is the achievement of a large multidisciplinary consortium. After briefly describing the different steps and fields of study involved in constructing the pangenome, the focus switches to cutting-edge methods towards achieving chromosome-scale pairwise sequence alignment. The latter is a fundamental step in constructing real-life complex pangenomes such as the human pangenome. In particular, the chapter describes a strategy to improve a preprocessing step to pairwise alignment, which provides a possible better estimate of finding regions of similarity within segments of a pair of genomes.

Chapter 4, associated with conference publication two, details wavefront alignment, which will have been introduced in Chapter~\ref{chpt: reps}. The chapter details an extension of wavefront alignment to the alignment of a linear string to a D-string, detailing both the theoretical foundations of the method and the experimental validation of the tool developed through our implementation.

In Chapter 5, associated with journal publication seven and conference publication one, the wavefront alignment paradigm is further extended to the case of the global alignment of a linear string with an ED-string.
More specifically, it describes how to extend the commonly used set of edit operations that shape the \emph{edit transcript}-the sequence of operations that details the alignment-by introducing a new edit operation: the \emph{skip} operation. This somewhat fills a gap in the literature in the field of the \emph{progressive} multiple sequence alignment, that is, when an MSA is built with an iterative procedure by progressively aligning a new string onto an existing multiple alignment. This was usually done disregarding existing gaps, whereas the explicit management of skips forces the multiple sequence alignment to account for them.

Chapter 6, associated with journal publication six and conference publication three as well as preprint one, transitions from string-to-pangenome comparison onto pangenome-to-pangenome comparison. Focusing on ED-strings through the lens of automata, it describes how to outdo the naive automata-based approach that would require quadratic time complexity to quickly find whether two pangenomes share a genome; this is termed as the \emph{Elastic Degenerate String Intersection} (EDSI) problem. Furthermore, we design a data structure, which we shall call the intersection graph, to enable efficient responses to a broader range of real-world queries beyond the EDSI. Using the intersection graph, we show how to quickly compute two possible notions of local \emph{matching statistics} between pangenomes that we devised, as well as many consequent measures of similarity and distance metrics between pangenomes. Finally, we shall observe a validation of the method by applying the consequent tool to a SARS-CoV-2 dataset.

Finally, 7 concludes the thesis by summarizing the work presented and outlining promising directions for future research.

File

Nome file	Dimensione
PhD_Thesis.pdf	2.47 Mb
Contatta l’autore

ETD

Archivio digitale delle tesi discusse presso l’Università di Pisa

Tesi etd-07082025-181954