On-line Elastic Degenerate Strings q-gram distance with Suffix Trees
Dipartimento
INFORMATICA
Corso di studi
INFORMATICA
Relatori
relatore Prof.ssa Pisanti, Nadia
Parole chiave
algorithm
bioinformatica
bioinformatics
ED-string
elastic degenerate
genoma
genome
pangenome
q-gram distance
strings
suffix tree
Data inizio appello
06/03/2020
Consultabilità
Tesi non consultabile
Riassunto
An elastic-degenerate (ED) string is a sequence of n sets of strings of total length N, which was recently proposed to model a set of p genomes of average length w. ED-strings find an interesting application in the representation of the genome of a population: a pan-genome. In this view, comparing two ED-strings is a way to discover similarities and differences between distinct (sub)populations. In this thesis we study the ED-strings distance problem based on the so-called q-grams. We first define two distance notions, and we give three algorithms. The first two, one per each distance notion as a comparison base, are an extension of Ukkonen's algorithm and run in time O(Np) and O(Np log Np), respectively. The third is able to compute both q-gram distances between ED-strings in time O(N p^{q/w}). This is achieved using a data structure we devised named DIGIT: eD-strIng q-Gram suffIx Tree: a generalized suffix tree that keeps tracks of q-grams. To our knowledge, we are the first to define and apply q-gram distance notions with ED-strings.