logo SBA

ETD

Archivio digitale delle tesi discusse presso l’Università di Pisa

Tesi etd-01252020-182713


Tipo di tesi
Tesi di laurea magistrale
Autore
SCHIRINZI, SIMONE
URN
etd-01252020-182713
Titolo
On-line Elastic Degenerate Strings q-gram distance with Suffix Trees
Dipartimento
INFORMATICA
Corso di studi
INFORMATICA
Relatori
relatore Prof.ssa Pisanti, Nadia
Parole chiave
  • algorithm
  • bioinformatica
  • bioinformatics
  • ED-string
  • elastic degenerate
  • genoma
  • genome
  • pangenome
  • q-gram distance
  • strings
  • suffix tree
Data inizio appello
06/03/2020
Consultabilità
Tesi non consultabile
Riassunto
An elastic-degenerate (ED) string is a sequence of n sets of strings of total length N, which was recently proposed to model a set of p genomes of average length w. ED-strings find an interesting application in the representation of the genome of a population: a pan-genome. In this view, comparing two ED-strings is a way to discover similarities and differences between distinct (sub)populations.
In this thesis we study the ED-strings distance problem based on the so-called q-grams. We first define two distance notions, and we give three algorithms. The first two, one per each distance notion as a comparison base, are an extension of Ukkonen's algorithm and run in time O(Np) and O(Np log Np), respectively. The third is able to compute both q-gram distances between ED-strings in time O(N p^{q/w}). This is achieved using a data structure we devised named DIGIT: eD-strIng q-Gram suffIx Tree: a generalized suffix tree that keeps tracks of q-grams. To our knowledge, we are the first to define and apply q-gram distance notions with ED-strings.
File