logo SBA

ETD

Archivio digitale delle tesi discusse presso l’Università di Pisa

Tesi etd-05222017-121423


Tipo di tesi
Tesi di laurea magistrale
Autore
BERNARDINI, GIULIA
URN
etd-05222017-121423
Titolo
Approximate Pattern Matching on Elastic-Degenerate Text
Dipartimento
MATEMATICA
Corso di studi
MATEMATICA
Relatori
relatore Pisanti, Nadia
correlatore Pissis, Solon
Parole chiave
  • degenerate string comparison
  • degenerate strings
  • elastic-degenerate strings
  • multiple genomes
  • palindromes
  • palindromic decomposition
  • pan-genome
  • pattern matching
Data inizio appello
14/07/2017
Consultabilità
Non consultabile
Data di rilascio
14/07/2087
Riassunto
A pan-genome is a group of closely-related genomes that are meant to be analyzed jointly or to be used as a reference. Expecting an increasing rate of genome production, due to the advent of rapid and cheap ‘next-generation’ sequencing technologies, pan-genomes should ideally take over the role of linear reference genomes in comparative genomics: a fundamental challenge in then to find efficient methods that can map reads of a newly sequenced genome to some representation of a pan-genome. A convenient way to represent a multiple alignment of several closely-related genomes is through an elastic-degenerate string, that is a sequence of n sets of strings of total length N: the issue is then to find all matches of a pattern of length m in a text of this kind. There exists an O(nm² + N)-time algorithm to solve the exact version of this problem (i.e. the pattern is required to be identical to some substring of the text) on-line. The first goal of this work is to generalize the previous problem, switching from exact to inexact matching, under both the Hamming and the edit distance models: the switch is essential to deal with real data, that are always affected by a certain error rate. Two efficient on-line algorithms are presented: the first one runs in O(kmG + kN)-time and O(m)-space, where G is the total number of strings in the elastic-degenerate text and k is the maximum Hamming distance allowed; the other deals with the edit distance and requires time O(k²mG + kN) and space O(m). A final section is devoted to study a method for comparing two pan-genomes. It is first demonstrated that it is NP-hard to compare two elastic-degenerate strings, thus degenerate (not elastic) strings are introduced in order to make the problem treatable. A linear-time algorithm for comparing two degenerate strings is then presented and used as a subroutine for detecting palindromes in such a structure, which is a highly relevant problem in molecular biology.
File