logo SBA

ETD

Digital archive of theses discussed at the University of Pisa

 

Thesis etd-05222017-121423


Thesis type
Tesi di laurea magistrale
Author
BERNARDINI, GIULIA
URN
etd-05222017-121423
Thesis title
Approximate Pattern Matching on Elastic-Degenerate Text
Department
MATEMATICA
Course of study
MATEMATICA
Supervisors
relatore Pisanti, Nadia
correlatore Pissis, Solon
Keywords
  • degenerate string comparison
  • degenerate strings
  • elastic-degenerate strings
  • multiple genomes
  • palindromes
  • palindromic decomposition
  • pan-genome
  • pattern matching
Graduation session start date
14/07/2017
Availability
Withheld
Release date
14/07/2087
Summary
A pan-genome is a group of closely-related genomes that are meant to be analyzed jointly or to be used as a reference. Expecting an increasing rate of genome production, due to the advent of rapid and cheap ‘next-generation’ sequencing technologies, pan-genomes should ideally take over the role of linear reference genomes in comparative genomics: a fundamental challenge in then to find efficient methods that can map reads of a newly sequenced genome to some representation of a pan-genome. A convenient way to represent a multiple alignment of several closely-related genomes is through an elastic-degenerate string, that is a sequence of n sets of strings of total length N: the issue is then to find all matches of a pattern of length m in a text of this kind. There exists an O(nm² + N)-time algorithm to solve the exact version of this problem (i.e. the pattern is required to be identical to some substring of the text) on-line. The first goal of this work is to generalize the previous problem, switching from exact to inexact matching, under both the Hamming and the edit distance models: the switch is essential to deal with real data, that are always affected by a certain error rate. Two efficient on-line algorithms are presented: the first one runs in O(kmG + kN)-time and O(m)-space, where G is the total number of strings in the elastic-degenerate text and k is the maximum Hamming distance allowed; the other deals with the edit distance and requires time O(k²mG + kN) and space O(m). A final section is devoted to study a method for comparing two pan-genomes. It is first demonstrated that it is NP-hard to compare two elastic-degenerate strings, thus degenerate (not elastic) strings are introduced in order to make the problem treatable. A linear-time algorithm for comparing two degenerate strings is then presented and used as a subroutine for detecting palindromes in such a structure, which is a highly relevant problem in molecular biology.
File