## Tesi etd-05222017-121423 |

Thesis type

Tesi di laurea magistrale

Author

BERNARDINI, GIULIA

URN

etd-05222017-121423

Title

Approximate Pattern Matching on Elastic-Degenerate Text

Struttura

MATEMATICA

Corso di studi

MATEMATICA

Supervisors

**relatore**Pisanti, Nadia

**correlatore**Pissis, Solon

Parole chiave

- palindromic decomposition
- degenerate string comparison
- multiple genomes
- palindromes
- pattern matching
- elastic-degenerate strings
- degenerate strings
- pan-genome

Data inizio appello

14/07/2017;

Consultabilità

Parziale

Data di rilascio

14/07/2020

Riassunto analitico

A pan-genome is a group of closely-related genomes that are meant to be analyzed jointly or to be used as a reference. Expecting an increasing rate of genome production, due to the advent of rapid and cheap ‘next-generation’ sequencing technologies, pan-genomes should ideally take over the role of linear reference genomes in comparative genomics: a fundamental challenge in then to find efficient methods that can map reads of a newly sequenced genome to some representation of a pan-genome. A convenient way to represent a multiple alignment of several closely-related genomes is through an elastic-degenerate string, that is a sequence of n sets of strings of total length N: the issue is then to find all matches of a pattern of length m in a text of this kind. There exists an O(nm² + N)-time algorithm to solve the exact version of this problem (i.e. the pattern is required to be identical to some substring of the text) on-line. The first goal of this work is to generalize the previous problem, switching from exact to inexact matching, under both the Hamming and the edit distance models: the switch is essential to deal with real data, that are always affected by a certain error rate. Two efficient on-line algorithms are presented: the first one runs in O(kmG + kN)-time and O(m)-space, where G is the total number of strings in the elastic-degenerate text and k is the maximum Hamming distance allowed; the other deals with the edit distance and requires time O(k²mG + kN) and space O(m). A final section is devoted to study a method for comparing two pan-genomes. It is first demonstrated that it is NP-hard to compare two elastic-degenerate strings, thus degenerate (not elastic) strings are introduced in order to make the problem treatable. A linear-time algorithm for comparing two degenerate strings is then presented and used as a subroutine for detecting palindromes in such a structure, which is a highly relevant problem in molecular biology.

File

Nome file | Dimensione |
---|---|

1 file non consultabili su richiesta dell'autore. |