Thesis etd-04032019-234748
Thesis type
Master's thesis
Author
GUERRISI, GABRIELE
URN
etd-04032019-234748
Title
A Novel Deep Learning Solution for Predicting the Secondary Structure of RNA
Department
INFORMATION ENGINEERING
Degree programme
BIOMEDICAL ENGINEERING
Supervisors
supervisor Prof. Bechini, Alessio
second reader Vozzi, Giovanni
Keywords
- deep learning
- prediction
- prediction of secondary structure of RNA
- rna
Graduation session start date
24/04/2019
Availability
Not available for consultation
Release date
24/04/2089
Abstract
This thesis addresses the RNA secondary structure prediction problem, which consists in predicting the secondary structure of an RNA molecule given its primary structure (its sequence of bases). The problem is of great importance in biology, since RNA is the carrier of the information linking DNA to the final biological element built. Moreover, miRNA strands are gene expression regulators, and there are many metabolic functions that could be fully understood.
Correct prediction of RNA secondary structure would have applications in many biological fields and, more generally, would help to understand the roles of uncharacterized sequences or to design new ones from scratch with specific aims.
The secondary structure of an RNA sequence is not easily predictable, owing to many factors.
The main difficulty is the size of the folding space: every sequence can fold into different structures depending on various conditions. This complexity leads to an NP-hard problem.
To tackle this, the problem has been characterized and a Deep Learning approach has been used to search for solutions in directions different from those of the classic methods. The model has been fed only the sequences, with no additional information about thermodynamics or other constraints.
The dataset is taken from RNA STRAND 2.0. Before the experiments, the input dataset was pre-processed so that it could be fed to the model as a tensor with a one-hot encoding of the sequences. In addition, 5-fold cross-validation was used because of the limited size of the dataset.
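As an illustration of the pre-processing described above, the following is a minimal sketch and not the code of the thesis. It assumes a four-letter alphabet `ACGU`, zero-padding to a fixed length, and all-zero rows for any other symbol; the function names `one_hot` and `k_fold_indices` are hypothetical.

```python
import random

ALPHABET = "ACGU"  # assumed base alphabet; other symbols map to all zeros


def one_hot(sequence, max_len):
    """Encode a sequence as a max_len x 4 matrix, truncated or zero-padded."""
    rows = []
    for base in sequence[:max_len]:
        row = [0] * len(ALPHABET)
        index = ALPHABET.find(base)
        if index >= 0:
            row[index] = 1
        rows.append(row)
    # Pad short sequences with all-zero rows so every sample has the same shape.
    while len(rows) < max_len:
        rows.append([0] * len(ALPHABET))
    return rows


def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train, validation) index splits over a shuffled dataset."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, validation))
    return splits


encoded = one_hot("GCAU", max_len=6)
splits = k_fold_indices(n_samples=10, k=5)
```

Each sample is used for validation exactly once across the five splits, which is the property that makes cross-validation useful on a small dataset.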
The work uses LSTM and Dense layers to build a model intended to achieve higher metric values than the classic methods. Further experiments were carried out implementing Attention and an Encoder-Decoder architecture with the Keras Functional API.
In the last part of the experiment, appropriate metrics are chosen and calculated.
Finally, the results are analyzed and discussed. The last chapter compares the Deep Learning experiment with the classic approaches. In particular, the metrics of the DeepFold model are compared with those of mfold, ProbKnot, CONTRAfold and RNAfold. As the results show, the model has the potential to improve on current results. Specifically, it improves single-base prediction, that is, predicting whether a base will pair or not. However, the network is still at an early stage in understanding the core concepts of RNA sequences and folding. The hardest rule for the model to learn is that every paired base belongs to a pair of bases bound together. This is a bottleneck in the analysis of the results, because the predicted structure cannot be analyzed properly in terms of base pairs; a post-processing step must eventually be applied. This leads to a new proposal for analyzing the results: the known metrics are used with modified formulas for cases in which the prediction loses biological meaning.
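To make the distinction between single-base prediction and consistent base pairing concrete, here is a hedged sketch. It assumes structures are written in dot-bracket notation without pseudoknots (the abstract does not state the output format) and it uses the standard sensitivity, PPV and F1 definitions on the paired/unpaired label of each base. It does not reproduce the modified formulas proposed in the thesis, and the function names are hypothetical.

```python
def is_well_formed(structure):
    """Check that every '(' has a matching ')' (assumed dot-bracket notation)."""
    depth = 0
    for symbol in structure:
        if symbol == "(":
            depth += 1
        elif symbol == ")":
            depth -= 1
            if depth < 0:  # a closing bracket with no partner
                return False
    return depth == 0


def per_base_scores(reference, predicted):
    """Sensitivity, PPV and F1 for the per-base label 'paired' vs 'unpaired'."""
    if len(reference) != len(predicted):
        raise ValueError("structures must have the same length")
    tp = fp = fn = 0
    for ref, pred in zip(reference, predicted):
        ref_paired = ref != "."
        pred_paired = pred != "."
        if ref_paired and pred_paired:
            tp += 1
        elif pred_paired:
            fp += 1
        elif ref_paired:
            fn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    total = sensitivity + ppv
    f1 = 2 * sensitivity * ppv / total if total else 0.0
    return sensitivity, ppv, f1


reference = "((..))"
predicted = "((...)"  # good per-base labels, but one bracket has no partner
scores = per_base_scores(reference, predicted)
consistent = is_well_formed(predicted)
```

In this toy case the per-base scores are high while the predicted string is not a valid set of base pairs, which is the kind of situation in which pair-based metrics cannot be applied without post-processing.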
The project is written in Python, and the model was built with the Keras library on the TensorFlow back-end. The work was also made easier by the Colab environment provided free of charge by Google.
File
File name | Size |
---|---|
Thesis not available for consultation. |