ETD system

Electronic theses and dissertations repository

 

Thesis etd-06222017-112501


Thesis type
Master's thesis
Author
RICATTO, MATTIA
URN
etd-06222017-112501
Title
Towards Effective Use of Massive Cancer Genomic Data with Cluster Computing Frameworks
Department
INFORMATION ENGINEERING
Degree programme
BIOMEDICAL ENGINEERING
Supervisors
supervisor Dr. Bechini, Alessio
Keywords
  • distributed algorithms
  • biological data mining
  • apache spark
Exam session start date
14/07/2017
Availability
Partial
Release date
14/07/2020
Abstract
"Too much information, not enough knowledge" is one of the maxims of the first two decades of the 21st century. Thanks to technological advances, unprecedented amounts of data are now available, and these data collections have become so large and complex (hence the name Big Data) that traditional data processing software is inadequate to deal with them. The biomedical sciences are already contributing massively to the Big Data revolution, owing to advances in genome sequencing technology and digital imaging, the growth of clinical data warehouses, and the increased role of patients in managing their own health information. In this work, Apache Spark, a fast and general engine for large-scale data processing with built-in modules for streaming, SQL, machine learning and graph processing, made it possible to work with data from The Cancer Genome Atlas, a project that aims to catalogue the genetic mutations responsible for cancer using genome sequencing and bioinformatics, in order to develop a scalable and reproducible method for data preparation and data investigation. Subsequently, this method was applied to investigate Copy Number Variation data with classification algorithms tailored for distributed computing on Apache Spark. The results are encouraging and underline the effectiveness of data mining on biomedical big data.