ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-06222017-112501


Thesis type
Master's thesis
Author
RICATTO, MATTIA
URN
etd-06222017-112501
Title
Towards Effective Use of Massive Cancer Genomic Data with Cluster Computing Frameworks
Department
INFORMATION ENGINEERING
Degree programme
BIOMEDICAL ENGINEERING
Supervisors
supervisor Dr. Bechini, Alessio
Keywords
  • apache spark
  • biological data mining
  • distributed algorithms
Graduation session start date
14/07/2017
Availability
Full
Abstract
"Too much information, not enough knowledge" is one of the maxims of these first two decades of the 21th century. Thanks to the technological advances, an unprecedented amounts of data are now available, and these data collections become so large and complex - this is why they are called Big Data - that traditional data processing application software is inadequate to deal with them. Biomedical sciences are already massively contributing to the Big Data revolution, due to advances in genome sequencing technology and digital imaging, growth of clinical data warehouses, increased role of the patient in managing his own health information. In this work, thanks to Apache Spark - a fast and general engine for large-scale data processing, with built-in modules for streaming, SQL, machine learning and graph processing - it has been possible to work with The Cancer Genome Atlas data - a project that aims to catalogue genetic mutations responsible for cancer, using genome sequencing and bioinformatics - in order to develop a scalable and reproducible method for data preparation and data investigation Succesively, such method has been applied in order to investigate Copy Number Variations data with classification algorithms tailored for distribute computing on Apache Spark. The results are encouraging and underline the effectiveness of data mining on biomedical big data.
File