Thesis etd-06222017-112501

Thesis type

Master's thesis

Author

RICATTO, MATTIA

URN

etd-06222017-112501

Title

Towards Effective Use of Massive Cancer Genomic Data with Cluster Computing Frameworks

Department

INFORMATION ENGINEERING

Degree programme

BIOMEDICAL ENGINEERING

Supervisors

supervisor Dr. Bechini, Alessio

Keywords

apache spark
biological data mining
distributed algorithms

Graduation session start date

14/07/2017

Availability

Full

Abstract

"Too much information, not enough knowledge" is one of the maxims of the first two decades of the 21st century. Thanks to technological advances, unprecedented amounts of data are now available, and these data collections have become so large and complex (hence the name Big Data) that traditional data processing software is inadequate to deal with them. The biomedical sciences are already contributing massively to the Big Data revolution, owing to advances in genome sequencing technology and digital imaging, the growth of clinical data warehouses, and the increased role of patients in managing their own health information. In this work, Apache Spark, a fast and general engine for large-scale data processing with built-in modules for streaming, SQL, machine learning and graph processing, made it possible to work with data from The Cancer Genome Atlas, a project that aims to catalogue the genetic mutations responsible for cancer using genome sequencing and bioinformatics, in order to develop a scalable and reproducible method for data preparation and data investigation. This method was then applied to investigate Copy Number Variation data with classification algorithms tailored for distributed computing on Apache Spark. The results are encouraging and underline the effectiveness of data mining on biomedical big data.

File

File name	Size
Tesi_intera.pdf	5.33 MB

ETD

Digital archive of theses defended at the Università di Pisa
