Master's degree thesis
Towards Effective Use of Massive Cancer Genomic Data with Cluster Computing Frameworks
Degree programme
supervisor Dr. Bechini, Alessio
- distributed algorithms
- biological data mining
- Apache Spark
Session start date
Release date
"Too much information, not enough knowledge" is one of the maxims of the first two decades of the 21st century. Thanks to technological advances, unprecedented amounts of data are now available, and these data collections have become so large and complex (hence the name Big Data) that traditional data processing software is inadequate to deal with them. The biomedical sciences are already contributing massively to the Big Data revolution, owing to advances in genome sequencing technology and digital imaging, the growth of clinical data warehouses, and the increased role of patients in managing their own health information. In this work, Apache Spark, a fast and general engine for large-scale data processing with built-in modules for streaming, SQL, machine learning and graph processing, made it possible to work with data from The Cancer Genome Atlas, a project that aims to catalogue the genetic mutations responsible for cancer using genome sequencing and bioinformatics, in order to develop a scalable and reproducible method for data preparation and data investigation. Subsequently, this method was applied to investigate Copy Number Variation data with classification algorithms tailored for distributed computing on Apache Spark. The results are encouraging and underline the effectiveness of data mining on biomedical big data.
1 file is not available for consultation, at the author's request.