Digital archive of theses discussed at the University of Pisa


Thesis etd-04022011-140328

Thesis type
PhD thesis
Thesis title
Data Mining of Biomedical Databases
Academic discipline
Course of study
tutor Prof. Landini, Luigi
Keywords
  • Bayesian network
  • biomedicine
  • data mining
  • microarray
Graduation session start date
Abstract
Data mining can be defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. This thesis focuses on data mining in biomedicine, one of its most interesting fields of application. Different kinds of biomedical data sets require different data mining approaches. Two approaches are treated in this thesis, in two separate and independent parts.
The first part deals with Bayesian networks, one of the most successful tools for medical diagnosis and therapy follow-up. Formally, a Bayesian network (BN) is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph. An algorithm for Bayesian network structure learning has been developed as a variation of the standard search-and-score approach. The proposed approach avoids creating redundant network structures that may include non-significant connections between variables. In particular, the algorithm determines which relationships between variables must be prevented by binarizing a square matrix containing the mutual information (MI) for all pairs of variables. Four different binarization methods are implemented. The binary MI matrix serves as a pre-conditioning step for the subsequent greedy search procedure that optimizes the network score, reducing the number of possible search paths. This approach has been tested on four different datasets and compared against the standard search-and-score algorithm as implemented in the DEAL package, with successful results. Moreover, a comparison among different network scores has been performed.
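The MI-based pre-conditioning described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names are hypothetical, and the thresholding rule (mean of the off-diagonal MI values) is an assumption chosen for illustration, since the abstract does not say which four binarization methods were used.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete 1-D arrays."""
    x_vals, x_idx = np.unique(x, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)
    # Joint frequency table, then normalise to a joint probability table.
    joint = np.zeros((len(x_vals), len(y_vals)))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal of x, shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of y, shape (1, ny)
    nz = joint > 0                          # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def mi_matrix(data):
    """Square, symmetric matrix of pairwise MI; columns of `data` are variables."""
    n_vars = data.shape[1]
    mi = np.zeros((n_vars, n_vars))
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            mi[i, j] = mi[j, i] = mutual_information(data[:, i], data[:, j])
    return mi

def allowed_edges(mi, threshold=None):
    """Binarise the MI matrix: True where an edge may be considered.

    ASSUMPTION: the default threshold is the mean off-diagonal MI; the
    thesis' own binarization methods are not described in the abstract.
    """
    off_diag = ~np.eye(mi.shape[0], dtype=bool)
    if threshold is None:
        threshold = mi[off_diag].mean()
    return (mi >= threshold) & off_diag

# Toy data: b copies a, c is exactly independent of a.
a = np.tile([0, 0, 1, 1], 50)
b = a.copy()
c = np.tile([0, 1, 0, 1], 50)
data = np.column_stack([a, b, c])

mi = mi_matrix(data)
mask = allowed_edges(mi)
```

A greedy search would then propose adding an edge between variables i and j only when `mask[i, j]` is true, which is how the binary matrix reduces the number of search paths.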
The second part of this thesis focuses on data mining of microarray databases. An algorithm for analysing Illumina microRNA microarray data in a systematic and easy way has been developed. The algorithm consists of two parts. The first part is pre-processing, which comprises two steps: variance stabilization and normalization. Variance stabilization is performed to remove, or at least reduce, heteroskedasticity, while normalization is performed to minimize systematic effects that are not constant among the samples of an experiment and that are not due to the factors under investigation. Three alternative variance stabilization strategies and three alternative normalization approaches are included, so that, considering all possible combinations, nine different ways of pre-processing the data are obtained. The second part of the algorithm deals with the statistical analysis for detecting differential expression, using linear models and empirical Bayes methods. The final result is the list of microRNAs that are significantly differentially expressed between two conditions. The algorithm has been tested on three different real datasets and partially validated with an independent approach (quantitative real-time PCR). Moreover, the influence of the different pre-processing methods on the discovery of differentially expressed microRNAs has been studied, and a comparison among the different normalization methods has been performed. This is the first study comparing normalization techniques for Illumina microRNA microarray data.
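As an illustration of the two pre-processing steps, the following Python sketch chains one variance-stabilizing transform with one normalization method. Both choices are assumptions made for the example: a log2 transform and quantile normalization are common options for microarray data, but the abstract does not name the three strategies of each kind that the thesis compares. The function names and the toy intensity values are hypothetical.

```python
import numpy as np

def log2_transform(intensities, offset=1.0):
    """Simple variance-stabilizing transform (ASSUMED choice for this sketch).

    The offset keeps the logarithm defined for zero intensities.
    """
    return np.log2(intensities + offset)

def quantile_normalize(matrix):
    """Quantile normalization (ASSUMED choice); rows are probes, columns are samples.

    Every sample is forced to share the same empirical distribution: the
    mean of the sorted columns. Ties are broken by sort order, not averaged.
    """
    order = np.argsort(matrix, axis=0)             # rank order within each sample
    reference = np.sort(matrix, axis=0).mean(axis=1)
    normalized = np.empty_like(matrix, dtype=float)
    for col in range(matrix.shape[1]):
        # The k-th smallest value of each sample becomes the k-th reference value.
        normalized[order[:, col], col] = reference
    return normalized

# Toy raw intensities: 4 probes x 3 samples (made-up numbers).
raw = np.array([
    [120.0,  300.0,  90.0],
    [ 15.0,   40.0,  10.0],
    [900.0, 2100.0, 700.0],
    [ 60.0,  130.0,  45.0],
])

stabilized = log2_transform(raw)
normalized = quantile_normalize(stabilized)
```

Swapping either function for an alternative with the same input and output shape gives another of the pre-processing combinations; the differential expression analysis would then be run on `normalized`.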