ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-10022018-153541


Thesis type
PhD thesis
Author
MEONI, MARCO
Email address
marco.meoni@gmail.com
URN
etd-10022018-153541
Title
Mining Predictive Models for Big Data Placement
Scientific disciplinary sector
INF/01
Course of study
COMPUTER SCIENCE
Supervisors
tutor Prof. Perego, Raffaele
co-supervisor Dott. Tonellotto, Nicola
Keywords
  • LHC
  • CMS
  • CERN
  • Caching
  • Classification
  • Dataset Popularity
  • Big Data
  • Machine Learning
Defense session start date
19/10/2018
Availability
Full
Abstract
The Compact Muon Solenoid (CMS) experiment at the European Organization for Nuclear Research (CERN) deploys its data collection, simulation, and analysis activities on a distributed computing infrastructure involving more than 70 sites worldwide. Over the last few years, the historical usage data produced by this large infrastructure has been recorded on Big Data clusters featuring more than 5 Petabytes of raw storage, with several open-source user-level tools available for analytical purposes. The clusters offer a broad variety of computing and storage logs that represent a valuable, yet scarcely investigated, source of information for system tuning and capacity planning. Among the possible analyses, understanding and predicting dataset popularity is of primary interest to CERN: an accurate solution enables effective placement policies for the most frequently accessed datasets, resulting in remarkably shorter job latencies and increased system throughput and resource usage.
This thesis investigates three key requirements for Petabyte-scale dataset popularity models in a worldwide computing system such as the Worldwide LHC Computing Grid (WLCG): the need for an efficient Hadoop data vault underpinning an effective mining platform, capable of collecting computing logs from different monitoring services and organizing them into periodic snapshots; the need for a scalable pipeline of machine learning tools for training, on these snapshots, predictive models able to forecast which datasets will become popular over time, thus discovering patterns and correlations useful to enhance the overall efficiency of the distributed infrastructure; and the need for a novel caching policy, based on the dataset popularity predictions, that can outperform the current dataset replacement implementation.
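To make the snapshot organization concrete, here is a minimal sketch, assuming PySpark; the HDFS paths and the log schema (access_time, dataset, site) are purely illustrative, not the actual CMS layout:

    # Minimal sketch: build weekly popularity snapshots from raw access logs.
    # All paths and column names below are hypothetical, not the CMS schema.
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("popularity-snapshots").getOrCreate()

    # Raw access logs collected from the different monitoring services.
    logs = spark.read.parquet("hdfs:///cms/monitoring/access_logs")

    # Count accesses per dataset, site, and weekly time slot.
    snapshots = (logs
        .withColumn("week", F.date_trunc("week", F.col("access_time")))
        .groupBy("dataset", "site", "week")
        .agg(F.count("*").alias("n_accesses")))

    # Persist one partition per weekly snapshot for later model training.
    snapshots.write.partitionBy("week").mode("overwrite") \
        .parquet("hdfs:///cms/popularity/snapshots")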
The main contributions of this thesis include the following results:
1. We propose and implement a scalable machine learning pipeline, built on top of the CMS Hadoop data store, to predict the popularity of new and existing datasets accessed by jobs processing any of the 25 event types stored in the distributed CMS infrastructure. We cast the problem of predicting dataset accesses as a binary classification problem in which we forecast whether a given dataset will be popular at time slot t at a given CMS site s, i.e., whether it will be accessed more than x times during time slot t at site s (see the first sketch after this list). Our experiments show that the proposed predictive models reach very satisfactory accuracy, indicating the ability to correctly separate popular datasets from unpopular ones.
2. We propose a novel intelligent data caching policy, named PPC (Popularity Prediction Caching). This caching strategy exploits the popularity predictions obtained with our best performing classifier to optimize the eviction policy implemented at each site of the CMS infrastructure. We assess the effectiveness of this policy by measuring the hit rates achieved by PPC and by caching baselines such as LRU (Least Recently Used) in managing the dataset access requests over a two-year timespan at 6 CMS sites (see the second sketch after this list). The results of our simulation show that PPC outperforms LRU, reducing the number of cache misses by up to 20% at some sites.
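The binary classification cast in contribution 1 can be illustrated with the following minimal sketch, assuming PySpark ML on the hypothetical snapshot layout above; the threshold value and the feature columns are illustrative stand-ins, not those selected in the thesis:

    # Minimal sketch: label snapshots as popular/unpopular, train a classifier.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("popularity-model").getOrCreate()
    snapshots = spark.read.parquet("hdfs:///cms/popularity/snapshots")

    X_THRESHOLD = 10  # hypothetical x: accesses per slot defining "popular"

    # Binary label: 1.0 if the dataset is popular at site s in time slot t.
    labeled = snapshots.withColumn(
        "label", (F.col("n_accesses") > X_THRESHOLD).cast("double"))

    # Illustrative feature columns; the thesis derives its own from the logs.
    data = VectorAssembler(
        inputCols=["accesses_prev_week", "users_prev_week", "size_gb"],
        outputCol="features").transform(labeled)

    train, test = data.randomSplit([0.8, 0.2], seed=42)
    model = RandomForestClassifier(labelCol="label",
                                   featuresCol="features").fit(train)

    auc = BinaryClassificationEvaluator().evaluate(model.transform(test))
    print(f"test AUC = {auc:.3f}")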
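The second sketch, self-contained plain Python, conveys the idea behind comparing a PPC-style policy with LRU; the popularity predictor is a stub standing in for the trained classifier, and the synthetic trace is for illustration only:

    # Minimal cache simulator: LRU vs. prediction-driven eviction (PPC-style).
    from collections import OrderedDict

    def simulate(trace, capacity, pick_victim):
        """Replay dataset requests through a cache; return the hit rate."""
        cache, hits = OrderedDict(), 0
        for dataset in trace:
            if dataset in cache:
                hits += 1
                cache.move_to_end(dataset)         # refresh recency on a hit
            else:
                if len(cache) >= capacity:
                    cache.pop(pick_victim(cache))  # policy selects the victim
                cache[dataset] = True
        return hits / len(trace)

    def lru_victim(cache):
        return next(iter(cache))                   # least recently used entry

    def predicted_popularity(dataset):
        # Stub: in PPC this would be the classifier's popularity prediction.
        return sum(map(ord, dataset)) % 100 / 100.0

    def ppc_victim(cache):
        # Evict the cached dataset predicted to be least popular.
        return min(cache, key=predicted_popularity)

    trace = ["ds%d" % (i % 7) for i in range(1000)]
    print("LRU hit rate:", simulate(trace, 4, lru_victim))
    print("PPC hit rate:", simulate(trace, 4, ppc_victim))

On this cyclic trace LRU degenerates to evicting exactly the dataset about to be requested, while the prediction-driven policy pins the datasets expected to be popular, which mirrors the intuition behind PPC's improvement over LRU.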