

Digital archive of theses discussed at the University of Pisa


Thesis etd-12232016-151401

Thesis type
PhD thesis
Thesis title
Data Flow Quality Monitoring in Data Infrastructures
Academic discipline
Course of study
tutor Prof. Avvenuti, Marco
tutor Dott. Manghi, Paolo
  • data flow
  • data infrastructure
  • data quality
  • metrics
  • monitoring
  • workflow
Graduation session start date
In the last decade, researchers, organizations, and funders worldwide have devoted considerable attention to the realization of Data Infrastructures (DIs), namely systems supporting researchers with the broad spectrum of resources they need to perform science. DIs are here intended as ICT (eco)systems offering data and processing components which can be combined into data flows so as to enable arbitrarily complex data manipulation actions serving the consumption needs of DI customers, be they humans or machines.

Data resulting from the execution of data flows represent an important asset both for the DI users, who typically seek the information they need, and for the organization (or community) operating the DI, whose existence and cost sustainability depend on the adoption and usefulness of the DI. On the other hand, when operating several data flows over time, several issues, well known to practitioners, may arise and compromise the behaviour of the DI, and therefore undermine its reliability and generate stakeholder dissatisfaction. Such issues have many causes, such as (i) the lack of any kind of guarantee (e.g. quality, stability, findability) from integrated external data sources, typically not under the jurisdiction of the DI; (ii) the occurrence, at any abstraction level, of subtle, unexpected errors in the data flows; and (iii) the continuously evolving nature of the DI, in terms of data flow composition and algorithms/configurations in use.

The autonomy of DI components, their use across several data flows, and the evolution of end-user requirements over time make DI data flows a critical environment, subject to the most subtle inconsistencies. Accordingly, DI users demand guarantees, and quality managers are called to provide them, on the “correctness” of the DI data flows' behaviour over time, to be quantified in terms of “data quality” and “processing quality”.
Monitoring the quality of data flows is therefore a key activity for ensuring the uptake and long-term existence of a DI. Indeed, monitoring can detect or anticipate misbehaviours of a DI's data flows, in order to prevent and correct errors, or at least “formally” justify to the stakeholders the underlying reasons for such errors, which may not be due to the DI. Moreover, monitoring can be vital for DI operation, as having hardware and software resources actively employed in processing low-quality data can yield inefficient resource allocation and waste of time.

However, data flow quality monitoring is further hindered by the “hybrid” nature of such infrastructures, which typically consist of a patchwork of individual components (“system of systems”), possibly developed by distinct stakeholders with possibly distinct life-cycles, evolving over time, whose interactions are regulated mainly by shared policies agreed at the infrastructural level. Due to such heterogeneity, DIs are generally not equipped with built-in monitoring systems of this kind, and to date DI quality managers are therefore bound to use combinations of existing tools, with non-trivial integration efforts, or to develop and integrate ex post their own ad hoc solutions, at a high cost of realization and maintenance.

In this thesis, we introduce MoniQ, a general-purpose Data Flow Quality Monitoring system enabling the monitoring of critical data flow components, which are routinely checked during and after every run of the data flow against a set of user-defined quality control rules to make sure the data flow meets the expected behaviour and quality criteria over time, as established upfront by the quality manager.
MoniQ introduces a monitoring description language capable of (i) describing the semantics and the time ordering of the observational intents, capturing the essence of the DI data flows to be monitored; and (ii) describing monitoring intents over the monitoring flows in terms of metrics to be extracted and controls to be ensured. The novelty of the language is that it incorporates the essence of existing data quality monitoring approaches, identifies and captures process monitoring scenarios, and, above all, provides abstractions to represent monitoring scenarios that combine data and process quality monitoring within the scope of a data flow. The study includes an extensive analysis of two real-world use cases used to support and validate the proposed approach, and discusses an implementation of MoniQ that provides quality managers with high-level tools to integrate the solution in a DI in an easy, technology-transparent, and cost-efficient way, so that they can start gaining insight into data flows by visualizing the trends of the defined metrics and the outcomes of the controls declared against them.
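
As a purely illustrative sketch of the idea of "metrics to be extracted and controls to be ensured", the following Python fragment shows one way such a check could look after a single run of a data flow. It is not MoniQ's language or API, which the abstract does not specify: the names `Metric`, `Control`, `completeness`, and `run_monitoring`, the sample records, and the 0.9 threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical record type: one item produced by a data flow step.
Record = Dict[str, object]


@dataclass
class Metric:
    """A named measurement extracted from the output of a data flow run."""
    name: str
    extract: Callable[[List[Record]], float]


@dataclass
class Control:
    """A named rule that the value of one metric is expected to satisfy."""
    name: str
    metric: str
    check: Callable[[float], bool]


def completeness(field: str) -> Callable[[List[Record]], float]:
    """Build an extractor returning the fraction of records with a non-empty field."""
    def extract(records: List[Record]) -> float:
        if not records:
            return 0.0
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        return filled / len(records)
    return extract


def run_monitoring(
    records: List[Record],
    metrics: List[Metric],
    controls: List[Control],
) -> Tuple[Dict[str, float], Dict[str, bool]]:
    """Extract every metric, then evaluate every control against its metric."""
    values = {m.name: m.extract(records) for m in metrics}
    outcomes = {c.name: c.check(values[c.metric]) for c in controls}
    return values, outcomes


# Assumed sample output of one data flow run (invented for illustration).
records = [
    {"id": 1, "title": "A"},
    {"id": 2, "title": ""},
    {"id": 3, "title": "C"},
    {"id": 4, "title": "D"},
]

metrics = [
    Metric("record_count", lambda rs: float(len(rs))),
    Metric("title_completeness", completeness("title")),
]

controls = [
    Control("non_empty_output", "record_count", lambda v: v > 0),
    # The 0.9 threshold is an arbitrary, assumed quality criterion.
    Control("titles_mostly_present", "title_completeness", lambda v: v >= 0.9),
]

values, outcomes = run_monitoring(records, metrics, controls)
print(values)
print(outcomes)
```

Storing `values` and `outcomes` after each run would give the kind of time series of metrics and control results whose trends the abstract says can be visualized. Separating metric extraction from control evaluation is a design choice of this sketch, not a documented property of MoniQ.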