ETD

Digital archive of theses defended at the Università di Pisa

Thesis etd-05092016-090250


Thesis type
PhD thesis
Author
ATZORI, CLAUDIO
URN
etd-05092016-090250
Title
gDup: an integrated and scalable graph deduplication system.
Scientific-disciplinary sector
ING-INF/05
Degree programme
ENGINEERING
Supervisors
tutor Prof. Bernardeschi, Cinzia
tutor Dr. Manghi, Paolo
Keywords
  • workflow
  • big data
  • entity resolution
  • deduplication
  • record linkage
  • graph
  • information space
Defence session start date
07/06/2016
Availability
Full
Abstract
In this thesis we build on existing experience and solutions for duplicate identification in Big Data collections and address the broader and more complex problem of 'Entity Deduplication over Big Graphs'. By 'Graph' we mean any digital representation of an Entity Relationship model, hence entity types (with structured properties) and relationships between them. By 'Big' we mean that duplicate identification over the objects of such entity types cannot be handled with traditional backends and solutions, e.g. for collections ranging from tens of millions of objects upwards. By 'entity deduplication' we mean the combined process of duplicate identification and graph disambiguation. Duplicate identification aims at efficiently identifying pairs of equivalent objects of the same entity type, while graph disambiguation aims at removing the duplication anomaly from the graph.
A large number of Big Graphs are maintained today, e.g. collections populated over time with no duplicate controls or aggregations of multiple collections, and they need continuous or occasional entity deduplication. Examples are person deduplication in census records, deduplication of authors in library bibliographic collections (e.g. Google Scholar graph, Thomson Reuters citation graph, OpenAIRE graph), deduplication of catalogues from multiple stores, deduplication of Linked Open Data clouds resulting from the integration of multiple clouds, any subset of the Web, etc. As things stand today, data curators can find a plethora of tools supporting duplicate identification for Big collections of objects, which they can adopt to efficiently process the objects of individual entity type collections. However, the extension of such tools to the Big Graph scenario is absent, as is support for graph disambiguation. In order to implement a full entity deduplication workflow for Big Graphs, data curators end up building patchwork systems that are tailored to their graph data model, often bound to their physical representation of the graph (i.e. graph storage), expensive in terms of design, development, and maintenance, and in general not reusable by other practitioners with similar problems in different domains.
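
The two phases defined above, duplicate identification and graph disambiguation, can be illustrated on a toy in-memory graph. The following is only a minimal sketch under stated assumptions, not the system described in this thesis: the object identifiers, the `normalize` matching rule, and the function names `identify_duplicates` and `disambiguate` are all hypothetical, and the naive all-pairs comparison would not scale to a Big Graph.

```python
from itertools import combinations

# Toy graph (hypothetical data): objects keyed by id, each with an entity type.
objects = {
    "p1": {"type": "person", "name": "Maria Rossi"},
    "p2": {"type": "person", "name": "maria  rossi"},
    "p3": {"type": "person", "name": "Luca Bianchi"},
    "d1": {"type": "publication", "name": "A study"},
}
# Relationships as (source id, label, target id).
edges = [("p1", "authorOf", "d1"), ("p2", "authorOf", "d1"), ("p3", "coauthorOf", "p2")]

def normalize(name):
    """Hypothetical matching rule: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())

def identify_duplicates(objects, entity_type):
    """Duplicate identification: pairs of equivalent objects of one entity type."""
    ids = sorted(i for i, o in objects.items() if o["type"] == entity_type)
    return [(a, b) for a, b in combinations(ids, 2)
            if normalize(objects[a]["name"]) == normalize(objects[b]["name"])]

def disambiguate(objects, edges, pairs):
    """Graph disambiguation: keep one representative per group, redirect edges."""
    rep = {i: i for i in objects}

    def find(i):
        # Follow representative links, shortening the path along the way.
        while rep[i] != i:
            rep[i] = rep[rep[i]]
            i = rep[i]
        return i

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            rep[max(ra, rb)] = min(ra, rb)  # smallest id becomes representative
    merged_objects = {i: o for i, o in objects.items() if find(i) == i}
    merged_edges = sorted({(find(s), label, find(t)) for s, label, t in edges})
    return merged_objects, merged_edges

pairs = identify_duplicates(objects, "person")
clean_objects, clean_edges = disambiguate(objects, edges, pairs)
```

In this sketch the representative simply keeps its own properties; a real merge strategy would decide how to combine the properties of the duplicates.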

The first contribution of this thesis is a reference architecture for 'Big Graph Entity Deduplication Systems' (BGEDSs), which are integrated, scalable, general-purpose systems for entity deduplication over Big Graphs. BGEDSs are intended to support data curators with the out-of-the-box functionalities they need to implement all phases of duplicate identification and graph disambiguation. The architecture formally defines the challenge by providing a graph type language and a graph object language, defining the specifics of the entity deduplication phases, and explaining how such phases manipulate the initial graph to eventually return the final disambiguated graph. Most importantly, it defines the level of configuration, i.e. customization, that data curators should be able to exploit when relying on BGEDSs to implement entity deduplication.

The second contribution of this thesis is GDup, an implementation of a BGEDS whose instantiation is today used in the production environment of the OpenAIRE infrastructure, the European e-infrastructure for Open Science and Access. GDup can operate over Big Graphs represented using standards such as RDF graphs or JSON-LD graphs and conforming to any graph schema. The system supports highly configurable duplicate identification and graph disambiguation settings, allowing data curators to tailor object matching functions by entity type properties and to define the strategy for merging duplicate objects that will disambiguate the graph. GDup also provides functionalities to semi-automatically manage a Ground Truth, i.e. a set of trustworthy assertions of equality between objects, which can be used to preprocess objects of the same entity type and reduce computation time. The system is conceived to be extensible with other, possibly new, methods in the deduplication domain (e.g. clustering functions, similarity functions) and supports scalability and performance over Big Graphs by exploiting an HBase and Hadoop MapReduce stack.
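
To make the roles of clustering functions, similarity functions, and the Ground Truth more concrete, here is a small in-memory sketch. It is an illustration under assumptions and not GDup's implementation, which the abstract says relies on HBase and Hadoop MapReduce: the blocking key, the use of Python's `difflib.SequenceMatcher` as similarity function, the threshold of 0.93, the sample records, and the way the Ground Truth is applied (collapsing trusted-equal objects before any comparison) are all hypothetical choices.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical objects of a single entity type (id -> name).
records = {
    "a1": "Maria Rossi",
    "a2": "Maria Rosi",
    "a3": "Mario Rossi",
    "a4": "Luca Bianchi",
    "a5": "M. Rossi",
}
# Hypothetical Ground Truth: trusted assertions of equality as (keep, drop).
ground_truth = [("a1", "a5")]

def blocking_key(name):
    """Assumed clustering function: first letter of the last name token."""
    return name.split()[-1][0].lower()

def similarity(x, y):
    """Assumed similarity function: difflib ratio on lowercased strings."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

def deduplicate(records, ground_truth, threshold=0.93):
    # Preprocess with the Ground Truth: keep one representative per assertion.
    # Assumes each object appears in at most one assertion (no chains).
    rep = {i: i for i in records}
    for keep, drop in ground_truth:
        rep[drop] = keep
    survivors = sorted(i for i in records if rep[i] == i)

    # Clustering: only objects sharing a blocking key are compared.
    blocks = defaultdict(list)
    for i in survivors:
        blocks[blocking_key(records[i])].append(i)

    candidates = [pair for ids in blocks.values() for pair in combinations(ids, 2)]
    matches = [(a, b) for a, b in candidates
               if similarity(records[a], records[b]) >= threshold]
    return candidates, matches

candidates, matches = deduplicate(records, ground_truth)
all_pairs = len(list(combinations(records, 2)))
```

With these five records, comparing every pair would require 10 comparisons, while the Ground Truth preprocessing and blocking leave 3 candidate pairs. The trade-off is that a poorly chosen blocking key can separate true duplicates into different blocks, so they are never compared.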
File