
ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-06062024-102710


Thesis type
Ph.D. thesis
Author
DE BONIS, MICHELE
URN
etd-06062024-102710
Title
Enhancing Author Name Disambiguation Workflows in Big Data Scholarly Knowledge Graphs
Scientific disciplinary sector
ING-INF/06
Degree programme
Information Engineering (Ingegneria dell'Informazione)
Supervisors
tutor Prof. Avvenuti, Marco
tutor Dr. Falchi, Fabrizio
tutor Dr. Manghi, Paolo
Keywords
  • author name disambiguation
  • disambiguation
  • graph neural networks
  • scholarly communication graphs
Defense session start date
10 June 2024
Availability
Full
Abstract
Open Science, defined by its commitment to transparency, collaboration, openness, and accessibility, has deeply affected scientific research. Following this new paradigm, scientists produce and publish research data and software alongside research publications to enable reproducibility, monitoring, and assessment of science.
In this context, Scholarly Knowledge Graphs (SKGs) are “big data” metadata collections that play a crucial role in research discovery and assessment. They aggregate bibliographic metadata records and semantic relationships describing research products, the associations among them (e.g., citations, versions), and their links to other entities such as organizations, authors, and funders. Examples of SKGs are the OpenAIRE Graph, Google Scholar, OpenAlex, Semantic Scholar, OpenCitations, and ResearchGraph.org. However, constructing and maintaining SKGs demands innovative solutions to the scalability, heterogeneity, duplication, inconsistency, and incompleteness challenges introduced by the metadata sources being aggregated.
Motivated by the demands of Open Science and the challenges posed by SKG construction, this Ph.D. thesis contributes to the field of Author Name Disambiguation (AND), the long-standing problem of identifying and removing duplicate author nodes that represent the same author in the SKG. The thesis identifies two intertwined challenges in duplicate resolution. The first is efficiency: comparing hundreds of millions of author nodes pairwise has inherently quadratic complexity. The second is effectiveness: the optimization strategies adopted for efficiency give up some of the matches, and the metadata available for comparing author nodes is poor, often limited to the name string.
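To make the efficiency challenge concrete, the following minimal sketch counts the pairwise comparisons needed with and without a blocking step. Blocking is not named in the abstract; it is used here only as one common example of an efficiency strategy that skips some comparisons. The function names, the blocking key (last name plus first initial), and the sample names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n: int) -> int:
    """Number of comparisons when every node is compared with every other."""
    return n * (n - 1) // 2


def blocking_key(name: str) -> str:
    """Hypothetical key: lowercase last name plus first initial.

    Assumes names are written as 'Last, First'; anything else
    falls into a block of its own making.
    """
    last, _, first = name.partition(",")
    return f"{last.strip().lower()}_{first.strip()[:1].lower()}"


def candidate_pairs(names: list[str]) -> list[tuple[str, str]]:
    """Return only the pairs of names that share a blocking key."""
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


# Made-up sample records, for illustration only.
authors = [
    "Rossi, Maria",
    "Rossi, M.",
    "Bianchi, Luca",
    "Bianchi, L.",
    "Verdi, Anna",
    "Maria Rossi",  # plausibly the same person, but lands in another block
]

print(all_pairs_count(len(authors)))   # exhaustive comparisons
print(len(candidate_pairs(authors)))   # comparisons left after blocking
```

The last sample record shows the trade-off described above: the cheaper strategy never compares "Maria Rossi" with "Rossi, Maria", so that potential match is lost.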
To address the efficiency challenge, the thesis introduces FDup, a framework that redesigns and enhances the traditional disambiguation workflow. FDup focuses on optimizing the similarity match phase by means of a decision tree-based comparison technique. This approach makes the disambiguation workflow customizable and efficient, and it enables parallelization, which is crucial for handling the large datasets typical of Scholarly Knowledge Graphs.
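The abstract does not describe FDup's tree format, so the following is only an illustrative sketch of what a decision tree-based comparison of two author records could look like, not FDup's actual implementation. Each node runs one comparison and routes either to another node or to a final MATCH / NO_MATCH verdict, so a decisive check can end the comparison early. All node names, record fields, thresholds, and comparison functions are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable

MATCH, NO_MATCH = "MATCH", "NO_MATCH"


@dataclass
class Node:
    """One tree node: a comparison plus the next step for each outcome."""
    compare: Callable[[dict, dict], bool]
    if_true: str    # name of the next node, or MATCH / NO_MATCH
    if_false: str


def same_identifier(a: dict, b: dict) -> bool:
    """True when both records carry the same non-empty identifier."""
    return bool(a.get("id")) and a.get("id") == b.get("id")


def similar_name(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """True when the name strings are close (threshold is an assumption)."""
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold


def shared_coauthor(a: dict, b: dict) -> bool:
    """True when the two records have at least one co-author in common."""
    return bool(set(a.get("coauthors", [])) & set(b.get("coauthors", [])))


# Hypothetical tree: identifier first, then name, then co-authors.
TREE = {
    "start": Node(same_identifier, MATCH, "name"),
    "name": Node(similar_name, "coauthors", NO_MATCH),
    "coauthors": Node(shared_coauthor, MATCH, NO_MATCH),
}


def decide(a: dict, b: dict, tree: dict = TREE, root: str = "start") -> str:
    """Walk the tree from the root until a final verdict is reached."""
    current = root
    while current not in (MATCH, NO_MATCH):
        node = tree[current]
        current = node.if_true if node.compare(a, b) else node.if_false
    return current


# Made-up records, for illustration only.
r1 = {"name": "Maria Rossi", "coauthors": ["L. Bianchi"]}
r2 = {"name": "Maria Rossi", "coauthors": ["L. Bianchi", "A. Verdi"]}
r3 = {"name": "Luca Bianchi", "coauthors": ["A. Verdi"]}

print(decide(r1, r2))
print(decide(r1, r3))
```

Because `decide` depends only on its two input records, candidate pairs can be evaluated independently of one another, which is the property that makes this style of comparison straightforward to run in parallel.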
To address the effectiveness challenge, the thesis leverages Graph Neural Networks (GNNs), which have recently been applied successfully to node classification, graph classification, and link prediction. It proposes two dedicated GNN architectures that improve the effectiveness of Author Name Disambiguation by evaluating the outputs of a disambiguation algorithm. The first technique evaluates similarity relationships with an attentive neural network that integrates GraphSAGE models; the second evaluates groups of duplicates with a combination of Graph Attention Network (GAT) and Long Short-Term Memory (LSTM) components.
In summary, this thesis contributes to Open Science and Scholarly Knowledge Graphs by introducing novel frameworks and applying advanced techniques such as Graph Neural Networks to Author Name Disambiguation. Beyond addressing the current challenges, it lays the groundwork for the continued evolution of Open Science practices and for better use of Scholarly Knowledge Graphs as the body of scientific knowledge keeps growing.