Thesis etd-06062024-102710

Thesis type

PhD thesis

URN

etd-06062024-102710

Title

Enhancing Author Name Disambiguation Workflows in Big Data Scholarly Knowledge Graphs

Scientific disciplinary sector

ING-INF/06 - Electronic and Information Bioengineering

Course of study

Information Engineering

Supervisors

tutor Prof. Avvenuti, Marco
tutor Dr. Falchi, Fabrizio
tutor Dr. Manghi, Paolo

Keywords

author name disambiguation
disambiguation
graph neural networks
scholarly communication graphs

Defense session start date

10/06/2024

Availability

Full

Abstract (English)

Open Science, defined by its commitment to transparency, collaboration, openness, and accessibility, has deeply affected scientific research. Following this new paradigm, scientists produce and publish research data and software alongside research publications to enable reproducibility, monitoring, and assessment of science.
In this context, Scholarly Knowledge Graphs (SKGs) are “big data” metadata collections that play a crucial role in research discovery and assessment. They aggregate bibliographic metadata records describing research products, together with the semantic relationships among those products (e.g., citations, versions) and between products and other entities such as organizations, authors, and funders. Examples of SKGs are the OpenAIRE Graph, Google Scholar, OpenAlex, Semantic Scholar, OpenCitations, and ResearchGraph.org. However, constructing and maintaining SKGs demands innovative solutions to address the scalability, heterogeneity, duplication, inconsistency, and incompleteness challenges introduced by the metadata sources being aggregated.
Motivated by the needs of Open Science and the challenges posed by SKG construction, this Ph.D. thesis contributes to the field of Author Name Disambiguation (AND), the long-standing problem of identifying and removing duplicate author nodes that represent the same author in the SKG. The thesis identifies two intertwined challenges in duplicate resolution. The first is efficiency: comparing hundreds of millions of author nodes pairwise has quadratic complexity. The second is effectiveness: the optimization strategies adopted for efficiency skip some of the candidate matches, and the metadata available for comparing author nodes is poor, often limited to the name string.
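The trade-off between the two challenges can be illustrated with a minimal sketch of blocking, a common way to avoid the quadratic number of comparisons. This is an illustration only, not the thesis's method: the key function `blocking_key`, the "Surname, Given" name format, and the sample names are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name: str) -> str:
    """Hypothetical key: lowercased surname plus the first initial.

    Assumes names are written as "Surname, Given".
    """
    surname, _, given = name.partition(",")
    return f"{surname.strip().lower()}|{given.strip()[:1].lower()}"


def candidate_pairs(names: list[str]) -> set[tuple[int, int]]:
    """Return index pairs that share a blocking key."""
    blocks = defaultdict(list)
    for idx, name in enumerate(names):
        blocks[blocking_key(name)].append(idx)
    pairs = set()
    for members in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.update(combinations(members, 2))
    return pairs


def all_pairs_count(n: int) -> int:
    """Number of comparisons without blocking: n * (n - 1) / 2."""
    return n * (n - 1) // 2


names = [
    "Rossi, Mario", "Rossi, M.", "Rossi, Marco",
    "Bianchi, Laura", "Bianchi, L.", "Verdi, Anna",
]
pairs = candidate_pairs(names)
print(len(pairs), "candidate pairs instead of", all_pairs_count(len(names)))
```

Blocking reduces the work, but it also shows where effectiveness is lost: a misspelled record such as "Rosi, Mario" would receive a different key and would never be compared with "Rossi, Mario".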
To address the efficiency challenge, the thesis introduces FDup, a framework that redesigns the traditional disambiguation workflow. FDup focuses on optimizing the similarity match phase by means of a decision tree-based comparison technique. This approach makes the disambiguation workflow customizable and efficient, and it enables parallelization, which is crucial for handling the large datasets typical of Scholarly Knowledge Graphs.
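As a rough illustration of what a decision tree-based comparison can look like, the sketch below chains comparators so that cheap or decisive checks run first and later checks are skipped once an outcome is reached. It does not reproduce FDup's actual API or configuration: the `Node` class, the comparators, the thresholds, and the record fields (`name`, `orcid`, `coauthors`) are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Union

MATCH, NO_MATCH = "MATCH", "NO_MATCH"


@dataclass
class Node:
    """One decision step: run a comparator, then branch on its score."""
    compare: Callable[[dict, dict], float]
    threshold: float
    on_pass: Union["Node", str]  # next node, or a final outcome
    on_fail: Union["Node", str]


def same_orcid(a: dict, b: dict) -> float:
    # 1.0 only when both records carry the same non-empty identifier.
    return 1.0 if a.get("orcid") and a.get("orcid") == b.get("orcid") else 0.0


def name_similarity(a: dict, b: dict) -> float:
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def coauthor_overlap(a: dict, b: dict) -> float:
    # Jaccard overlap of the two co-author sets.
    x, y = set(a.get("coauthors", [])), set(b.get("coauthors", []))
    return len(x & y) / len(x | y) if x | y else 0.0


def decide(node: Union[Node, str], a: dict, b: dict) -> str:
    """Walk the tree until a final outcome (a string) is reached."""
    while not isinstance(node, str):
        score = node.compare(a, b)
        node = node.on_pass if score >= node.threshold else node.on_fail
    return node


# Hypothetical tree: identifier first, then name, then co-authors.
coauthor_node = Node(coauthor_overlap, 0.3, MATCH, NO_MATCH)
name_node = Node(name_similarity, 0.8, coauthor_node, NO_MATCH)
root = Node(same_orcid, 1.0, MATCH, name_node)
```

Because `decide` depends only on the two records it receives, each candidate pair can be evaluated independently, which is what makes this kind of match phase straightforward to parallelize.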
To address the effectiveness challenge, the thesis leverages Graph Neural Networks (GNNs), which have recently been applied successfully to node classification, graph classification, and link prediction. It proposes two dedicated GNN architectures that improve the effectiveness of Author Name Disambiguation by evaluating the outputs of a disambiguation algorithm: the first evaluates similarity relationships with an attentive neural network integrating GraphSAGE models; the second evaluates groups of duplicates with a combination of Graph Attention Network (GAT) and Long Short-Term Memory (LSTM) components.
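For readers unfamiliar with GraphSAGE, the following sketch shows a single GraphSAGE-style layer with mean aggregation in plain Python. It is not the architecture proposed in the thesis (it has no attention mechanism and no training loop); the function names, the toy graph, and the identity weight matrices are assumptions chosen to keep the arithmetic easy to follow. The layer computes ReLU(W_self · h_v + W_neigh · mean of neighbor features), which is equivalent to applying one weight matrix to the concatenation of the two vectors.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def mean(vectors: List[Vector], dim: int) -> Vector:
    """Element-wise mean; a zero vector when a node has no neighbors."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def matvec(w: Matrix, x: Vector) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def sage_layer(
    features: Dict[str, Vector],
    neighbors: Dict[str, List[str]],
    w_self: Matrix,
    w_neigh: Matrix,
) -> Dict[str, Vector]:
    """One GraphSAGE-style layer with mean aggregation and ReLU."""
    dim = len(next(iter(features.values())))
    out = {}
    for node, h in features.items():
        agg = mean([features[n] for n in neighbors.get(node, [])], dim)
        z = [s + t for s, t in zip(matvec(w_self, h), matvec(w_neigh, agg))]
        out[node] = [max(0.0, v) for v in z]  # ReLU
    return out


# Toy graph: three author nodes with 2-dimensional features.
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
identity = [[1.0, 0.0], [0.0, 1.0]]
embeddings = sage_layer(features, neighbors, identity, identity)
```

With identity weights, node "a" ends up with its own features plus the average of its neighbors' features, so nodes with similar neighborhoods receive similar embeddings.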
In summary, this thesis contributes to Open Science and Scholarly Knowledge Graphs by introducing a new disambiguation framework and by applying Graph Neural Networks to Author Name Disambiguation. It addresses current challenges in SKG construction and lays the groundwork for the further evolution of Open Science practices and for better use of Scholarly Knowledge Graphs as scientific knowledge continues to grow.

File

File name	Size
DeBonisPhDThesis.pdf	6.42 Mb

ETD

Digital archive of theses defended at the Università di Pisa
