Thesis etd-04222010-151154

Thesis type

PhD thesis

URN

etd-04222010-151154

Title

A hybrid approach for semantic relation extraction in ontology learning from text

Scientific disciplinary sector

INF/01 - COMPUTER SCIENCE

Degree programme

INFORMATION ENGINEERING

Supervisors

tutor Prof. Simoncini, Luca
supervisor Dr. Montemagni, Simonetta

Keywords

natural language processing
ontologies
ontology learning from text
semantic annotation
semantic relation extraction

Defense session start date

3 June 2010

Availability

Full

Abstract (English)

In this thesis we propose an unsupervised system for extracting semantic relations from text. Automatic extraction of semantic relations is crucial both for ontology learning from text and for semantic annotation, and it offers a solution to the "knowledge acquisition bottleneck" in the context of the Semantic Web.
The system, evaluated on English and Italian but applicable to any other language, takes pairs of words as input and determines whether a semantic relation holds between them. The initial pairs of terms are extracted from a "Target Corpus" by an unsupervised statistical component that decides whether two terms can be considered "distributionally similar", following the assumption of distributional semantics that "the meaning of a word is strongly related to the contexts in which it appears."
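The notion of distributional similarity can be illustrated with a minimal sketch. It assumes a simple word-window co-occurrence count and cosine similarity; the abstract does not say which statistical measure the thesis uses, so the function names, the window size and the toy corpus below are illustrative assumptions only.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """For every word, count the words occurring within `window` tokens of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy stand-in for a "Target Corpus" (hypothetical data).
corpus = [
    "the lion hunts in the savanna",
    "the tiger hunts in the jungle",
    "the car stops in the garage",
]
vectors = context_vectors(corpus)
sim_lion_tiger = cosine(vectors["lion"], vectors["tiger"])
sim_lion_car = cosine(vectors["lion"], vectors["car"])
print(sim_lion_tiger > sim_lion_car)
```

In this toy corpus "lion" and "tiger" share all of their contexts, so they score higher than "lion" and "car"; pairs above some threshold would be the candidates passed on to the relation-checking step.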
To verify that a semantic relation actually holds between two terms, and to determine its nature, the system searches a "Support Corpus" (the Web) for sentences in which both words occur within "reliable" lexico-syntactic patterns, that is, patterns with low recall but high precision (as, for example, the words "steer" and "car" in the phrase "the steer is part of the car").
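As a rough illustration of pattern-based checking, the sketch below counts how often a pair of terms fills a small set of patterns in a list of sentences. The pattern templates, the relation labels as dictionary keys and the sample sentences are hypothetical; the abstract does not list the patterns actually used, and a real run would take its sentences from Web search results rather than from a fixed list.

```python
import re

# Hypothetical pattern battery: each template has slots {x} and {y} for the two terms.
PATTERN_TEMPLATES = {
    "part_of": [r"\b{x} is (?:a )?part of (?:the |a |an )?{y}\b"],
    "is_a": [
        r"\b{x} is a (?:kind|type) of {y}\b",
        r"\b{y}s? such as {x}s?\b",
    ],
}


def count_matches(x, y, sentences):
    """Count, per relation, the sentences in which (x, y) fills one of the patterns."""
    counts = {}
    for relation, templates in PATTERN_TEMPLATES.items():
        total = 0
        for template in templates:
            regex = re.compile(
                template.format(x=re.escape(x), y=re.escape(y)), re.IGNORECASE
            )
            total += sum(1 for s in sentences if regex.search(s))
        counts[relation] = total
    return counts


# Toy stand-in for sentences retrieved from a "Support Corpus" (hypothetical data).
sentences = [
    "The steer is part of the car.",
    "Felines such as lions live in groups.",
    "A lion is a kind of feline.",
]
steer_car = count_matches("steer", "car", sentences)
lion_feline = count_matches("lion", "feline", sentences)
print(steer_car, lion_feline)
```

Narrow templates like these miss most ways of expressing a relation (low recall), but a match is rarely spurious (high precision), which is the trade-off the abstract describes.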
This thesis describes the overall process that led to the development of the RelEx system, starting from the definition and application of the lexico-syntactic patterns, and including the measures used to assess the reliability of the specific semantic relations that the system suggests. The work focuses on the semantic relations of hyponymy ("is_a"), meronymy ("part_of") and co-hyponymy (i.e. two terms are hyponyms of the same term, such as "lion" and "tiger" with respect to "feline"). The approach may, however, be extended to extract other relations by changing the battery of reliable patterns used.
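The definition of co-hyponymy given above can be made concrete with a short sketch that derives co-hyponym pairs from a list of "is_a" pairs. This only illustrates the definition; it is not claimed to be how RelEx detects co-hyponymy, and the function name and sample pairs are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def co_hyponyms(is_a_pairs):
    """Return the set of term pairs that share at least one hypernym."""
    by_hypernym = defaultdict(set)
    for hyponym, hypernym in is_a_pairs:
        by_hypernym[hypernym].add(hyponym)
    pairs = set()
    for hyponyms in by_hypernym.values():
        # Sorting makes each pair appear in one canonical order.
        pairs.update(combinations(sorted(hyponyms), 2))
    return pairs


# Hypothetical "is_a" pairs: (hyponym, hypernym).
pairs = co_hyponyms([("lion", "feline"), ("tiger", "feline"), ("car", "vehicle")])
print(pairs)
```

Here "lion" and "tiger" are both hyponyms of "feline", so they form the only co-hyponym pair; "car" has no sibling under "vehicle" and yields none.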
The precision of the system was measured at 83.3% for hyponymy, 75% for meronymy and 72.2% for co-hyponymy, which supports the validity of the proposed approach.
In addition to the novel concepts of "Closed Pattern" and "Open Pattern", this work describes two new techniques. The first, called "trans-language boosting", applies reliable patterns and pairs of terms expressed in different languages in order to improve the performance of the system. The second, called "cross-reference near-synonymy extraction", is based on the application of "open" patterns for the recognition of near-synonymy relations.

File

File name	Size
Tesi_PhD_final.pdf	812.75 Kb

ETD

Digital archive of theses defended at the University of Pisa
