Thesis etd-04222010-151154
Thesis type
Doctoral thesis
Author
GIOVANNETTI, EMILIANO
URN
etd-04222010-151154
Title
A hybrid approach for semantic relation extraction in ontology learning from text
Scientific disciplinary sector
INF/01
Degree programme
INFORMATION ENGINEERING
Supervisors
tutor Prof. Simoncini, Luca
supervisor Dr. Montemagni, Simonetta
Keywords
- natural language processing
- ontologies
- ontology learning from text
- semantic annotation
- semantic relation extraction
Defence session start date
03/06/2010
Availability
Full
Abstract
This thesis proposes an unsupervised system for extracting semantic relations from text. Automatic extraction of semantic relations is crucial both for ontology learning from text and for semantic annotation, and it offers a way around the "knowledge acquisition bottleneck" in the context of the Semantic Web.
The system, evaluated on English and Italian but applicable to other languages, takes pairs of words as input and determines whether a semantic relation holds between them. The initial pairs of terms are extracted from a "Target Corpus" by an unsupervised statistical component that decides whether two terms can be considered "distributionally similar", following the assumption of distributional semantics that "the meaning of a word is strongly related to the contexts in which it appears."
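As an illustration of the distributional similarity idea only, the following is a minimal sketch that builds word-window context vectors and compares them with cosine similarity. The abstract does not say which statistical measure the thesis uses, so the window size, the cosine measure, the toy corpus and the function names `context_vectors` and `cosine` are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy stand-in for a "Target Corpus" (hypothetical data, not from the thesis).
corpus = [
    "the lion hunts in the savanna",
    "the tiger hunts in the jungle",
    "the car stops in the garage",
]
vectors = context_vectors(corpus)
lion_tiger = cosine(vectors["lion"], vectors["tiger"])
lion_car = cosine(vectors["lion"], vectors["car"])
```

In this toy corpus "lion" and "tiger" share all of their contexts while "lion" and "car" share only some, so the first pair scores higher and would be the better candidate to pass on for relation verification.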
To verify that a semantic relation actually holds between two terms and to determine its nature, the system searches a "Support Corpus" (the Web) for sentences in which both words occur inside "reliable" lexico-syntactic patterns, that is, patterns with low recall but high precision (for example, the words "steer" and "car" in the phrase "the steer is part of the car").
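The pattern-based verification step can be sketched roughly as follows. This is a hedged illustration, not the RelEx implementation: the pattern templates in `PATTERNS`, the function names `count_matches` and `best_relation`, and the list of snippets standing in for Web search results are all hypothetical.

```python
import re

# Hypothetical high-precision templates; {x} and {y} are the two candidate terms.
PATTERNS = {
    "part_of": ["{x} is part of the {y}", "{x} is a part of the {y}"],
    "is_a": ["{x} is a kind of {y}", "{y} such as {x}"],
}


def count_matches(x, y, sentences):
    """Count, per relation, how many (template, sentence) pairs match for the pair (x, y)."""
    counts = {}
    for relation, templates in PATTERNS.items():
        total = 0
        for template in templates:
            phrase = template.format(x=re.escape(x), y=re.escape(y))
            regex = re.compile(r"\b" + phrase + r"\b", re.IGNORECASE)
            total += sum(1 for s in sentences if regex.search(s))
        counts[relation] = total
    return counts


def best_relation(x, y, sentences):
    """Return the relation with the most pattern matches, or None if nothing matched."""
    counts = count_matches(x, y, sentences)
    relation = max(counts, key=counts.get)
    return relation if counts[relation] > 0 else None


# Toy stand-in for sentences retrieved from the "Support Corpus" (assumed data).
snippets = [
    "The steer is part of the car.",
    "A lion is a kind of feline.",
    "He parked the car near the steer.",
]
```

With these snippets, `best_relation("steer", "car", snippets)` picks `"part_of"`, while a pair that merely co-occurs outside any pattern yields `None`. That is the low-recall, high-precision trade-off in miniature: co-occurrence alone is never taken as evidence of a relation.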
This thesis describes the overall process that led to the development of the RelEx system, from the definition and application of the lexico-syntactic patterns to the measures used to assess the reliability of the semantic relations that the system suggests. The work focuses on the semantic relations of hyponymy ("is_a"), meronymy ("part_of") and co-hyponymy (two terms that are hyponyms of the same term, such as "lion" and "tiger" with respect to "feline"). The approach can, however, be extended to other relations by changing the set of reliable patterns used.
The precision of the system was measured at 83.3% for hyponymy, 75% for meronymy and 72.2% for co-hyponymy, which supports the validity of the proposed approach.
In addition to the novel concepts of "Closed Pattern" and "Open Pattern", this work describes two new techniques. The first, called "trans-language boosting", applies reliable patterns and pairs of terms expressed in different languages in order to improve the performance of the system. The second, called "cross-reference near-synonymy extraction", applies "open" patterns to recognise near-synonymy relations.
File
File name | Size |
---|---|
Tesi_PhD_final.pdf | 812.75 Kb |