ETD

Digital archive of theses discussed at the University of Pisa

Thesis etd-03252019-110144


Thesis type
Master's degree thesis
Author
EMANUELE, MARCO
URN
etd-03252019-110144
Title
Extraction of Technical Information from Unusual Sources
Department
INGEGNERIA DELL'ENERGIA, DEI SISTEMI, DEL TERRITORIO E DELLE COSTRUZIONI
Course of study
INGEGNERIA GESTIONALE
Supervisors
supervisor Prof. Fantoni, Gualtiero
co-supervisor Dr. Chiarello, Filippo
Keywords
  • POS Tagging
  • Text mining
  • Taxonomy
  • Soft Skills
  • Information Extraction
  • Tool
  • Database
  • Natural Language Processing
  • Keywords
  • Regular Expressions
Graduation session start date
02/05/2019
Availability
Not available
Release date
02/05/2089
Abstract
Society today finds itself in the so-called “Information Age”, in which the exponential growth of computing capability has enabled connectivity among compatible devices and, moreover, has resulted in a mass proliferation of data. Connectivity and data shape the way people live and work; for these reasons, organizations are starting to take action to collect, manage, represent, store and secure all useful data. A huge amount of information is now available, not only in structured and organized form but also in unstructured form, such as documents and texts. The key activity that firms are pursuing is therefore the comprehension of this body of knowledge, a task that would be impractical with a non-automatic approach. Text Mining, the process of extracting valuable, high-quality information from documents, helps to handle this unstructured material. This thesis uses Information Extraction techniques, which aim to derive structured information from unstructured data; Information Extraction is a narrower field than Text Mining. Information Extraction (IE) is defined as the automated retrieval of specific information related to a given topic from one or more bodies of text; briefly, its tasks involve collecting, processing and representing data in order to present the corresponding results.
The methodology outlined in this thesis builds on the advantages offered by Information Extraction tools, which make online searching more reliable by collecting a large amount of data that is then automatically compared; this also reduces bias in the results. The approach differs from one that would rely on domain experts, in this case labor market experts, to identify a list of sources. Although such expert consultation could yield a high-confidence result, it is much more costly and time-consuming than the automatic processing of these documents with Information Extraction tools.
Such information concerns not only production processes but also strategic decision tasks, such as Human Resources selection. The digitalization of HR Management has facilitated communication between recruiters and job seekers: candidates can now access job postings, while recruiters can access candidates' online profiles. This new way of sourcing suitable candidates enables recruiters to concentrate more on the interview, which is the most important recruitment phase and is still based on human interaction. Moreover, the job market is now represented by digital information that can be used for large-scale computing, which has opened the way for a computer-assisted recruitment process.
The purpose of this thesis is twofold. On the one hand, it consists of searching for the Unusual Sources that contain Technical Information, such as the knowledge and abilities that different firms seek in employees; on the other hand, it focuses on searching for the Hard and Soft Skills present in the job applications on the selected websites. For the first task, Elsevier's Scopus database was used to search for accessible, free and relevant web sources; for the second, job resumes were collected and each website was then evaluated for how richly the desired keywords occur on it. Furthermore, the thesis presents an approach for finding the needed Technical Information through linguistic awareness of word dependencies and relations, defining hyponyms and hypernyms and constructing a taxonomy of terms.
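The two ideas mentioned in the abstract, scoring a source by how often desired keywords occur in it and deriving hypernym/hyponym pairs for a taxonomy, can be illustrated with a minimal Python sketch. This is not the pipeline used in the thesis, whose details are not given in this record: the function names, the example keywords and the single "X such as A, B and C" pattern are all assumptions chosen for illustration, and the sketch uses only regular expressions rather than POS tagging or dependency parsing.

```python
import re
from collections import Counter

# Assumed lexical pattern: "<hypernym> such as <item>, <item> and <item>".
# Only the single word before "such as" is taken as the hypernym.
SUCH_AS = re.compile(r"(\w+)\s+such\s+as\s+([^.;]+)", re.IGNORECASE)

# Splits an enumeration on ", and", " and ", " or " or a plain comma.
LIST_SPLIT = re.compile(r",?\s+(?:and|or)\s+|,\s*")


def keyword_richness(text, keywords):
    """Count case-insensitive whole-word occurrences of each keyword."""
    counts = Counter()
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        counts[keyword] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


def extract_hyponyms(text):
    """Map each hypernym to the hyponyms listed after 'such as'."""
    taxonomy = {}
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1).lower()
        items = [item.strip().lower() for item in LIST_SPLIT.split(match.group(2))]
        taxonomy.setdefault(hypernym, []).extend(item for item in items if item)
    return taxonomy


if __name__ == "__main__":
    # Hypothetical snippet of text from a job-related web source.
    sample = ("Employers seek soft skills such as teamwork, communication "
              "and leadership. Teamwork matters.")
    print(keyword_richness(sample, ["teamwork", "communication", "python"]))
    print(extract_hyponyms(sample))
```

Comparing the per-keyword counts across sources gives a rough richness ranking, while the extracted pairs form one level of a term taxonomy; a realistic system would need more patterns and linguistic preprocessing than this sketch assumes.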
File