Digital archive of theses discussed at the University of Pisa


Thesis etd-01122017-160211

Thesis type
PhD thesis
Thesis title
NLP based Information Extraction methods for Patent Analysis
Academic discipline
Course of study
tutor Prof. Marcelloni, Francesco
tutor Prof. Dell'Orletta, Felice
Keywords
  • Information Extraction
  • Marketing
  • Natural Language Processing
  • Patent analysis
Graduation session start date
The focus of this thesis is the analysis of patents through NLP-based extraction systems. State-of-the-art systems for automatic patent analysis are designed for engineers and attorneys, and usually do not take into account the variety of other patent readers, such as marketers and designers, who are becoming increasingly interested in this topic. This new audience is interested in automatic patent analysis because patents contain relevant information that anticipates the availability of products on the market. Managing such information can help these readers identify new market trends and define successful strategies.

The main novelty of this work is that the entire information extraction pipeline has been designed to extract information relevant to this new audience. The work focuses on extracting the users who may benefit from an invention, the advantages that an invention brings, and the drawbacks that it solves.

The extraction problem is addressed by adapting existing tools originally designed to extract information from general-purpose texts.

The adaptation process introduces important novelties. First, a semi-automatic method is presented for developing a domain-specific training set for extracting the relevant entities, which minimizes the human annotation effort.
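One common way to reduce annotation effort, shown here only as an illustrative sketch and not as the method used in the thesis, is to pre-annotate text by projecting a small seed list of known entity mentions onto tokenized sentences, so that human annotators only need to correct the proposed tags. The function name `pre_annotate`, the seed phrases, the `USER` label, and the BIO tagging scheme are all assumptions introduced for this example.

```python
def pre_annotate(tokens, seeds, label):
    """Propose BIO tags by matching seed phrases; unmatched tokens get 'O'.

    Hypothetical helper: the output is meant to be reviewed by a human
    annotator, not used directly as gold-standard training data.
    """
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # Try longer seeds first so multiword phrases win over their sub-phrases.
    for seed in sorted(seeds, key=lambda s: -len(s.split())):
        words = seed.lower().split()
        n = len(words)
        if n == 0:
            continue  # ignore empty seeds
        for i in range(len(tokens) - n + 1):
            span_is_free = all(t == "O" for t in tags[i:i + n])
            if lowered[i:i + n] == words and span_is_free:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
    return list(zip(tokens, tags))


# Example sentence and seed list (both invented for illustration).
sentence = "The device helps visually impaired people and nurses".split()
seeds = ["visually impaired people", "nurses", "people"]
for token, tag in pre_annotate(sentence, seeds, "USER"):
    print(f"{token}\t{tag}")
```

The token/tag pairs produced this way can be written out in a column format and handed to annotators for correction before training a sequence labeller.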

Secondly, several learning algorithms and feature configurations were tested to improve the overall accuracy of the information extraction process.

Finally, a method was tested that combines the information extracted from patents with the analysis of social media text, and that is specifically conceived to extract advantages and drawbacks. This method relies on sentiment analysis of text extracted from social media, under the assumption that terms indicating advantages are generally perceived positively by people, and terms indicating drawbacks negatively.
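The polarity assumption can be illustrated with a minimal sketch: a candidate term is labelled an advantage if the social media posts mentioning it are positive on average, and a drawback if they are negative. The tiny `POLARITY` lexicon, the function names, and the example posts below are hypothetical and stand in for whatever sentiment analyser and data the thesis actually uses.

```python
# Hypothetical toy polarity lexicon; a real system would rely on a trained
# sentiment classifier or a full sentiment lexicon.
POLARITY = {
    "great": 1, "love": 1, "easy": 1,
    "annoying": -1, "hate": -1, "broken": -1,
}


def post_sentiment(post: str) -> int:
    """Sum the polarity of the lexicon words found in one post."""
    return sum(
        POLARITY.get(tok.strip(".,!?").lower(), 0) for tok in post.split()
    )


def classify_term(term: str, posts: list[str]) -> str:
    """Label a term by the mean sentiment of the posts that mention it."""
    scores = [post_sentiment(p) for p in posts if term.lower() in p.lower()]
    if not scores:
        return "unknown"
    mean = sum(scores) / len(scores)
    if mean > 0:
        return "advantage"
    if mean < 0:
        return "drawback"
    return "unknown"


# Invented example posts.
posts = [
    "I love the long battery life, great phone!",
    "The overheating is so annoying, I hate it.",
    "Long battery life makes it easy to travel.",
]
print(classify_term("battery life", posts))
print(classify_term("overheating", posts))
```

Averaging over all posts that mention a term, rather than trusting a single post, is a simple way to make the label less sensitive to individual noisy opinions.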