ETD

Digital archive of theses defended at the Università di Pisa

Thesis etd-11152021-233153


Thesis type
Master's degree thesis
Author
ELYANI, TIA
URN
etd-11152021-233153
Title
Identification of Sensitive Information Using AI/ML In Structured Dataset
Department
COMPUTER SCIENCE
Degree programme
DATA SCIENCE AND BUSINESS INFORMATICS
Supervisors
supervisor Prof. Monreale, Anna
Keywords
  • GDPR
  • structured dataset
  • graph convolutional network
  • name detection
  • personal data
Examination session start date
03/12/2021
Availability
Not available for consultation
Release date
03/12/2091
Abstract
Protection of personal and sensitive information has become mandatory, especially since the advent of the EU's General Data Protection Regulation (GDPR), which requires compliant processing of personal data. For large structured datasets, achieving compliance is a challenge that often requires significant manual effort. In this thesis, we propose a methodology to detect and classify personal data in relational databases using a Graph Convolutional Network. After defining a robust approach for detecting database features that contain personal names, we use this knowledge to construct a graph that follows the structure of the relational database. The graph nodes model database features and carry attributes such as, for example, their distance from the closest personal-name feature. We implemented Node Classification and Link Prediction to discover the relationships among database attributes. For personal name detection, we designed an approach that combines a reference dataset, Named Entity Recognition classification results, and queries on the Wikipedia knowledge base with an ad hoc classification methodology based on DBSCAN clustering of fastText word embeddings. Our model achieves very good results and high accuracy on names from many different countries and languages, tested on Western and Eastern news datasets. For the personal information detection approach on structured data, we carried out experiments on both simple and complex databases. We found that the structure of the graph and the features in its nodes play an important role in the learning process of the Graph Convolutional Network. This work is an initial approach and provides direction for further research.
File