Digital archive of theses discussed at the University of Pisa


Thesis etd-10172017-122156

Thesis type
Doctoral (PhD) thesis
Thesis title
Privacy Risk Assessment in Big Data Analytics and User-Centric Data Ecosystems
Academic discipline
Course of study
tutor Giannotti, Fosca
supervisor Pedreschi, Dino
Keywords
  • privacy
  • risk assessment
  • big data
  • analytics
Graduation session start date
Nowadays, our daily life is centered on data. Whether or not we are aware of it, our simple everyday interactions through digital devices produce a myriad of data that are combined to create Big Data. We leave traces of our movements via our mobile phones and GPS devices, of our relationships within social networks, and of our habits and tastes in query logs and records of what we buy. These digital breadcrumbs are a treasure trove: they make it possible to discover new patterns in human activities and to better understand many aspects of human behavior that were impossible to study or analyze just a few years ago. The resulting data can also enable a totally new class of services that can directly and appreciably improve our society or provide ways to tackle and solve problems from new perspectives. The other side of the coin is the question of privacy: since the data describe our lives at a very detailed level, privacy breaches can occur, along with inferences that reveal the most personal details. For example, a malicious party could uncover our home location from GPS tracks, our love life from call records or communication in social networks, and our health status from the products that we buy in a supermarket. For this reason, we are witnessing changes in ethical and legal norms, with a move towards a novel vision of data management that gives appropriate priority to privacy and to individuals.
The objective of this thesis is two-fold. Firstly, we propose a framework that aims to enable a privacy-aware data sharing ecosystem, based on Privacy-by-Design. This framework, called PRISQUIT (Privacy RISk versus QUalITy), can support a Data Provider in sharing collected personal data with an external entity, e.g., a Service Developer. PRISQUIT helps to decide the right level of aggregation of the data and the appropriate strategies for enforcing privacy, by quantifying the actual, empirical privacy risk of the individuals and highlighting the users most at risk and, consequently, the data related to them. It then analyzes the data quality that can be guaranteed when only the data of users not at risk are released. The framework is modular, so it can be extended with new kinds of data, new privacy risk and utility functions, new types of background knowledge, new services to be developed and new mitigation strategies.
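The idea of quantifying an empirical privacy risk per user and flagging the users most at risk can be illustrated with a minimal Python sketch. It is not the PRISQUIT implementation: it assumes, purely for illustration, that each user's data is a set of visited locations, that the attacker's background knowledge is any k of those locations, and that risk is the reciprocal of the number of users matching that knowledge. The names reidentification_risk and users_at_risk and the threshold value are hypothetical.

```python
from itertools import combinations


def reidentification_risk(records, k=2):
    """Worst-case re-identification risk per user.

    records: dict mapping a user id to the set of locations the user visited.
    k: number of locations the attacker is assumed to know (an assumption of
       this sketch, not a parameter taken from the thesis).
    """
    risk = {}
    for user, locations in records.items():
        size = min(k, len(locations))
        worst = 0.0
        # Try every possible background knowledge of `size` known locations.
        for known in combinations(sorted(locations), size):
            known = set(known)
            # Count users whose data are compatible with what the attacker knows.
            # The user always matches their own data, so matches >= 1.
            matches = sum(1 for other in records.values() if known <= other)
            worst = max(worst, 1.0 / matches)
        risk[user] = worst
    return risk


def users_at_risk(risk, threshold=0.5):
    """Users whose risk exceeds the (hypothetical) acceptable threshold."""
    return sorted(user for user, value in risk.items() if value > threshold)
```

Under these assumptions, a Data Provider could release only the data of users not returned by users_at_risk and then measure how much analytical quality the remaining data retain.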
Secondly, we investigate the privacy perspective within a user-centric model, where each individual has full control of the life cycle of their personal data. To this end, we take advantage of the outcome of PRISQUIT by studying the correlation between some individual features, such as the entropy of visited locations, and the actual privacy risk. Then we design a method that allows each user to obtain an estimate of their own privacy risk. This tool increases awareness about individual personal data and thus helps people choose whether or not to share their data with third parties. After that, we propose three privacy-preserving transformations based on the differential privacy paradigm, which offers very strong privacy guarantees regardless of any external knowledge that a malicious agent has. These transformations can make the data private before they leave the individual who produces them.
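Two notions mentioned above, the entropy of visited locations and differential privacy, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not one of the three transformations proposed in the thesis: entropy is taken to be the Shannon entropy of a user's visit frequencies, and the private release uses the standard Laplace mechanism on visit counts, assuming that neighboring datasets differ by one visit (sensitivity 1) and that the list of candidate locations is public and fixed in advance. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def location_entropy(visits):
    """Shannon entropy (in bits) of the frequency of visited locations."""
    counts = Counter(visits)
    total = len(visits)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_visit_counts(visits, locations, epsilon, rng=None):
    """Visit counts perturbed with the Laplace mechanism.

    Assumes sensitivity 1 (adding or removing one visit changes one count
    by 1) and that `locations` is a public domain chosen independently of
    the user's data.
    """
    rng = rng or random.Random()
    counts = Counter(visits)
    scale = 1.0 / epsilon  # sensitivity / epsilon, with sensitivity = 1
    return {loc: counts[loc] + laplace_noise(scale, rng) for loc in locations}
```

A smaller epsilon gives a larger noise scale and therefore stronger privacy at the cost of less accurate counts; the perturbation runs on the user's side, so only noisy values leave the individual.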
We provide a wide range of experiments on three kinds of real-world data (mainly mobility data, but also mobile phone and retail data) to demonstrate the flexibility and utility of the PRISQUIT framework and the usefulness of the two approaches related to the user-centric ecosystem.