

Digital archive of theses discussed at the University of Pisa


Thesis etd-10122017-195900

Thesis type
PhD thesis
Thesis title
Segregation aware data mining
Academic discipline
Course of study
tutor Prof. Ruggieri, Salvatore
  • data mining
  • segregation
  • social networks
  • social science
Graduation session start date
This thesis tackles the social segregation problem from a data science perspective by proposing a segregation-aware data mining framework for the discovery of segregation from relational data and from attributed graphs. The approach is implemented in an efficient system and evaluated on two challenging case studies in the domain of occupational segregation in company boards.
The framework builds on quantitative measures of segregation, called segregation indexes, proposed in the social science literature. The segregation discovery problem is first introduced for relational data. It consists of searching for sub-groups of the population and minorities for which a segregation index is above a minimum threshold.
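As an illustration of what a segregation index measures, the sketch below computes the dissimilarity index, a classic index from the social science literature. The abstract does not say which indexes the thesis uses, so this choice is an assumption, and the function name, the toy data and the threshold value are hypothetical.

```python
from typing import Dict, Tuple


def dissimilarity_index(units: Dict[str, Tuple[int, int]]) -> float:
    """Dissimilarity index D = 1/2 * sum_i |m_i/M - (t_i - m_i)/(T - M)|.

    `units` maps each organizational unit to (minority count, total count).
    D ranges from 0 (minority evenly spread) to 1 (complete segregation).
    """
    M = sum(m for m, _ in units.values())  # minority size over all units
    T = sum(t for _, t in units.values())  # population size over all units
    if M == 0 or M == T:
        raise ValueError("index undefined when minority or majority is empty")
    return 0.5 * sum(abs(m / M - (t - m) / (T - M)) for m, t in units.values())


# Hypothetical toy data: two boards with ten seats each.
segregated = {"board_A": (0, 10), "board_B": (10, 10)}
even = {"board_A": (5, 10), "board_B": (5, 10)}

min_threshold = 0.6  # assumed analyst-chosen minimum threshold
flagged = dissimilarity_index(segregated) >= min_threshold
```

In this toy setting the first population would be reported by a threshold-based search, while the second, evenly mixed one would not.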
A search algorithm is devised that solves the segregation discovery problem by computing a multi-dimensional data cube that can be explored by the analyst. The machinery underlying the search algorithm relies on frequent itemset mining tools.
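The idea of a segregation data cube can be sketched as follows. This is not the thesis algorithm: brute-force enumeration stands in for frequent itemset mining, conditions are limited to a single attribute-value pair, and the dissimilarity index is an assumed choice of index. All names and the toy records are hypothetical.

```python
from collections import defaultdict
from itertools import product


def dissimilarity(counts):
    """counts: iterable of (minority, total) pairs, one per unit."""
    counts = list(counts)
    M = sum(m for m, _ in counts)
    T = sum(t for _, t in counts)
    if M == 0 or M == T:
        return None  # index undefined for an empty minority or majority
    return 0.5 * sum(abs(m / M - (t - m) / (T - M)) for m, t in counts)


def segregation_cube(rows, minority_attrs, context_attrs, unit_attr, min_index):
    """Return {(minority_cond, context_cond): index} for cells with index >= min_index.

    Conditions are (attribute, value) pairs; a context of None means the
    whole population.
    """
    def conditions(attrs):
        return sorted({(a, r[a]) for r in rows for a in attrs})

    cube = {}
    contexts = [None] + conditions(context_attrs)
    for min_cond, ctx_cond in product(conditions(minority_attrs), contexts):
        per_unit = defaultdict(lambda: [0, 0])  # unit -> [minority, total]
        for r in rows:
            if ctx_cond is not None and r[ctx_cond[0]] != ctx_cond[1]:
                continue  # record lies outside the context sub-group
            cell = per_unit[r[unit_attr]]
            cell[1] += 1
            if r[min_cond[0]] == min_cond[1]:
                cell[0] += 1
        d = dissimilarity((m, t) for m, t in per_unit.values())
        if d is not None and d >= min_index:
            cube[(min_cond, ctx_cond)] = d
    return cube


# Hypothetical toy records: one row per director seat.
rows = [
    {"sex": "F", "sector": "edu", "unit": "c1"},
    {"sex": "F", "sector": "edu", "unit": "c1"},
    {"sex": "M", "sector": "edu", "unit": "c2"},
    {"sex": "M", "sector": "edu", "unit": "c2"},
    {"sex": "F", "sector": "it", "unit": "c3"},
    {"sex": "M", "sector": "it", "unit": "c3"},
    {"sex": "F", "sector": "it", "unit": "c4"},
    {"sex": "M", "sector": "it", "unit": "c4"},
]
cube = segregation_cube(rows, ["sex"], ["sector"], "unit", min_index=0.5)
```

Each key of `cube` is one cell that an analyst could inspect. The enumeration grows quickly with the number of attributes, which is the cost that a frequent itemset mining approach is meant to contain.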
The approach is then extended to graph data consisting of bipartite attribute graphs, which model real networks by enriching their nodes with attribute values.
Segregation indexes assume a partition of the population into organizational units (e.g., schools, neighborhoods, etc.), which are not obvious for graphs. We propose a fast and scalable algorithm for partitioning large attributed graphs. The approach does not require the user to guess in advance the number of clusters. Experimental results demonstrate its ability to efficiently compute high-quality partitions.
Our implementation of the framework, called SCube, supports an analyst in discovering contexts of social segregation. Users of the system include social scientists, policy decision makers in socially sensitive fields (urban development, public transportation and services, medical and health management, etc.), and control authorities.
The system is developed in Java 8, hence portable, and thanks to state-of-the-art libraries achieves good performance on large datasets. We demonstrate the applicability of the proposed methodology and tools in a complex scenario, reflecting the risks of modern segregation in occupational social networks. The scenario considers glass-ceiling barriers for women in accessing boards of company directors. Two case studies are presented, one considering Italian companies and the other Estonian companies. The latter case includes temporal information, thus allowing for temporal analysis of segregation.