ETD

Digital archive of the theses defended at the Università di Pisa

Thesis etd-04042018-225330


Thesis type
Master's degree thesis
Author
LA PERNA, FRANCESCO
URN
etd-04042018-225330
Title
Data Mining techniques for consumer credit scoring prediction and anomaly detection
Department
INFORMATICA
Degree programme
INFORMATICA PER L'ECONOMIA E PER L'AZIENDA (BUSINESS INFORMATICS)
Supervisors
supervisor Prof. Monreale, Anna
Keywords
  • anomaly detection
  • classification
  • consumer credit
  • data mining
  • scoring
Defense session start date
27/04/2018
Availability
Not available for consultation
Release date
27/04/2088
Abstract
This project, developed during an internship at the consulting company KPMG in Milan, explores data mining and machine learning techniques for the consumer-credit sector, with the aim of integrating them into the established procedures of the host company so that it can offer its clients a complete, high-quality service. Besides giving the host company greater knowledge of and experience with these modern techniques, the work has two main objectives: identifying in advance the consumers who will not be able to repay their debt, and identifying anomalous interest rates applied to credits.

Because the dataset contains the credit institution's historical data, each client's status is already known, so the first task is supervised: we framed it as a classification task and tested four different classification algorithms. The second task is unsupervised, since we have no information about which anomalies are present in the dataset. Here we used an ensemble of unsupervised learning techniques. We first clustered the data with k-means to identify groups of similar credits, and then followed two distinct approaches. The first applies cluster-based anomaly detection algorithms to the entire clustered dataset. The second splits the dataset into as many subsets as there are k-means clusters and applies a nearest-neighbor-based anomaly detection algorithm to each subset.

For the classification task, all the classifiers obtain very good results, probably because the data show an evident pattern with respect to default status. However, the Gradient Boosted Trees (GBT) algorithm outperforms the others both in overall performance and, more importantly, in performance on the minority class, which has very few examples in the dataset but is crucial for the credit institution. For the anomaly detection task, the second approach, and in particular the k-NN Global Anomaly Score algorithm, provides the most comprehensible and satisfactory results.
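
The second anomaly detection approach described in the abstract (cluster with k-means, then score each cluster separately with a nearest-neighbor method) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the toy data, the parameter values, and the choice of NumPy are all assumptions made for illustration, and the k-NN global score is taken here to be the mean Euclidean distance to the k nearest neighbors, which may differ from the exact definition used in the thesis.

```python
import numpy as np


def kmeans(X, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's k-means (illustrative); returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty).
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def knn_global_score(X, k):
    """Assumed definition: mean distance from each row to its k nearest neighbors."""
    if len(X) < 2:
        return np.zeros(len(X))  # no neighbors to compare against
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    k = min(k, len(X) - 1)
    return np.sort(dists, axis=1)[:, :k].mean(axis=1)


def per_cluster_scores(X, n_clusters=2, k=5, seed=0):
    """Cluster first, then score every point only against its own cluster."""
    labels = kmeans(X, n_clusters, seed=seed)
    scores = np.zeros(len(X))
    for c in np.unique(labels):
        mask = labels == c
        scores[mask] = knn_global_score(X[mask], k)
    return labels, scores


# Toy data (hypothetical): two tight groups plus one point far from both.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(40, 2)),
    rng.normal(10.0, 0.5, size=(40, 2)),
    [[4.0, 4.0]],
])
labels, scores = per_cluster_scores(X, n_clusters=2, k=5)
most_anomalous = int(np.argmax(scores))  # index of the highest-scoring row
```

Scoring within each cluster means a record is compared only with similar records, so a value that is unusual for its own group can stand out even if it would look ordinary against the dataset as a whole.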
File