
ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-11272013-101901


Thesis type
Laurea specialistica (Master of Science) thesis
Author
BOMBARA, GIUSEPPE
URN
etd-11272013-101901
Title
Designing and evaluating one-class classifiers using unlabeled data
Department
INFORMATION ENGINEERING
Degree programme
AUTOMATION ENGINEERING
Supervisors
supervisor Prof. Zini, Giancarlo
supervisor Prof. Cococcioni, Marco
supervisor Prof. Lazzerini, Beatrice
Keywords
  • anomaly detection
  • cross-validation
  • data description
  • fault detection
  • one-class classifier (OCC)
  • outlier detection
  • performance evaluation
  • positive and unlabeled data
  • Precision-Recall (PR)
  • Receiver operating characteristic (ROC)
  • target and unlabeled data
Defense session start date
13/12/2013
Availability
Full
Abstract
This work focuses on one-class classifiers (OCCs), a powerful tool introduced in the last decade to solve detection problems. Unlike classical two-class classifiers, where both target and outlier objects are assumed to be available for training and testing, an OCC models only the target class, i.e., it is learned from target data alone.
This removes the need to collect a statistically reliable set of outlier conditions, saving time and, often, money. Unfortunately, the current state of the art still requires outliers during the performance assessment phase.
When a set of outliers is not available, it has been proposed in the literature to generate artificial outliers so that performance can still be assessed. However, this approach makes strong and often unjustified assumptions about the distribution of the outlier class.
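
To make the setting concrete, the following is a minimal sketch (not taken from the thesis) of the workflow described above: a one-class classifier is trained on target data alone and then assessed in the conventional way, against artificially generated outliers. The uniform-hypercube outlier distribution and all parameter values are illustrative assumptions.

```python
# Minimal sketch (not from the thesis): train an OCC on target data only,
# then assess it conventionally, against artificially generated outliers.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Only the target class is observed, e.g., a machine in its healthy state.
targets = rng.normal(loc=0.0, scale=1.0, size=(600, 2))
targ_train, targ_test = targets[:400], targets[400:]

# The OCC is learned from target data alone.
occ = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(targ_train)

# Conventional assessment: draw artificial outliers uniformly from a
# hypercube enclosing the targets (a strong distributional assumption).
low, high = targets.min(axis=0) - 1.0, targets.max(axis=0) + 1.0
artificial = rng.uniform(low, high, size=(400, 2))

# Higher decision_function values mean "more target-like".
X = np.vstack([targ_test, artificial])
y = np.hstack([np.ones(len(targ_test)), np.zeros(len(artificial))])
print("ROC AUC vs. artificial outliers:", roc_auc_score(y, occ.decision_function(X)))
```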

This work introduces a novel performance assessment method that does not require outliers during the evaluation phase, leading to a pure and elegant outlier-free approach. This result is achieved by assuming that both a target set and an unlabeled set are available. The advantage of requiring an unlabeled set instead of an outlier set is that unlabeled data are far easier and cheaper to collect than a reliable set of measurements of outlier conditions, which typically requires the intervention of a human expert and the disruption of the system under analysis.
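
The abstract does not detail the proposed method, so the following is only a hedged sketch of the general target-plus-unlabeled idea: candidate classifiers are ranked by their acceptance rate on held-out targets minus their acceptance rate on the unlabeled set, the latter standing in for the unavailable false positive rate. The criterion, names, and synthetic data are illustrative assumptions, not the thesis's procedure.

```python
# Hedged sketch of assessing an OCC with targets + unlabeled data only.
# NOT the thesis's actual method; the proxy criterion below is an assumption.
import numpy as np
from sklearn.svm import OneClassSVM

def proxy_score(occ, targets_test, unlabeled):
    """Reward accepting held-out targets, penalize accepting unlabeled points.

    The acceptance rate on unlabeled data mixes the true and false positive
    rates, so for any classifier with TPR >= FPR it upper-bounds the FPR.
    """
    tpr = (occ.predict(targets_test) == 1).mean()  # accept rate on targets
    unl = (occ.predict(unlabeled) == 1).mean()     # accept rate on unlabeled
    return tpr - unl                               # illustrative trade-off

rng = np.random.default_rng(1)
targets = rng.normal(0.0, 1.0, size=(600, 2))
targ_train, targ_test = targets[:400], targets[400:]

# Unlabeled data: cheap to gather, an unknown mixture of targets and outliers.
unlabeled = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)),
                       rng.uniform(-4.0, 4.0, size=(100, 2))])

# Outlier-free model selection: rank hyper-parameters by the proxy criterion.
for gamma in (0.1, 0.5, 2.0):
    occ = OneClassSVM(kernel="rbf", gamma=gamma, nu=0.05).fit(targ_train)
    print(f"gamma={gamma}: proxy score = {proxy_score(occ, targ_test, unlabeled):.3f}")
```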

We validated the novel performance assessment method on three datasets: an artificial dataset with low input dimensionality, a real dataset widely used as a benchmark for machine learning algorithms, and a real dataset from a fault detection problem on a submersible pump. Our experiments demonstrate the validity of the new method and highlight the potential dangers of using artificially generated outliers, especially for high-dimensional datasets.