Designing and evaluating one-class classifiers using unlabeled data
Department
INFORMATION ENGINEERING
Degree programme
AUTOMATION ENGINEERING
Supervisors
Prof. Giancarlo Zini, Prof. Marco Cococcioni, Prof. Beatrice Lazzerini
Keywords
anomaly detection
cross-validation
data description
fault detection
one-class classifier (OCC)
outlier detection
performance evaluation
positive and unlabeled data
Precision-Recall (PR)
Receiver operating characteristic (ROC)
target and unlabeled data
Defense date
13/12/2013
Availability
Full
Abstract
This work focuses on one-class classifiers (OCCs), a powerful tool introduced in the last decade to solve detection problems. Unlike classical two-class classifiers, where both target and outlier objects are assumed to be available for training and testing, an OCC models only the target class, i.e., it is learned from target data alone. This removes the need to collect a statistically reliable set of outlier conditions, saving time and, often, money. Unfortunately, the current state of the art still requires outliers during the performance assessment phase. When a set of outliers is not available, it has been proposed in the literature to generate artificial outliers to enable the assessment. However, this approach makes strong and often unjustified assumptions about the distribution of the outlier class.
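As a concrete illustration of this conventional workflow, the following is a minimal sketch, assuming scikit-learn's OneClassSVM, a Gaussian target class, and artificial outliers drawn uniformly from the targets' bounding box; these choices are illustrative assumptions, not settings taken from the thesis.

```python
# Sketch of the conventional assessment: train an OCC on target data only,
# then evaluate it against artificially generated outliers.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_target = rng.normal(size=(500, 2))                    # only the target class is available

occ = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_target)

# Artificial outliers: uniform samples in a box slightly larger than the targets.
lo, hi = X_target.min(axis=0) - 1.0, X_target.max(axis=0) + 1.0
X_art = rng.uniform(lo, hi, size=(500, 2))

scores = np.concatenate([occ.decision_function(X_target),
                         occ.decision_function(X_art)])
labels = np.concatenate([np.ones(len(X_target)), np.zeros(len(X_art))])
print("ROC AUC against artificial outliers:", roc_auc_score(labels, scores))
```

The resulting ROC AUC is only as trustworthy as the uniform-box assumption about where real outliers live, which is exactly the weakness noted above.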
This work introduces a novel performance assessment method that does not require outliers during the evaluation phase, leading to a purely outlier-free approach. This result is achieved by assuming that both a target set and an unlabeled set are available. The advantage of requiring an unlabeled set instead of an outlier set is that collecting unlabeled data is far easier and cheaper than collecting a reliable set of measurements of outlier conditions, which typically requires the intervention of a human expert and the disruption of the system under analysis.
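The abstract does not spell out the evaluation criterion itself, so the sketch below is only one hypothetical way target and unlabeled data could be combined into an outlier-free figure of merit, and is not presented as the thesis's method. The helper name unlabeled_rejection_rate, the 95% target-acceptance level, and the use of OneClassSVM are all assumptions made for illustration.

```python
# Hypothetical outlier-free proxy: fix the threshold so that a chosen fraction of
# held-out targets is accepted, then report the rejection rate on unlabeled data.
# Since the unlabeled set mixes targets and outliers, a higher rejection rate at a
# fixed target-acceptance level hints at a tighter, more discriminative description.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.model_selection import train_test_split

def unlabeled_rejection_rate(X_target, X_unlabeled, target_acceptance=0.95, seed=0):
    X_fit, X_val = train_test_split(X_target, test_size=0.3, random_state=seed)
    occ = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_fit)
    # Threshold chosen so that `target_acceptance` of held-out targets are accepted.
    thr = np.quantile(occ.decision_function(X_val), 1.0 - target_acceptance)
    return float(np.mean(occ.decision_function(X_unlabeled) < thr))
```

Nothing labeled as an outlier is needed anywhere in this computation, which is the practical point of a target-plus-unlabeled evaluation setting.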
We have validated the novel performance assessment method on three datasets: an artificial dataset with low input dimensionality, a real dataset widely used as a benchmark for machine learning algorithms, and a real dataset from a fault detection problem on a submersible pump. Our experiments demonstrate the validity of the new method and also highlight the potential dangers of using artificially generated outliers, especially for high-dimensional datasets.
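A small, hypothetical numerical illustration of that danger (not an experiment from the thesis): with artificial outliers drawn uniformly from the targets' bounding box, even a naive distance-to-mean detector looks nearly perfect as the dimensionality grows, so artificial-outlier ROC AUC can grossly overstate performance on realistic outliers. The Gaussian target model and sample sizes are assumptions for the demo only.

```python
# Demo: uniform artificial outliers become almost trivially separable in high dimension.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
for d in (2, 10, 50, 200):
    X_t = rng.normal(size=(1000, d))                        # Gaussian targets
    lo, hi = X_t.min(axis=0), X_t.max(axis=0)
    X_a = rng.uniform(lo, hi, size=(1000, d))               # artificial outliers
    score = -np.linalg.norm(np.vstack([X_t, X_a]), axis=1)  # naive distance-to-mean detector
    y = np.r_[np.ones(1000), np.zeros(1000)]
    print(f"d={d:4d}  ROC AUC vs artificial outliers: {roc_auc_score(y, score):.3f}")
```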