Designing and evaluating one-class classifiers using unlabeled data
Department
INFORMATION ENGINEERING
Degree programme
AUTOMATION ENGINEERING
Supervisors
Prof. Giancarlo Zini, Prof. Marco Cococcioni, Prof. Beatrice Lazzerini
Keywords
anomaly detection
cross-validation
data description
fault detection
one-class classifier (OCC)
outlier detection
performance evaluation
positive and unlabeled data
Precision-Recall (PR)
Receiver operating characteristic (ROC)
target and unlabeled data
Defense date
13/12/2013
Availability
Full
Abstract
This work focuses on one-class classifiers (OCCs), a powerful tool introduced in the last decade to solve detection problems. Unlike classical two-class classifiers, where both target and outlier objects are assumed to be available for training and testing, an OCC models only the target class, i.e., it is learned from target data alone. This removes the need to collect a statistically reliable set of outlier conditions, saving time and, often, money. Unfortunately, the current state of the art still requires outliers during the performance assessment phase. When a set of outliers is not available, it has been proposed in the literature to generate artificial outliers to enable the assessment. However, this approach makes strong and often unjustified assumptions about the distribution of the outlier class.
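As a concrete illustration of this conventional workflow, the following is a minimal sketch, assuming scikit-learn's OneClassSVM, a Gaussian target class, and artificial outliers drawn uniformly from the targets' bounding box; these choices are illustrative assumptions, not settings taken from the thesis.

```python
# Sketch of the conventional assessment: train an OCC on target data only,
# then evaluate it against artificially generated outliers.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_target = rng.normal(size=(500, 2))                    # only the target class is available

occ = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_target)

# Artificial outliers: uniform samples in a box slightly larger than the targets.
lo, hi = X_target.min(axis=0) - 1.0, X_target.max(axis=0) + 1.0
X_art = rng.uniform(lo, hi, size=(500, 2))

scores = np.concatenate([occ.decision_function(X_target),
                         occ.decision_function(X_art)])
labels = np.concatenate([np.ones(len(X_target)), np.zeros(len(X_art))])
print("ROC AUC against artificial outliers:", roc_auc_score(labels, scores))
```

The resulting ROC AUC is only as trustworthy as the uniform-box assumption about where real outliers live, which is exactly the weakness noted above.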
This work introduces a novel performance assessment method that does not require outliers during the evaluation phase, leading to a purely outlier-free approach. This result is achieved by assuming that both a target set and an unlabeled set are available. The advantage of requiring an unlabeled set instead of an outlier set is that collecting unlabeled data is far easier and cheaper than collecting a reliable set of measurements of outlier conditions, which typically requires the intervention of a human expert and the disruption of the system under analysis.
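The abstract does not spell out the evaluation criterion itself, so the sketch below is only one hypothetical way target and unlabeled data could be combined into an outlier-free figure of merit, and is not presented as the thesis's method. The helper name unlabeled_rejection_rate, the 95% target-acceptance level, and the use of OneClassSVM are all assumptions made for illustration.

```python
# Hypothetical outlier-free proxy: fix the threshold so that a chosen fraction of
# held-out targets is accepted, then report the rejection rate on unlabeled data.
# Since the unlabeled set mixes targets and outliers, a higher rejection rate at a
# fixed target-acceptance level hints at a tighter, more discriminative description.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.model_selection import train_test_split

def unlabeled_rejection_rate(X_target, X_unlabeled, target_acceptance=0.95, seed=0):
    X_fit, X_val = train_test_split(X_target, test_size=0.3, random_state=seed)
    occ = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_fit)
    # Threshold chosen so that `target_acceptance` of held-out targets are accepted.
    thr = np.quantile(occ.decision_function(X_val), 1.0 - target_acceptance)
    return float(np.mean(occ.decision_function(X_unlabeled) < thr))
```

Nothing labeled as an outlier is needed anywhere in this computation, which is the practical point of a target-plus-unlabeled evaluation setting.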
We have validated the novel performance assessment method on three datasets: an artificial dataset with low input dimensionality, a real dataset widely used as a benchmark for machine learning algorithms, and a real dataset from a fault detection problem on a submersible pump. Our experiments demonstrate the validity of the new method and also highlight the potential dangers of using artificially generated outliers, especially for high-dimensional datasets.
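A small, hypothetical numerical illustration of that danger (not an experiment from the thesis): with artificial outliers drawn uniformly from the targets' bounding box, even a naive distance-to-mean detector looks nearly perfect as the dimensionality grows, so artificial-outlier ROC AUC can grossly overstate performance on realistic outliers. The Gaussian target model and sample sizes are assumptions for the demo only.

```python
# Demo: uniform artificial outliers become almost trivially separable in high dimension.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
for d in (2, 10, 50, 200):
    X_t = rng.normal(size=(1000, d))                        # Gaussian targets
    lo, hi = X_t.min(axis=0), X_t.max(axis=0)
    X_a = rng.uniform(lo, hi, size=(1000, d))               # artificial outliers
    score = -np.linalg.norm(np.vstack([X_t, X_a]), axis=1)  # naive distance-to-mean detector
    y = np.r_[np.ones(1000), np.zeros(1000)]
    print(f"d={d:4d}  ROC AUC vs artificial outliers: {roc_auc_score(y, score):.3f}")
```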