Outlier Selection and One-Class Classification
Ph.D. thesis abstract
Jeroen Janssens
Promotors: Prof.dr. E.O. Postma, Prof.dr. H.J. van den Herik
Date of defense: June 11, 2013
An anomaly is a real-world observation or event that deviates from what is considered to be normal. Examples of anomalies are: a terrorist attack, a forged painting, and a rotten apple. Because anomalies can be dangerous or expensive, detecting them is of utmost importance. A domain expert may suffer from three cognitive limitations that hamper the detection of anomalies: (1) fatigue, (2) information overload, and (3) emotional bias.
Real-world observations can often be measured and recorded to form a data set. A suitable representation translates anomalies into outliers, which are data points that deviate quantitatively from the majority of other data points. Outlier-selection algorithms are capable of automatically selecting outliers from large data sets. In the thesis we study the problem statement: To what extent can outlier-selection algorithms support domain experts with real-world anomaly detection?
We consider a domain expert to be supported when the following three conditions are satisfied. First, a domain expert should know how to evaluate and compare the performance of the outlier-selection algorithms. Second, a domain expert should have one or more algorithms available that are considered state-of-the-art. Third, a domain expert should know when to apply which algorithm. The three conditions lead to three accompanying research questions: (RQ1) How should we evaluate and compare the performance of outlier-selection algorithms?, (RQ2) To what extent can the concept of affinities be applied to outlier-selection algorithms?, and (RQ3) To what extent can we solve the one-class classifier selection problem using a meta-learning approach? The answers to these three research questions enable us to formulate an answer to the problem statement.
In Chapter 2 we introduce the main concepts and explain the experimental set-up that is employed for answering RQ1, RQ2, and RQ3. This is partly based on a review and analysis of the relevant literature on anomaly detection, outlier selection, and machine learning. The chapter consists of four parts. First, we present the two outlier-selection settings that we consider in the thesis: (1) unsupervised outlier selection and (2) one-class classification. We highlight their similarities and differences, and we make clear when each setting is appropriate. Second, we describe two types of data representations that outlier-selection algorithms can process: (1) the feature-vector representation and (2) the similarity-matrix representation. Third, we explain the experimental set-up in detail. It relies on statistical techniques such as (1) the Area Under the ROC Curve (AUC) performance measure, (2) the post-hoc Neményi statistical test, and (3) cross-validation. Fourth, we explain how to transform ordinary multi-class data sets into one-class data sets so that they can be used for our comparative experiments, as sketched below.
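As an illustration of the fourth part, the following minimal sketch shows one common way to derive one-class data sets from a multi-class data set: each class in turn is treated as the target (normal) class and the remaining classes as outliers. The exact procedure used in the thesis may differ, for example in how the outlier class is sampled.

```python
import numpy as np

def to_one_class_data_sets(X, y):
    """Turn a multi-class data set into one one-class data set per class.

    For every class label c, the points of class c are relabelled as
    target (normal) points and all remaining points as outliers.
    """
    data_sets = {}
    for c in np.unique(y):
        labels = np.where(y == c, 1, 0)  # 1 = target, 0 = outlier
        data_sets[c] = (X, labels)
    return data_sets

# Example: a toy data set with three classes yields three one-class data sets.
X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y = np.array([0, 0, 1, 1, 2, 2])
print({c: labels.tolist() for c, (_, labels) in to_one_class_data_sets(X, y).items()})
```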
In Chapter 3 we answer RQ1. We describe how anomalies are detected in the fields of Machine Learning (ML) and Knowledge Discovery in Databases (KDD). Both fields have their own anomaly-detection algorithms and corresponding evaluation procedures. Two well-known outlier-selection algorithms from the field of KDD are framed into the one-class classification framework so that they can be compared with three algorithms from the field of ML in a statistically valid way. We perform a comparative evaluation of the ML and KDD outlier-selection algorithms on real-world data sets and discuss the results.
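For instance, once the outlier scores of an algorithm are available together with the true labels, their AUC can be computed as in the short sketch below; scikit-learn's roc_auc_score is used here purely for illustration and is not necessarily the implementation used in the thesis.

```python
from sklearn.metrics import roc_auc_score

# y_true: 1 for outliers, 0 for normal points; scores: higher = more outlying.
# The AUC equals the probability that a randomly chosen outlier receives a
# higher score than a randomly chosen normal point.
y_true = [0, 0, 0, 1, 0, 1]
scores = [0.1, 0.3, 0.2, 0.9, 0.4, 0.7]
print(roc_auc_score(y_true, scores))  # 1.0 for a perfect ranking
```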
In Chapter 4, we provide an answer to RQ2. We present a novel, unsupervised algorithm for selecting outliers, called Stochastic Outlier Selection (SOS). SOS uses the concept of affinity to compute an outlier probability for each data point. These probabilities offer several advantages over the unbounded outlier scores computed by related algorithms. Using outlier-score plots, we illustrate and discuss the qualitative performances of SOS and four related algorithms. Then we evaluate SOS and the four related algorithms on eighteen real-world data sets and seven synthetic data sets. The results obtained on these 25 data sets show that SOS (1) achieves a significantly higher performance and (2) is more robust to data perturbations and varying densities than the four related algorithms. From these results we may conclude that SOS is an effective algorithm for selecting outliers.
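The sketch below conveys the idea of affinity-based outlier probabilities in a simplified form, with a single fixed bandwidth instead of the per-point variances of the actual SOS algorithm; see Chapter 4 for the exact formulation.

```python
import numpy as np

def affinity_outlier_probabilities(X, sigma=1.0):
    """Illustrative affinity-based outlier probabilities.

    A simplified sketch of the idea behind SOS: a fixed bandwidth `sigma`
    is assumed here instead of the adaptive, per-point variances of the
    actual algorithm.
    """
    # Pairwise squared Euclidean distances.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Affinity of point i to point j (no self-affinity).
    aff = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(aff, 0.0)
    # Binding probabilities: each row is normalised to sum to one.
    bind = aff / aff.sum(axis=1, keepdims=True)
    # A point is an outlier when no other point "binds" to it.
    return np.prod(1.0 - bind, axis=0)

# The isolated point receives an outlier probability close to one.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(affinity_outlier_probabilities(X, sigma=1.0))
```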
The research required to formulate an answer to RQ3 is split over Chapters 5 and 6. In Chapter 5, we discuss the No Free Lunch theorem, which states that there is no single best one-class classifier. For each one-class data set, a different one-class classifier may perform best. The performance of a one-class classifier is largely determined by the characteristics, or meta-features, of the one-class data set. We present three methods for preprocessing data that may improve the performance of certain one-class classifiers on certain one-class data sets. The results in this chapter are obtained in three steps. First, we implement 36 meta-features. Second, we apply the meta-features to 255 one-class data sets. Third, we empirically select the most informative meta-features. The selected meta-features serve as input for Chapter 6, where we use meta-learning to relate the meta-features to the one-class classifier performance.
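To give an impression of what meta-features look like, the sketch below computes a few simple, hypothetical data-set characteristics; they are not necessarily among the 36 meta-features implemented in the thesis.

```python
import numpy as np

def simple_meta_features(X):
    """A few illustrative meta-features of a one-class data set.

    Hypothetical examples of data-set characteristics; the meta-features
    actually used in the thesis are defined in Chapter 5.
    """
    n, d = X.shape
    return {
        "n_objects": n,
        "n_features": d,
        "mean_feature_std": float(X.std(axis=0).mean()),
        "mean_abs_correlation": float(
            np.abs(np.corrcoef(X, rowvar=False)).mean()
        ) if d > 1 else 1.0,
    }

# Example on a random data set with 100 objects and 5 features.
X = np.random.default_rng(0).normal(size=(100, 5))
print(simple_meta_features(X))
```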
In Chapter 6, we provide an answer to RQ3 by using the output from Chapter 5. First, we describe 19 one-class classifiers. Subsequently, we perform a comparative experiment by applying the 19 one-class classifiers to the 255 one-class data sets from Chapter 5. To answer RQ3, we aim to understand the relationship between the previously computed meta-features and the one-class classifier performances. To this end, we set up a meta-learning experiment in which we aim to solve the one-class classifier selection problem, i.e., to predict the most appropriate one-class classifier for a given, unseen one-class data set. The results indicate to what extent we can solve the one-class classifier selection problem by using a meta-learning approach.
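The meta-learning idea can be sketched as follows, using made-up data and placeholder classifier names; the actual experimental design (classifiers, meta-features, and validation scheme) is described in Chapter 6.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up meta-data for illustration: each row of `meta_X` holds the
# meta-feature values of one one-class data set, and `best_clf` names the
# classifier that performed best on it (e.g., according to AUC). The
# classifier names are placeholders, not those of the thesis.
rng = np.random.default_rng(0)
meta_X = rng.normal(size=(255, 4))            # 255 data sets, 4 meta-features
best_clf = rng.choice(["SVDD", "Gauss", "kNN"], size=255)

# Train a meta-learner that maps meta-features to the best classifier.
meta_learner = RandomForestClassifier(n_estimators=100, random_state=0)
meta_learner.fit(meta_X, best_clf)

# Recommend a one-class classifier for an unseen data set.
new_meta_features = rng.normal(size=(1, 4))
print(meta_learner.predict(new_meta_features))
```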
In Chapter 7 we conclude the thesis. We discuss the findings of the thesis on a general level and make four recommendations for future research. By reviewing the answers to the three research questions, we arrive at three conclusions. First, we demonstrated how to evaluate and compare the performance of outlier-selection algorithms and one-class classifiers. Second, we developed a new state-of-the-art outlier-selection algorithm called SOS. Third, we have shown that a meta-learning approach can provide a domain expert with insight into when to apply which one-class classifier given a certain data set. Seeing that the three conditions mentioned above have been satisfied, we may now answer our problem statement. Our conclusion reads that the domain expert can be supported by outlier-selection algorithms to some extent.