Original paper: HTML
Year: 2020
Citations: 396 (2020/10/03), 696 (2022/03/26)


Contents

  • Abstract
  • Introduction
  • Categorization of Anomaly Detection
    • Anomaly Detection Setups
    • Anomaly Detection Algorithm Output
    • Types of Anomalies
    • Normalization
  • Related Work
  • Unsupervised Anomaly Detection Algorithms
    • k-NN Global Anomaly Detection
    • Local Outlier Factor (LOF)
    • Connectivity-Based Outlier Factor (COF)
    • Influenced Outlierness (INFLO)
    • Local Outlier Probability (LoOP)
    • Local Correlation Integral (LOCI)
    • Approximate Local Correlation Integral (aLOCI)
    • Cluster-Based Local Outlier Factor (CBLOF/ uCBLOF)
    • Local Density Cluster-based Outlier Factor (LDCOF)
    • Clustering-based Multivariate Gaussian Outlier Score (CMGOS)
    • Histogram-based Outlier Score (HBOS)
    • One-Class Support Vector Machine
    • Robust Principal Component Analysis (rPCA)
    • Reviewing Algorithm Complexities and Implementation
  • Datasets for Benchmarking
    • Dataset Summary
  • Comparative Evaluation
    • Results of the Nearest-neighbor based Algorithms
    • Results of the Clustering-based Algorithms
    • Results of other Algorithms
    • Computation Time Comparison of Algorithms
  • Conclusion
  • Acknowledgments
  • Author Contributions

Abstract

Anomaly detection is the process of identifying unexpected items or events in datasets which differ from the norm. In contrast to standard classification tasks, anomaly detection is often applied on unlabeled data, taking only the internal structure of the dataset into account. This challenge is known as unsupervised anomaly detection and is addressed in many practical applications, for example in network intrusion detection, fraud detection as well as in the life science and medical domain. Dozens of algorithms have been proposed in this area, but unfortunately the research community still lacks a comparative universal evaluation as well as common publicly available datasets. These shortcomings are addressed in this study, where 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to be a new well-founded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. Besides the anomaly detection performance, the computational effort, the impact of parameter settings as well as the global/local anomaly detection behavior are outlined. As a conclusion, we give advice on algorithm selection for typical real-world tasks.


Introduction

In machine learning, the detection of “not-normal” instances within datasets has always been of great interest. This process is commonly known as anomaly detection or outlier detection. Probably the first definition was given by Grubbs in 1969 [1]: “An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs”. Although this definition is still valid today, the motivation for detecting these outliers is very different now. Back then, the main reason for the detection was to remove the outliers afterwards from the training data, since pattern recognition algorithms were quite sensitive to outliers in the data. This procedure is also called data cleansing. After the development of more robust classifiers, the interest in anomaly detection decreased a lot. However, there was a turning point around the year 2000, when researchers started to get more interested in the anomalies themselves, since they are often associated with particularly interesting events or suspicious data records. Since then, many new algorithms have been developed, which are evaluated in this paper. In this context, the definition of Grubbs was also extended such that today anomalies are known to have two important characteristics:

  • Anomalies are different from the norm with respect to their features and
  • They are rare in a dataset compared to normal instances.


Anomaly detection algorithms are now used in many application domains and often enhance traditional rule-based detection systems.


Intrusion detection is probably the most well-known application of anomaly detection [2, 3]. In this application scenario, network traffic and server applications are monitored. Potential intrusion attempts and exploits should then be identified using anomaly detection algorithms. Besides this network-based intrusion detection, host-based intrusion detection systems are also available, commonly using the system call data of running computers. In this context, security vendors often call anomaly detection behavioral analysis [4]. An important challenge in these often commercial Intrusion Detection Systems (IDS) is the huge amount of data to be processed in near real-time. For this reason, these systems typically use simple but fast anomaly detection algorithms. Intrusion detection systems are also a good example where anomaly detection complements traditional rule-based systems: They typically use pattern matching for the fast and reliable detection of known threats, while an additional anomaly detection module tries to identify yet unknown suspicious activity.


Fraud detection is another application of anomaly detection [5]. Here, typically log data is analyzed in order to detect misuses of a system or suspicious events indicating fraud. In particular, financial transactions can be analyzed in order to detect fraudulent accounting [6] and credit card payments logs can be used to detect misused or stolen credit cards [7]. With the strong growth in internet payment systems as well as the increase of offered digital goods, such as ebooks, music, software and movies, fraud detection becomes more and more important in this area. This is due to the fact that pure digital transactions attract scammers since they are less likely to be identified in the real world.


Data Leakage Prevention (DLP) is a third important application scenario, where sensitive information is protected by detecting data loss at an early stage [8]. In principle, it is similar to fraud detection, but with a focus on near-real-time analysis such that it serves as a precautionary method. In this context, accesses to databases, file servers and other information sources are logged and analyzed in order to detect uncommon access patterns.


In medical applications and life sciences, anomaly detection is also utilized. One example is patient monitoring, where electrocardiography (ECG) signals or other body sensors are used to detect critical, possibly life-threatening situations [9]. Additionally, anomaly detection is applied for analyzing medical images, for example computed tomography (CT) in order to detect abnormal cells or tumors. In this application, anomaly detection algorithms rely of course on complex image processing methods as a preprocessing step. In life sciences, anomaly detection might also be utilized to find pathologies and mutants.


Besides these four main application areas, anomaly detection is also used in many specialized applications. For example, surveillance camera data can be analyzed for suspicious movements [10], in smart buildings energy consumption anomalies can be found [11], mobile communication networks can be monitored [12] and also forged documents can be detected by a forensic application investigating printed documents [13]. Finally, it is also used in very complex systems in order to detect critical states, of which engineers and developers did not possibly think about during designing the system [14]. Among all these very different application domains, synonyms are often used for the anomaly detection process, which include outlier detection, novelty detection, fraud detection, misuse detection, intrusion detection and behavioral analysis. However, the basic underlying techniques refer to the same algorithms presented in the following sections. More detailed information about application domains as well as overviews of proposed algorithms can be found in the comprehensive surveys of Chandola et al. [15], Hodge et al. [16], Pimentel et al. [17] and Markos et al. [18].


As we can see from this huge variety, different practical requirements also exist for anomaly detection algorithms. In some cases they have to be very fast in a near real-time fashion. In other cases, detection performance is more important due to the high cost of missing an anomaly. In this context, it is also possible to classify the application domains with respect to the point in time when an anomaly should be detected. Besides post-incident analysis and near real-time detection, a predictive-driven motivation additionally exists, also known as early warning [19]. Of course, the latter is the most difficult anomaly detection task, but often major incidents are preceded by minor indications which can be detected.


In this article we present a comparative evaluation of a large variety of anomaly detection algorithms. Clearly, anomaly detection performance is one very important factor for algorithm selection. However, we will additionally outline strengths and weaknesses of the algorithms with respect to their usefulness for specific application scenarios. This supports our main goal that this work should serve as a guideline for selecting an appropriate unsupervised anomaly detection algorithm for a given task.



Categorization of Anomaly Detection

Anomaly Detection Setups

In contrast to the well-known classification setup, where training data is used to train a classifier and test data measures performance afterwards, there are multiple setups possible when talking about anomaly detection. Basically, the anomaly detection setup to be used depends on the labels available in the dataset and we can distinguish between three main types as illustrated in Fig 1:


Fig 1. Different anomaly detection setups depending on the availability of labels in the dataset. Supervised anomaly detection uses a fully labeled dataset for training. Semi-supervised anomaly detection uses an anomaly-free training dataset; afterwards, deviations of the test data from that normal model are used to detect anomalies. Unsupervised anomaly detection algorithms use only the intrinsic information of the data in order to detect instances deviating from the majority of the data.

  • Supervised Anomaly Detection describes the setup where the data comprises fully labeled training and test data sets. An ordinary classifier can be trained first and applied afterwards. This scenario is very similar to traditional pattern recognition, with the exception that classes are typically strongly unbalanced. Therefore, not all classification algorithms suit this task perfectly. For example, decision trees like C4.5 [20] cannot deal well with unbalanced data, whereas Support Vector Machines (SVM) [21] or Artificial Neural Networks (ANN) [22] should perform better. However, this setup is practically not very relevant due to the assumption that anomalies are known and labeled correctly. For many applications, anomalies are not known in advance or may occur spontaneously as novelties during the test phase.


  • Semi-supervised Anomaly Detection also uses training and test datasets, whereas training data only consists of normal data without any anomalies. The basic idea is, that a model of the normal class is learned and anomalies can be detected afterwards by deviating from that model. This idea is also known as “one-class” classification [23]. Well-known algorithms are One-class SVMs [24] and autoencoders [25]. Of course, in general any density estimation method can be used to model the probability density function of the normal classes, such as Gaussian Mixture Models [26] or Kernel Density Estimation [27].


  • Unsupervised Anomaly Detection is the most flexible setup which does not require any labels. Furthermore, there is also no distinction between a training and a test dataset. The idea is that an unsupervised anomaly detection algorithm scores the data solely based on intrinsic properties of the dataset. Typically, distances or densities are used to give an estimation what is normal and what is an outlier. This article only focuses on this unsupervised anomaly detection setup.


Anomaly Detection Algorithm Output

As an output of an anomaly detection algorithm, two possibilities exist. First, a label can be used as a result indicating whether an instance is an anomaly or not. Second, a score or confidence value can be a more informative result indicating the degree of abnormality. For supervised anomaly detection, often a label is used due to available classification algorithms. On the other hand, for semi-supervised and unsupervised anomaly detection algorithms, scores are more common. This is mainly due to the practical reasons, where applications often rank anomalies and only report the top anomalies to the user. In this work, we also use scores as output and rank the results such that the ranking can be used for performance evaluation. Of course, a ranking can be converted into a label using an appropriate threshold.

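To make the ranking step concrete, the following is a minimal sketch (not from the paper; the NumPy-based helper and its parameters are illustrative assumptions) of how a score ranking can be converted into binary labels, either via a fixed threshold or by reporting only the top-N anomalies:

```python
import numpy as np

def scores_to_labels(scores, threshold=None, top_n=None):
    """Convert anomaly scores into binary labels (1 = anomaly, 0 = normal).

    Either a fixed score threshold or a top-N cut on the ranking can be used,
    mirroring how applications typically report only the top anomalies.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.zeros(len(scores), dtype=int)
    if top_n is not None:
        labels[np.argsort(scores)[::-1][:top_n]] = 1  # highest scores first
    elif threshold is not None:
        labels[scores > threshold] = 1
    return labels

# Example: report only the three highest-scored instances as anomalies
print(scores_to_labels([0.2, 5.1, 0.4, 3.3, 0.1, 2.0], top_n=3))
```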

Types of Anomalies

The main idea of unsupervised anomaly detection algorithms is to detect data instances in a dataset which deviate from the norm. However, there are a variety of cases in practice where this basic assumption is ambiguous. Fig 2 illustrates some of these cases using a simple two-dimensional dataset. Two anomalies can be easily identified by eye: x1 and x2 are very different from the dense areas with respect to their attributes and are therefore called global anomalies. When looking at the dataset globally, x3 can be seen as a normal record since it is not too far away from the cluster c2. However, when we focus only on the cluster c2 and compare it with x3 while neglecting all the other instances, it can be seen as an anomaly. Therefore, x3 is called a local anomaly, since it is only anomalous when compared with its close-by neighborhood. It depends on the application whether local anomalies are of interest or not. Another interesting question is whether the instances of the cluster c3 should be seen as three anomalies or as a (small) regular cluster. This phenomenon is called a micro-cluster, and anomaly detection algorithms should assign its members scores larger than those of the normal instances, but smaller than those of the obvious anomalies. This simple example already illustrates that anomalies are not always obvious and that a score is much more useful than a binary label assignment.


Fig 2. A simple two-dimensional example illustrating the global anomalies x1 and x2, the local anomaly x3 and the micro-cluster c3.

Fortunately, it is still possible to utilize point anomaly detection algorithms to detect contextual and collective anomalies. In order to do so, the context itself can be included as a new feature. Concerning our simple example of the temperature measurement, a direct inclusion of the month as a second feature is easily possible. However, in more complex scenarios, one or more newly derived features might be required to transform the contextual anomaly detection task into a point anomaly detection problem. For addressing collective anomalies, correlation, aggregation and grouping is used to generate a new dataset with a different representation of the features [11]. This transformation from a collective anomaly detection task to a point anomaly detection task requires solid background knowledge of the dataset and often results in a point anomaly detection dataset whose features and instances are very different from the original raw data. This semantic transformation is also called the generation of a data view [28]. As we can conclude from these three different types of anomalies, not every anomaly detection task is suitable to be processed directly using an anomaly detection algorithm. In fact, many practical anomaly detection problems often require preprocessing in order to generate the appropriate data views. In this work, we also carefully selected the datasets to be point anomaly detection problems, such that no further preprocessing is necessary and we can directly compare the detection performance of the different algorithms.


Normalization

When a dataset has been preprocessed such that it represents a point anomaly detection problem, the final step before an unsupervised anomaly detection algorithm can be applied is normalization. Similar to the generation of the data view, normalization should also be performed taking background knowledge into account. Typical normalization methods are min-max normalization, where every feature is normalized into a common interval (for example [0, 1]), and standardizing, where each feature is transformed such that its mean is zero and its standard deviation is one. In practical applications, the min-max normalization is often used, as we do in the evaluation in this article. Please note that sometimes straightforward normalization can also be counterproductive. Let’s assume we have a categorical binary feature converted to [0, 1] and a numerical value measuring a length normalized to [0, 1]. Since the categorical binary feature results in distances being either one or zero, it has a much bigger influence on the result than the numerical value. This is the reason why background information is also important during normalization, to avoid these errors in the normalization process. Possible solutions to that challenge include using different intervals for the different semantic features, or, when categorical features come into play, using a weighted distance function [29].

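For illustration only (not the paper's preprocessing code), a minimal min-max normalization sketch in NumPy; the constant-column guard and the adjustable target interval are assumptions added for robustness:

```python
import numpy as np

def min_max_normalize(X, feature_range=(0.0, 1.0)):
    """Normalize every column of X into the given interval.

    Columns with zero range (constant features) are mapped to the lower bound
    to avoid division by zero.
    """
    X = np.asarray(X, dtype=float)
    lo, hi = feature_range
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0          # guard against constant columns
    return lo + (X - col_min) / col_range * (hi - lo)

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 1000.0]])
print(min_max_normalize(X))
```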


Related Work

It could already be inferred from the previous sections that this article primarily deals with multivariate tabular data. Differently structured data, such as graphs or sequential data, is often processed in machine learning using dedicated algorithms. This also holds true in anomaly detection and there exist many algorithms for detecting anomalies in graphs [30], in sequences and time series [31] and for addressing spatial data [32]. However, these specialized algorithms are not evaluated in this work, which focuses on tabular data.


Not many comparative studies on unsupervised anomaly detection exist today. On the one hand, authors of newly proposed algorithms compare their results with state-of-the-art techniques, for example LOF and k-NN, but often the datasets are not published and the evaluation lacks some other evaluation criteria (local vs. global behavior or the parameter k). On the other hand, some studies have been published referring to a specific application scenario, often with a single dataset only. Lazarevic et al. [33] compared LOF, k-NN, PCA and unsupervised SVM for intrusion detection using the KDD-Cup99 dataset. A similar study by Eskin et al. [34] evaluates clustering, k-NN as well as a one-class SVM on the same dataset. A broader study using six different methods for unsupervised anomaly detection was performed by NASA [14] for detecting engine failures of space shuttles. Unfortunately, the dataset is not available, and the algorithms used are, besides GMM and one-class SVMs, four commercially available software systems. Auslander et al. [35] applied k-NN, LOF and clustering on maritime video surveillance data. Ding et al. [36] studied SVDD, a k-NN classifier, k-means and a GMM for detecting anomalies in ten different datasets. Although the authors claim to evaluate unsupervised techniques, the use of a training phase indicates a semi-supervised setup to our understanding. Carrasquilla [37] published a study comparing different anomaly detection algorithms based on 10 different datasets. Unfortunately, only one unsupervised anomaly detection algorithm was applied, whereas its results were compared to other supervised anomaly detection algorithms. Some of the datasets used in that study are also used as a basis in our evaluation, but with an appropriate preprocessing. All related work concerning the particular algorithms used in this study can be found in the next section.


Besides studies evaluating a single algorithm only, outlier ensembles [38, 39] is a technique of combining multiple unsupervised anomaly detection algorithms in order to boost their joint anomaly detection performance. Since unsupervised anomaly detection does not rely on labeled data, this task is very challenging and often restricted to simple combinations. In this article we do not evaluate ensembles, although the results reported here might serve as a selection criteria for these algorithms in this currently evolving new research field.



Unsupervised Anomaly Detection Algorithms

Unsupervised anomaly detection algorithms can be roughly categorized into the following main groups [15], as illustrated in Fig 3: (1) nearest-neighbor based techniques, (2) clustering-based methods and (3) statistical algorithms. Recently, a new group based on (4) subspace techniques is also emerging. In this work, we cover all of these categories with a focus on nearest-neighbor and clustering-based anomaly detection, the by far most used categories in practice.


Furthermore, other algorithms exist, which are not direct members of these categories, often based on available classification algorithms such as neural networks [25] or support vector machines [40]. It is not an easy task to select a proper subset for this work keeping in mind that dozens of different algorithms have been proposed. However, our selection is based on practical applications in the past and attention in the scientific community. Since one goal of this work is also to standardize datasets, we also welcome other researchers to compare their proposed methods with the results presented here. In this section, we shortly introduce the algorithms and their main ideas, but due to the high number of algorithms, the interested reader is referred to the authors original publication for more details.


Fig 3. A taxonomy of unsupervised anomaly detection algorithms comprising four main groups. Note that CMGOS can be assigned to two groups: it is a clustering-based algorithm that also estimates a subspace for each cluster.

k-NN Global Anomaly Detection

The k-nearest-neighbor global unsupervised anomaly detection algorithm is a straightforward way for detecting anomalies and not to be confused with k-nearest neighbor classification. As the name already implies, it focuses on global anomalies and is not able to detect local anomalies. First, for every record in the dataset, the k-nearest-neighbors have to be found. Then, an anomaly score is computed using these neighbors, whereas two possibilities have been proposed: Either the distance to the kth-nearest-neighbor is used (a single one) [41] or the average distance to all of the k-nearest-neighbors [42] is computed. In the following, we refer to the first method as kth-NN and the latter as k-NN. In practical applications, the k-NN method is often preferred [13, 19]. However, the absolute value of the score depends very much on the dataset itself, the number of dimensions, and on normalization. As a result, it is in practice not easy to select an appropriate threshold, if required.


The choice of the parameter k is of course important for the results. If it is chosen too low, the density estimation for the records might be not reliable. On the other hand, if it is too large, density estimation may be too coarse. As a rule of thumb, k should be in the range 10 < k < 50. In classification, it is possible to determine a suitable k, for example by using cross-validation. Unfortunately, there is no such technique in unsupervised anomaly detection due to missing labels. For that reason, we use later in the evaluation many different values for k and average in order to get a fair evaluation when comparing algorithms.

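A minimal sketch of the two scoring variants (kth-NN and k-NN) using scikit-learn's nearest-neighbor search; the toy dataset and the choice of k = 10 are illustrative assumptions, not the paper's evaluation setup:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_anomaly_scores(X, k=10, method="mean"):
    """Global k-NN anomaly scores.

    method="max"  -> distance to the kth neighbor (kth-NN variant)
    method="mean" -> average distance to all k neighbors (k-NN variant)
    """
    # k + 1 neighbors because every point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    dist = dist[:, 1:]                    # drop the self-distance column
    return dist.max(axis=1) if method == "max" else dist.mean(axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[8.0, 8.0]]])  # one global outlier
print(knn_anomaly_scores(X, k=10).argmax())                 # index of the outlier
```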

In Fig 4 we illustrate by example how the result of an unsupervised anomaly detection algorithm (here: k-NN with k = 10) can be visualized. The plot was generated using a simple, artificially generated two-dimensional dataset with four Gaussian clusters and uniformly sampled anomalies. After applying the global k-NN, the outlier scores are visualized by the bubble size of the corresponding instance. The color indicates the label, whereas anomalies are red. It can be seen that k-NN cannot detect the anomalies close to the clusters well and assigns only small scores to them.


Fig 4. Visualization of the result of the k-NN global anomaly detection algorithm. The anomaly score is represented by the bubble size, whereas the color shows the label of the artificially generated dataset.

Local Outlier Factor (LOF)

The local outlier factor [43] is the most well-known local anomaly detection algorithm, and it also introduced the idea of local anomalies first. Today, its idea is carried out in many nearest-neighbor based algorithms, such as the ones described below. To calculate the LOF score, three steps have to be computed:


  • The k-nearest-neighbors have to be found for each record x. In case of distance tie of the kth neighbor, more than k neighbors are used.


  • Using these k-nearest-neighbors Nk, the local density for a record is estimated by computing the local reachability density (LRD); a sketch of the LRD and LOF formulas is given after this list:


    whereas dk(·) is the reachability distance. Except for some very rare situations in highly dense clusters, this is the Euclidean distance.

  • Finally, the LOF score is computed by comparing the LRD of a record with the LRDs of its k neighbors:


    The LOF score is thus basically a ratio of local densities. This results in the nice property of LOF that normal instances, whose densities are as big as the densities of their neighbors, get a score of about 1.0. Anomalies, which have a low local density, will result in larger scores. At this point it is also clear why this algorithm is local: It only relies on its direct neighborhood, and the score is a ratio mainly based on the k neighbors only. Of course, global anomalies can also be detected since they also have a low LRD when compared with their neighbors. It is important to note that in anomaly detection tasks where local anomalies are not of interest, this algorithm will generate a lot of false alarms, as we found out during our evaluation. Again, the setting of k is crucial for this algorithm. Besides trying out different values for k, the authors of the algorithm suggested to use an ensemble strategy for computing the LOF [43]. Here, scores for different k’s up to an upper bound are computed and then the maximum of these scores is taken. Besides computing the LOF score for a single k, we also take this strategy into account in our evaluation, referring to it as LOF-UB (upper bound). For comparison reasons, we also use different upper bounds and average the results again.

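The two equations referenced in the list above are not reproduced in this rendering; the following is a sketch of them in LaTeX, following the definitions of the original LOF paper [43], with $N_k(x)$ the set of k-nearest-neighbors of $x$ and $d_k(x, o)$ the reachability distance:

$$\mathrm{lrd}_k(x) = \left( \frac{1}{|N_k(x)|} \sum_{o \in N_k(x)} d_k(x, o) \right)^{-1}, \qquad \mathrm{LOF}_k(x) = \frac{1}{|N_k(x)|} \sum_{o \in N_k(x)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(x)}$$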

Connectivity-Based Outlier Factor (COF)

The connectivity-based outlier factor [44] is similar to LOF, but the density estimation for the records is performed differently. In LOF, the k-nearest-neighbors are selected based on the Euclidean distance. This indirectly assumes that the data is distributed in a spherical way around the instance. If this assumption is violated, for example if features have a direct linear correlation, the density estimation is incorrect. COF compensates this shortcoming and estimates the local density of the neighborhood using a shortest-path approach, called the chaining distance. Mathematically, this chaining distance is the minimum of the sum of all distances connecting all k neighbors and the instance. For simple examples, where features are obviously correlated, this density estimation approach performs much more accurately [29]. Fig 5 shows the outcome for LOF and COF in direct comparison for a simple two-dimensional dataset, where the attributes have a linear dependency. It can be seen that the spherical density estimation of LOF cannot detect the outlier, but COF succeeds by connecting the normal records with each other for estimating the local density.


Fig 5. Comparing COF (top) and LOF (bottom) using a simple dataset in which the two attributes are linearly correlated. It can be seen that the spherical density estimation of LOF fails to identify the anomaly, whereas COF detects the non-linear anomaly (k = 4).

Influenced Outlierness (INFLO)

When a dataset contains clusters with different densities and they are close to each other, it can be shown that LOF fails to score the instances at the borders of the clusters correctly. The influenced outlierness (INFLO) [45] algorithm therefore uses, besides the k-nearest-neighbors, a reverse nearest neighborhood set, which stores the records for which the current record is a neighbor. For computing the INFLO score, both neighborhood sets are combined. Then, the local density of this set and the score are computed the same way as for LOF. This procedure is illustrated in Fig 6, where for the red instance the 6-nearest-neighbors reside in the gray area. This red instance will clearly be detected as an anomaly by LOF, since five of its neighbors have a much higher local density. For INFLO, the instances for which the red instance is a neighbor (the blue instances) are also taken into account. Using this extended set, the red instance is less likely to be detected as an anomaly by INFLO. Please note that the set of k-nearest-neighbors typically contains k instances (with the exception of ties), whereas the reverse nearest neighborhood set may contain any amount. Depending on the data, it might contain no instance, exactly k, or even more instances. When using this strategy, it is possible to compute more accurate anomaly scores when clusters of different densities are close to each other.

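As a small illustration of the reverse neighborhood idea (only the set construction, not the full INFLO score; the helper below is an assumption sketched with scikit-learn's neighbor search):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def reverse_knn_sets(X, k=6):
    """For every instance, collect the indices of the instances that count it
    among their own k nearest neighbors (the reverse nearest neighborhood set).
    Its size may be zero, exactly k, or larger, depending on the data."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)          # first column is the instance itself
    reverse = [[] for _ in range(len(X))]
    for i, neighbors in enumerate(idx[:, 1:]):
        for j in neighbors:
            reverse[j].append(i)
    return reverse
```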

Fig 6. Comparing LOF and INFLO shows the usefulness of the reverse neighborhood set. For the red instance, LOF only considers the neighbors in the gray area, resulting in a high anomaly score. INFLO additionally takes the blue instances (the reverse neighbors) into account, such that the red instance is scored more normal.

Local Outlier Probability (LoOP)

Until now, all presented algorithms output anomaly scores, which are more handy than binary labels. When comparing the global k-NN algorithm and LOF, the property of LOF of having a reference point for normal instances seems even better than the arbitrary score of k-NN. Unfortunately, it is still not clear for LOF above which score threshold an instance should clearly be considered an anomaly. The local outlier probability (LoOP) [46] tries to address this issue by outputting an anomaly probability instead of a score, which might also result in a better comparison of anomalous records between different datasets.


Similar to the previous local algorithms, LoOP also uses a neighborhood set for local density estimation. In contrast to other algorithms, it computes this density differently: The basic assumption is that the distances to the nearest neighbors follow a Gaussian distribution. Since distances are always positive, LoOP assumes a “half-Gaussian” distribution and uses its standard deviation, called the probabilistic set distance. It is used (similar to LOF) as a local density estimation; the ratios of each instance compared to its neighbors result in a local anomaly detection score. For converting this score into a probability, a normalization and a Gaussian error function are finally applied. The idea of having a probabilistic output instead of a score is very useful. However, some critical thoughts should arise in this context [29]. For example, if the algorithm assigns a 100% probability to an instance, what would happen if we add another instance to the dataset which is more anomalous than that? As we can see from this simple example, probabilities are still relative to the data and might not differ too much from a normalized score.


Local Correlation Integral (LOCI)

For all of the above algorithms, choosing k is a crucial decision for detection performance. As already mentioned, there is no way of estimating a good k based on the data. Nevertheless, the local correlation integral (LOCI) [47] algorithm addresses this issue by using a maximization approach. The basic idea is that all possible values of k are used for each record and finally the maximum score is taken. To achieve this goal, LOCI defines the r-neighborhood by using a radius r, which is expanded over time. Similar to LoOP, the local density is also estimated by using a half-Gaussian distribution, but here the amount of records in the neighborhood is used instead of the distances. Also, the local density estimation is different in LOCI: It compares two differently sized neighborhoods instead of the ratio of the local densities. A parameter α controls the ratio of the different neighborhoods. Removing the critical parameter k comes at a price. Typically, nearest-neighbor based anomaly detection algorithms have a computational complexity of O(n²) for finding the nearest neighbors. Since in LOCI the radius r additionally needs to be expanded from one instance to the furthest, the complexity increases to O(n³), which makes LOCI too slow for larger datasets.


Approximate Local Correlation Integral (aLOCI)

The authors of LOCI were aware of the long runtime and proposed aLOCI [48], a faster but approximate version of LOCI. aLOCI uses quad trees to speed up the counting of the two neighborhoods using some constraints for α. If a record is in the center of a cell of such a quad tree, the counting estimation is good, but if it is at the border, the approximation might be bad. For that reason, multiple (g) quad trees are constructed with the hope that there is a good approximative tree for every instance. Furthermore, the tree depth (L) needs to be specified. The authors claim that the total complexity of their algorithm, comprising tree creation and scoring, is O(NLdg + NL(dg + 2^d)), where d is the number of dimensions. As typical for tree approaches, it can be seen that the number of dimensions has a very negative impact on the runtime. During our evaluation, we experienced very different results from aLOCI. Sometimes results seemed reasonable and sometimes they showed a very poor anomaly detection performance. This observation was tracked down to the tree creation process. For a perfect estimation, N trees are required. Since the trick of this algorithm is to use only g trees, this also turned out to be a weak point: If, by chance, the trees represented the normal instances well, many approximations were correct and thus so was the output of the algorithm. On the other hand, if the trees did not represent the majority of the data well, the anomaly detection performance was unacceptable.


Cluster-Based Local Outlier Factor (CBLOF/ uCBLOF)

All previous anomaly detection algorithms are based on density estimation using nearest-neighbors. In contrast, the cluster-based local outlier factor (CBLOF) [49] uses clustering in order to determine dense areas in the data and performs a density estimation for each cluster afterwards. In theory, every clustering algorithm can be used to cluster the data in a first step. However, in practice k-means is commonly used to take advantage of its low computational complexity, which is linear compared to the quadratic complexity of the nearest-neighbor search. After clustering, CBLOF uses a heuristic to classify the resulting clusters into large and small clusters. Finally, an anomaly score is computed by the distance of each instance to its cluster center, multiplied by the number of instances belonging to its cluster. For small clusters, the distance to the closest large cluster is used. The procedure of using the amount of cluster members as a scaling factor should estimate the local density of the clusters, as stated by the authors. We showed in previous work that this assumption is not true [50] and might even result in an incorrect density estimation. Therefore, we additionally evaluate a modified version of CBLOF which simply neglects the weighting, referred to as unweighted-CBLOF (uCBLOF) in the following. The results of uCBLOF using a simple two-dimensional dataset are visualized in Fig 7, where the color corresponds to the clustering result of the preceding k-means clustering algorithm. Similar to the nearest-neighbor based algorithms, the number of initial clusters k is also a critical parameter. Here, we follow the same strategy as for the nearest-neighbor based algorithms and evaluate many different k values. Furthermore, k-means clustering is a non-deterministic algorithm and thus the resulting anomaly scores can be different on multiple runs. To this end, we follow a common strategy, which is to apply k-means many times on the data and pick the most stable result. However, clustering-based anomaly detection algorithms are very sensitive to the parameter k, since adding just a single additional centroid might lead to a very different outcome.

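A minimal sketch of the unweighted variant (uCBLOF) with scikit-learn's k-means; the small/large cluster heuristic of the original CBLOF and the stability selection over multiple runs are omitted here, and the parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def ucblof_scores(X, n_clusters=4, random_state=0):
    """uCBLOF sketch: anomaly score = distance of each instance to its assigned
    k-means centroid, without the cluster-size weighting of the original CBLOF."""
    X = np.asarray(X, dtype=float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    assigned_centers = km.cluster_centers_[km.labels_]
    return np.linalg.norm(X - assigned_centers, axis=1)
```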

Fig 7. Visualization of the result of the uCBLOF algorithm. The anomaly score is represented by the bubble size, whereas the color corresponds to the clustering result of the preceding k-means clustering algorithm. The local anomaly clearly cannot be detected using uCBLOF.

Local Density Cluster-based Outlier Factor (LDCOF)

As already pointed out, the local density estimation of CBLOF using only the amount of cluster members is controversial. Our extension uCBLOF is in fact not a local anomaly detection method any more since the density of the cluster is completely neglected. The local density cluster-based outlier factor LDCOF [50] addresses this shortcoming by estimating the clusters’ densities assuming a spherical distribution of the cluster members. Similar to CBLOF, k-means clustering is performed first as well as the following procedure of separating the clusters into small and large clusters. Then, for each cluster, the average distance of all cluster members to the centroid is calculated. Finally, the LDCOF score is computed by dividing the distance of an instance to its cluster center by the average distance. The result of this procedure is that the LDCOF score is a local score with respect to the possibly varying cluster densities. One advantage of LDCOF is also that the score has some relative reference point, similar to LOF: A score of 1.0 or below will be assigned to normal instances.

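Extending the previous sketch, a minimal LDCOF variant that divides each distance by the average distance of its cluster's members, so that scores around 1.0 or below indicate normal instances (again, the small/large cluster handling is omitted and the parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def ldcof_scores(X, n_clusters=4, random_state=0):
    """LDCOF sketch: distance to the assigned centroid divided by the average
    member distance of that cluster (a spherical local density estimate)."""
    X = np.asarray(X, dtype=float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    avg = np.array([dist[km.labels_ == c].mean() for c in range(n_clusters)])
    return dist / avg[km.labels_]
```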

Clustering-based Multivariate Gaussian Outlier Score (CMGOS)

The clustering-based multivariate Gaussian outlier score is another enhancement of cluster-based anomaly detection [29]. In CMGOS, the local density estimation is performed by estimating a multivariate Gaussian model, whereas the Mahalanobis distance [51] serves as a basis for computing the anomaly score. As in the previously introduced algorithms, a k-means clustering and the separation into small and large clusters is performed first. Then, for each cluster, the covariance matrix Σ is computed robustly. Finally, the CMGOS score is computed by dividing the Mahalanobis distance of an instance to its nearest cluster center by the chi-squared distribution with a certain confidence interval. The latter serves again as a normalization step such that outlier scores of 1.0 or below indicate a high probability of the instance being normal. Due to the nature of the Mahalanobis distance, scores of outliers increase quickly, such that in practical applications extraordinarily large scores can be observed (compared to other methods). For the estimation of the covariance matrix, robustness to outliers is essential since outliers are known to have a significant impact on the variance. In total, three different estimation methods are proposed: (1) Reduction. Here, the covariance is computed twice. After a first iteration, outliers are removed and the covariance is computed again. This procedure is similar to the well-known univariate Grubbs test [1]. (2) Regularization. This approach follows an idea from classification [52], where the covariance matrix is a weighted sum of the covariance matrix of the cluster and the global covariance matrix. (3) Minimum Covariance Determinant (MCD) [53]. This computationally intensive approach follows the idea of estimating a compact covariance matrix by a brute-force search for normal records, which is done by minimizing the determinant. Our evaluation is based on an approximative alternative [54], since the brute-force search is not suitable for large datasets. In our evaluation, we refer to the methods as CMGOS-Red, CMGOS-Reg and CMGOS-MCD, respectively.

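A rough sketch of the reduction variant (CMGOS-Red style): per-cluster covariance, one outlier-removal pass, and Mahalanobis distances normalized by a chi-squared quantile. The small/large cluster handling and the regularization/MCD variants are omitted; the quantile and cluster count are illustrative assumptions, and each cluster is assumed to contain clearly more members than dimensions:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import chi2

def cmgos_red_scores_sketch(X, n_clusters=3, quantile=0.975, random_state=0):
    """CMGOS-Red style sketch: per-cluster Gaussian density estimation robustified
    by a single reduction pass, scored as a normalized squared Mahalanobis distance
    (values around 1.0 or below indicate likely normal instances)."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    cutoff = chi2.ppf(quantile, df=d)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    scores = np.zeros(len(X))

    def mahalanobis_sq(points, mean, cov_inv):
        diff = points - mean
        return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

    for c in range(n_clusters):
        members = X[km.labels_ == c]
        mean, cov_inv = members.mean(axis=0), np.linalg.pinv(np.cov(members, rowvar=False))
        md2 = mahalanobis_sq(members, mean, cov_inv)
        kept = members[md2 <= cutoff]                 # reduction: drop likely outliers
        if len(kept) > d:                             # re-estimate without them
            mean, cov_inv = kept.mean(axis=0), np.linalg.pinv(np.cov(kept, rowvar=False))
            md2 = mahalanobis_sq(members, mean, cov_inv)
        scores[km.labels_ == c] = md2 / cutoff
    return scores
```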

Histogram-based Outlier Score (HBOS)

The histogram-based outlier score [55] is a simple statistical anomaly detection algorithm assuming independence of the features. The basic idea is that for each feature of the dataset, a histogram is created. Then, for each instance in the dataset, the inverse heights of the bins in which it resides (representing the density estimation) are multiplied over all features. The idea is very similar to the Naive Bayes algorithm in classification, where all independent feature probabilities are multiplied. The idea of using histograms for fast semi-supervised anomaly detection is also very popular in intrusion detection, where a histogram of normal data is learned [56]. At first sight, it might seem a bit counterproductive to neglect the dependencies among features. However, this comes with a big advantage, which is processing speed. HBOS can process a dataset in under a minute, whereas nearest-neighbor based computations take over 23 hours [29]. As a further remark, the drawback of assuming feature independence becomes less severe when a dataset has a high number of dimensions, due to the larger sparsity. As a critical parameter, the number of bins k needs to be defined. Furthermore, HBOS allows two different histogram creation modes: (1) static bins with a fixed bin width and (2) dynamic bins such that the number of instances per bin is approximately the same. The latter results in a histogram with different bin widths, but it can still be used for density estimation using the area of a bin. The advantage of the second histogram creation technique is that the density estimation is more robust in case of large outlying values. In our evaluation, we use this second “dynamic bin width” mode as well as different settings for k.

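A minimal HBOS sketch with static, equal-width bins (the paper's evaluation uses the dynamic-width mode, which is not reproduced here); summing log inverse densities is equivalent to multiplying inverse bin heights, and the bin count and the small density floor are illustrative assumptions:

```python
import numpy as np

def hbos_scores(X, n_bins=10):
    """HBOS sketch with static bins: features are treated as independent and the
    log of the inverse bin density is summed over all features per instance."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        hist, edges = np.histogram(X[:, j], bins=n_bins, density=True)
        # map each value to its bin index (interior edges give indices 0..n_bins-1)
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        density = np.maximum(hist[idx], 1e-12)   # avoid log(0) for empty bins
        scores += -np.log(density)               # product of inverse heights, in log space
    return scores
```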

One-Class Support Vector Machine

One-class support vector machines [24] are often used for semi-supervised anomaly detection [15]. In this setting, a one-class SVM is trained on anomaly-free data and later, the SVM classifies anomalies and normal data in the test set. One-class SVMs intend to separate the origin from the data instances in the kernel space, which results in some kind of complex hulls describing the normal data in the feature space. Although one-class SVMs are heavily used as a semi-supervised anomaly detection method, they are an unsupervised algorithm by design when using a soft margin. In particular, it has been shown that they converge to the true density level set [57]. In the unsupervised anomaly detection scenario, the one-class SVM is trained using the dataset and afterwards, each instance in the dataset is scored by a normalized distance to the determined decision boundary [40]. The parameter ν needs to be set to a value larger than zero such that the contained anomalies are correctly handled by a soft margin. Additionally, one-class SVMs have been modified such that they include further robust techniques for explicitly dealing with outliers during training [40]. The basic idea is that anomalies contribute less to the decision boundary than normal instances. Two different techniques were developed, whereas the η one-class SVM showed superior results. In this enhancement, η is a further optimization objective during training, which estimates the normality of an instance. In our evaluation we use both unsupervised algorithms, the regular one-class SVM as well as the extended η one-class SVM.

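A minimal sketch of the unsupervised usage with scikit-learn's OneClassSVM: the model is fit on the full, unlabeled dataset and every instance is then scored by its signed distance to the decision boundary. The toy data and the nu and gamma values are illustrative assumptions (the paper additionally tunes gamma and evaluates the η variant, which is not available here):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 2)), rng.uniform(-6, 6, (10, 2))])

# Fit on the full dataset; a nu > 0 allows the contained anomalies to fall
# outside the soft margin instead of distorting the boundary.
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.1).fit(X)
scores = -ocsvm.decision_function(X)   # larger = more anomalous
print(np.argsort(scores)[-10:])        # indices of the ten highest-scored points
```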

Robust Principal Component Analysis (rPCA)

Principal component analysis is a commonly used technique for detecting subspaces in datasets. It may also serve as an anomaly detection technique, such that deviations from the normal subspaces may indicate anomalous instances. The principal components are the eigenvectors of the covariance matrix and thus their computation shares the same difficulties as CMGOS: Anomalies have a big influence on the covariance matrix and the density estimation might be incorrect. Almost identical to the reduction technique of CMGOS, a robust version was proposed [58], which also computes the covariance matrix twice based on the Mahalanobis distance. Once the principal components are determined, the question is which components should be used to score anomalous instances. Using the major components shows global deviations from the majority of the data, whereas using minor components can indicate smaller local deviations. In our evaluation we follow different strategies, namely using all components, using major components only, using minor components only and finally using major and minor components while neglecting the ones in the middle [59]. Please note that there is a strong connection between rPCA and CMGOS-Red: When rPCA takes all components into consideration, the method is basically the same as applying CMGOS with the setting k = 1.

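As a simplified illustration of component-based scoring (without the robust two-pass covariance estimation of rPCA), a sketch that sums the eigenvalue-normalized squared projections over a selectable subset of components; the half/half split into "major" and "minor" components is an illustrative assumption:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_component_scores(X, components="all"):
    """Non-robust PCA anomaly score sketch: squared coordinates in PCA space,
    each divided by the component's variance (eigenvalue), summed over the
    selected components. Using all components equals the squared Mahalanobis
    distance to the data mean."""
    X = np.asarray(X, dtype=float)
    pca = PCA().fit(X)
    proj = pca.transform(X)                          # centered PCA coordinates
    contrib = proj**2 / pca.explained_variance_      # per-component contribution
    d = contrib.shape[1]
    if components == "major":
        contrib = contrib[:, : max(1, d // 2)]
    elif components == "minor":
        contrib = contrib[:, d // 2:]
    return contrib.sum(axis=1)
```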

Reviewing Algorithm Complexities and Implementation

Concerning the nearest-neighbor based algorithms, with the exception of LOCI, the computational complexity of finding the nearest-neighbors is O(n²). The remaining computations, for example the density or LOF calculations, can be neglected in practice (less than 1% of the runtime). Thus, all of these algorithms perform similarly in terms of runtime. In our implementation, we used many optimizations so that the algorithms still perform well on large-scale datasets. In particular, duplicates are first removed from the data and a weight matrix is stored. This procedure might reduce the dataset size and thus save computation time. Memory consumption was reduced by storing only the top-k neighbors during the search. Additionally, a smart parallelization technique was implemented depending on the number of dimensions. More information about these enhancements can be found in [50]. For LOCI, the computational complexity is O(n³) and the memory complexity is O(n²), which makes the algorithm practically too demanding for real-world applications. aLOCI, on the contrary, is faster and its runtime depends on the number of quad trees to be used.

关于基于最近邻的算法(LOCI除外),寻找最近邻的计算复杂度是O(n²)。剩余的计算,例如密度或LOF计算,在实践中可以忽略不计(运行时间的1%以下)。因此,所有这些算法在运行时间方面表现相似。在我们的实现中,我们使用了许多优化,因此算法在大规模数据集上仍然表现良好。特别地,首先从数据中移除重复项,并存储一个权重矩阵。此过程可能会减小数据集的大小,从而节省计算时间。通过在搜索过程中只存储前k个邻居,减少了内存消耗。此外,根据维数实现了智能并行化技术。有关增强功能的更多信息,请参见[50]。对于LOCI,计算复杂度为O(n³),内存复杂度为O(n²),这使得该算法实际上对现实世界的应用要求太高。相反,aLOCI更快,其运行时间取决于所使用的四叉树的数量。
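
For reference, a compact global k-NN anomaly scorer can look like the sketch below. It delegates the neighbor search to scikit-learn's `NearestNeighbors` index and only keeps the top-k neighbor distances, in the spirit of the memory optimization mentioned above; the RapidMiner plug-in uses its own optimized implementation, so this is only an assumed stand-in.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_anomaly_scores(X, k=10, aggregate="mean"):
    """Global k-NN anomaly score: the (mean or maximum) distance to the k
    nearest neighbors. Only the top-k neighbor distances are kept per point,
    mirroring the memory optimization mentioned above; the neighbor search
    itself is delegated to scikit-learn's index structure."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: the query point is its own neighbor
    dist, _ = nn.kneighbors(X)
    dist = dist[:, 1:]                               # drop the zero self-distance
    return dist.mean(axis=1) if aggregate == "mean" else dist[:, -1]

X = np.random.RandomState(2).randn(1000, 8)
scores = knn_anomaly_scores(X, k=10)
print(np.argsort(scores)[-5:])  # indices of the five highest-scoring points
```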

Concerning the cluster-based methods (except for CMGOS-MCD), the main computational cost is due to the clustering algorithm, which is typically faster than O(n²) if k-means is applied. In practice, when k-means is run several times in order to get a stable clustering result, the runtime advantage is reduced but still present. For large-scale datasets and big data, clustering-based methods thus have a runtime advantage compared to nearest-neighbor based methods. However, CMGOS-MCD is an exception here since the MCD computation is again quadratic for each cluster. Furthermore, as already mentioned, HBOS is even faster than the clustering-based algorithms and a good candidate for near real-time large-scale applications.

关于基于聚类的方法(CMGOS-MCD除外),主要的计算开销来自聚类算法,如果应用k-means,其复杂度通常低于O(n²)。在实践中,当为了得到稳定的聚类结果而多次运行k-means时,这一运行时间优势会减小,但仍然存在。因此对于大规模数据集和大数据,基于聚类的方法相比基于最近邻的方法具有运行时间优势。然而,CMGOS-MCD在这里是一个例外,因为对于每个聚类,MCD计算也是二次的。此外,如前所述,HBOS甚至比基于聚类的算法更快,是近实时大规模应用的良好候选。
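
The runtime argument can be made concrete with a minimal sketch of the clustering-based scoring idea: run k-means once and score each instance by its distance to the assigned centroid, which is close in spirit to uCBLOF. The actual algorithms additionally distinguish small and large clusters and, in the case of CMGOS, estimate per-cluster covariance matrices; both are omitted in this assumed simplification.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_distance_scores(X, n_clusters=10, random_state=0):
    """Sketch of clustering-based scoring: run k-means once and use each
    instance's distance to its assigned centroid as the anomaly score.
    uCBLOF/LDCOF additionally distinguish small and large clusters, and CMGOS
    estimates per-cluster covariance matrices; both are omitted here."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    assigned_centroids = km.cluster_centers_[km.labels_]
    return np.linalg.norm(X - assigned_centroids, axis=1)

X = np.random.RandomState(3).randn(5000, 6)
print(cluster_distance_scores(X)[:5])
```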

The complexity of the one-class SVM based algorithms is hard to determine since it depends on the number of support vectors and thus on the structure of the data. Furthermore, the applied gamma tuning of the SVMs has a huge impact on the runtime since its computation has quadratic complexity. Lastly, the complexity of rPCA is O(d²n + d³) and therefore depends heavily on the number of dimensions. If the number of dimensions is small, the algorithm is in practice among the fastest algorithms in our trials. The source code of the algorithms and our optimizations is published as an open-source plug-in (available at http://git.io/vnD6n) for the RapidMiner [60] data mining software.

基于一类SVM的算法的复杂度很难确定,因为它取决于支持向量的数量,因而取决于数据的结构。此外,所应用的SVM伽马调参对运行时间有巨大影响,因为其计算具有二次复杂度。最后,rPCA的复杂度是O(d²n + d³),因此在很大程度上取决于维数。如果维数很小,该算法在我们的试验中实际上属于最快的算法之一。算法的源代码和我们的优化已作为RapidMiner[60]数据挖掘软件的开源插件发布(http://git.io/vnD6n)。

Datasets for Benchmarking

Although unsupervised anomaly detection does not utilize any label information in practice, labels are needed for evaluation and comparison. When new algorithms are proposed, it is common practice that an available public classification dataset is modified and the method is compared with the best-known algorithms such as k-NN and LOF. There is a set of typically used classification datasets, which are retrieved from the UCI machine learning repository [61]. The typical preprocessing comprises selecting one class as the anomalous class and randomly sub-sampling a small number of instances from it. Unfortunately, the resulting datasets are rarely published and cannot be regenerated by other scientists. Since the number of anomalies is typically very low, a different subset might result in very different detection scores. In fact, we only found three different datasets available online [62, 63]. With this work we want to compensate for this shortcoming and present and publish more meaningful datasets, such that algorithms are better comparable in the future. Please note that some of the datasets have already been introduced previously in the first author's Ph.D. thesis [29]. We also put a strong emphasis on a semantic background such that the evaluation makes sense. This includes the two assumptions that anomalies (1) are rare and (2) are different from the norm. Additionally, we consider only point anomaly detection tasks as meaningful datasets for benchmarking, since a different preprocessing might again lead to non-comparable results. For example, the dataset Poker Hand, which was used for evaluation before [37], is not used because the anomalies (special winning card decks) violate assumption (2). Similarly, the very popular KDD-Cup99 dataset, which was originally used for benchmarking intrusion detection classification systems, needs special attention. Many attacks (anomalies) in the dataset define a collective anomaly detection problem and can thus not be used. Since the dataset is so popular, a point anomaly detection task was extracted as stated below.

尽管无监督异常检测在实践中不利用任何标签信息,但评估和比较需要标签。当提出新的算法时,通常的做法是修改可用的公共分类数据集,并将该方法与诸如k-NN和LOF等最知名的算法进行比较。有一组典型的分类数据集,它们来自UCI机器学习存储库[61]。典型的预处理包括选择一个类作为异常类,并从中随机抽取少量实例。不幸的是,由此产生的数据集很少公布,也无法由其他科学家重新生成。由于异常的数量通常很少,不同的子集可能导致非常不同的检测分数。事实上,我们只在网上找到了三个不同的数据集[62,63]。通过这项工作,我们希望弥补这一缺点,提出并发布更多有意义的数据集,以便算法在未来具有更好的可比性。请注意,一些数据集已经在第一作者的博士论文[29]中介绍过了。我们也非常强调语义背景,这样评估才有意义。这包括两个假设,即异常(1)很少,(2)不同于正常。此外,我们仅将点异常检测任务视为对基准测试有意义的数据集,因为不同的预处理可能会再次导致不可比较的结果。例如,之前用于评估[37]的数据集Poker Hand没有被使用,因为异常(特殊的获胜牌组)违反了假设(2)。同样,非常流行的KDD-Cup99数据集最初用于入侵检测分类系统的基准测试,其使用需要特别注意。数据集中的许多攻击(异常)定义了一个集体异常检测问题,因此不能使用。由于该数据集非常流行,我们如下所述提取了一个点异常检测任务。
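
The common preprocessing pattern described above can be captured in a small, hypothetical helper: keep the selected classes as normal data and sub-sample a few instances of another class as anomalies. Publishing the random seed (or the chosen indices) is what makes such a dataset reproducible for other researchers; the function name and parameters below are illustrative, not part of the published plug-in.

```python
import numpy as np

def make_anomaly_dataset(X, y, normal_classes, anomaly_class, n_anomalies=10, seed=0):
    """Hypothetical helper for the common preprocessing pattern: keep the
    selected classes as normal data and sub-sample a few instances of another
    class as anomalies. Fixing the seed makes the dataset reproducible."""
    rng = np.random.RandomState(seed)
    normal_mask = np.isin(y, normal_classes)
    candidate_idx = np.flatnonzero(y == anomaly_class)
    anomaly_idx = rng.choice(candidate_idx, size=n_anomalies, replace=False)
    X_new = np.vstack([X[normal_mask], X[anomaly_idx]])
    labels = np.hstack([np.zeros(normal_mask.sum(), dtype=int),
                        np.ones(n_anomalies, dtype=int)])  # 1 marks an anomaly
    return X_new, labels
```

The concrete class choices and sample sizes used for the datasets below are given in the respective paragraphs.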

In the following, we describe the datasets and our preprocessing in more detail. All modifications have been made publicly available (http://dx.doi.org/10.7910/DVN/OPQMVF). A summary of the resulting dataset characteristics is given in Table 1 below.

下面,我们更详细地描述数据集和我们的预处理。所有修改均已公开。下表1总结了结果数据集的特征。

表1。用于对来自不同应用领域的无监督异常检测算法进行比较评估的10个数据集。它们涵盖了大范围的数据集大小、维度和异常百分比,难度也各不相同,涵盖了局部和全局异常检测任务。

Breast Cancer Wisconsin (Diagnostic) The features of the breast-cancer dataset are extracted from medical images of a fine needle aspirate (FNA) describing the cell nuclei [64]. The task of the UCI dataset is to separate cancer from healthy patients. From the 212 malignant instances, we kept the first 10 as anomalies (similar to [46]). This results in an unsupervised anomaly detection dataset containing 367 instances in total and having 2.72% anomalies.

威斯康星乳腺癌(诊断)。乳腺癌数据集的特征是从描述细胞核的细针抽吸(FNA)医学图像中提取的[64]。UCI数据集的任务是将癌症患者与健康患者分开。从212个恶性实例中,我们保留了前10个作为异常(类似于[46])。由此得到的无监督异常检测数据集总共包含367个实例,异常占2.72%。

Pen-Based Recognition of Handwritten Text (global) This UCI dataset contains the handwritten digits 0–9 of 45 different writers, and we will use it twice. Here, in the “global” task, we only keep the digit 8 as the normal class and sample 10 instances from each of the other classes as anomalies. This results in one big normal cluster and sparsely distributed global anomalies. The resulting pen-global dataset has 16 dimensions and 809 instances including a large amount of anomalies (11.1%).

手写文本识别(全局)。该UCI数据集包含45个不同书写者的手写数字0–9,我们将使用它两次。在“全局”任务中,我们只保留数字8作为正常类,并从其他每个类中各抽取10个实例作为异常。这导致一个大的正常簇和稀疏分布的全局异常。生成的pen-global数据集有16个维度和809个实例,包括大量异常(11.1%)。

Pen-Based Recognition of Handwritten Text (local) The previous dataset is used again, but now with a different preprocessing. All digits are kept, except for the digit 4. From this class, the first 10 instances are kept (similar to [46]). This results in a local anomaly detection task with clusters of different densities and 10 local anomalies, which we refer to as pen-local.

手写文本识别(局部)。再次使用该数据集,但采用不同的预处理。除数字4外,所有数字都保留;从数字4这一类中保留前10个实例(类似于[46])。这产生了一个具有不同密度的簇和10个局部异常的局部异常检测任务,我们称之为pen-local。

Letter Recognition The UCI letter dataset originally contains 16 features extracted from the 26 letters of the English alphabet. This dataset has been preprocessed for unsupervised anomaly detection and was made publicly available [62]. Three letters have been chosen to form the normal class and anomalies have been sampled from the rest, which should result in a global anomaly detection task. The authors claim that the task is easy and they applied a procedure to make it more challenging: The number of dimensions was doubled to 32 by randomly concatenating normal instances of the three classes to all instances, including anomalies. As a result, anomalies also have some normal features, making unsupervised anomaly detection more difficult. The resulting dataset has 1,600 instances including 6.25% anomalies.

字母识别。UCI字母数据集最初包含从英语字母表的26个字母中提取的16个特征。该数据集已经针对无监督异常检测进行了预处理,并已公开[62]。其中选择了三个字母构成正常类,并从其余字母中采样异常,这应产生一个全局异常检测任务。作者声称这项任务很容易,因此他们应用了一个程序使其更具挑战性:通过将三个正常类的实例随机连接到所有实例(包括异常),维度数量增加了一倍,达到32。这导致异常也具有一些正常特征,使得无监督异常检测更加困难。结果数据集有1600个实例,包括6.25%的异常。

Speech Accent Data The speech dataset was also provided by [62] and contains real-world data from recorded English speech. The normal class contains data from persons having an American accent, whereas the outliers stem from seven other speakers with different accents. The feature vector is the i-vector of the speech segment, which is a state-of-the-art feature in speaker recognition [65]. The dataset has 400 dimensions and is thus the task with the largest number of dimensions in our evaluation. It has 3,686 instances including 1.65% anomalies.

语音重音数据。语音数据集也由[62]提供,包含来自录制英语语音的真实数据。正常类包含有美国口音的人的数据,而异常值来自其他七个有不同口音的说话者。特征向量是语音段的i-vector,这是说话人识别中的一种最先进的特征[65]。该数据集有400个维度,因此是我们评估中维度数量最多的任务。它有3686个实例,包括1.65%的异常。

Landsat Satellite The satellite dataset comprises features extracted from satellite observations. In particular, each image was taken at four different light wavelengths, two in visible light (green and red) and two in the infrared. The task of the original dataset is to classify the image into the soil category of the observed region. We defined the soil classes “red soil”, “gray soil”, “damp gray soil” and “very damp gray soil” as the normal class. Anomalies are sampled from the semantically different classes “cotton crop” and “soil with vegetation stubble”. After merging the original training and test set into a single dataset, the resulting dataset contains 5,025 normal instances as well as 75 randomly sampled anomalies (1.49%) with 36 dimensions.

陆地卫星。卫星数据集由从卫星观测中提取的特征组成。特别是,每幅图像都是在四种不同的光波长下拍摄的,两种为可见光(绿色和红色),两种为红外。原始数据集的任务是将图像分类到观察区域的土壤类别中。我们将“红壤”、“灰土”、“潮湿灰土”和“非常潮湿灰土”这些土壤类别定义为正常类别,而异常则从语义上不同的类别“棉花作物”和“有植被茬的土壤”中采样。将原始训练集和测试集合并为一个数据集后,得到的数据集包含5,025个正常实例以及75个随机采样的异常(1.49%),共36个维度。

Thyroid Disease The thyroid dataset is another dataset from the UCI machine learning repository in the medical domain. The raw patient measurements contain categorical attributes as well as missing values; the dataset was therefore preprocessed for the application of neural networks [66] and is also known as the “annthyroid” dataset. We also make use of this preprocessing, resulting in 21 dimensions. Normal instances (healthy non-hypothyroid patients) were taken from the training and test datasets. From the test set, we sampled 250 outliers from the two disease classes (subnormal function and hyperfunction), resulting in a new dataset containing 6,916 records with 3.61% anomalies.

甲状腺疾病。甲状腺数据集是医学领域中来自UCI机器学习存储库的另一个数据集。原始患者测量值包含分类属性和缺失值,因此对其进行了预处理以便应用神经网络[66],该数据集也被称为“annthyroid”数据集。我们也利用这个预处理,得到21个维度。正常实例(健康的非甲状腺功能减退患者)取自训练和测试数据集。从测试集中,我们从两个疾病类别(功能低下和功能亢进)中采样了250个异常值,得到一个包含6916条记录的新数据集,异常率为3.61%。

Statlog Shuttle The shuttle dataset describes radiator positions in a NASA space shuttle with 9 attributes and was designed for supervised anomaly detection. Besides the normal “radiator flow” class, about 20% of the original data describe abnormal situations. To reduce the number of anomalies, we select class 1 as normal and apply stratified sampling for the classes 2, 3, 5, 6 and 7, similar to [67, 68]. Again, training and test set are combined into a single big dataset, which results in 46,464 instances with 1.89% anomalies.

Statlog航天飞机。shuttle数据集描述了NASA航天飞机中的散热器位置,具有9个属性,原本为有监督异常检测而设计。除了正常的“散热器流量”类之外,约20%的原始数据描述了异常情况。为了减少异常的数量,我们选择第1类作为正常类,并对第2、3、5、6、7类应用分层抽样,类似于[67, 68]。同样,训练集和测试集被合并为一个大数据集,结果是46,464个实例,其中1.89%为异常。

Object Images (ALOI) The aloi dataset is derived from the “Amsterdam Library of Object Images” collection [63]. The original collection contains about 110 images of each of 1,000 small objects, taken under different light conditions and viewing angles. From the original images, a 27-dimensional feature vector was extracted using HSB color histograms [38]. Some objects were chosen as anomalies and the data was down-sampled such that the resulting dataset contains 50,000 instances including 3.02% anomalies.

物体图像(ALOI)。aloi数据集来自“阿姆斯特丹物体图像库”(Amsterdam Library of Object Images)[63]。原始图像库包含1000个小物体在不同光照条件和视角下拍摄的图像,每个物体约110幅。从原始图像中,使用HSB颜色直方图提取了27维特征向量[38]。选择一些物体作为异常,并对数据进行降采样,使得结果数据集包含50,000个实例,包括3.02%的异常。

KDD-Cup99 HTTP As mentioned earlier, the kdd99 dataset is often used for unsupervised anomaly detection. Similar to the shuttle dataset, this artificially created dataset was also designed for supervised anomaly detection. It basically contains simulated normal and attack traffic on an IP level in a computer network environment in order to test intrusion detection systems. In the past, the dataset was sometimes used by just sampling randomly from the attacks. Due to the nature of some attacks, for example DDoS, this does not represent a point anomaly detection problem. To best serve our unsupervised evaluation purpose, we decided to use HTTP traffic only (similar to [37]) and to also limit the DoS traffic in the dataset (similar to [69]). To this end, only 500 instances from these attacks are kept. This ensures that we do not have larger clusters among the anomalies. Furthermore, some of the features were adapted: First, the “protocol” and “port” information was removed, since we select HTTP traffic only. The categorical “flags” feature was also removed and the remaining binary categorical features were represented as 0 or 1, resulting in a total of 38 dimensions. The large-scale dataset finally contains 620,089 instances with only 0.17% anomalies. To our knowledge, this is the largest dataset in terms of instances used so far for unsupervised anomaly detection.

KDD-Cup99 HTTP。如前所述,kdd99数据集通常用于无监督异常检测。与shuttle数据集类似,这个人工创建的数据集也是为有监督异常检测而设计的。它基本上包含计算机网络环境中IP级别的模拟正常流量和攻击流量,以测试入侵检测系统。过去,该数据集有时仅通过从攻击中随机采样来使用。由于某些攻击(例如DDoS)的性质,这并不代表点异常检测问题。为了最好地服务于我们的无监督评估目的,我们决定只使用HTTP流量(类似于[37]),并限制数据集中的DoS流量(类似于[69])。为此,这些攻击只保留了500个实例。这确保了我们在异常中没有较大的簇。此外,还对一些特征进行了调整:首先,删除了“协议”和“端口”信息,因为我们只选择HTTP流量;分类的“标志”特征也被移除,其余二元分类特征表示为0或1,共计38个维度。这个大规模数据集最终包含620,089个实例,仅有0.17%的异常。据我们所知,就实例数而言,这是迄今为止用于无监督异常检测的最大数据集。

Dataset Summary

All dataset characteristics are summarized in Table 1. With our dataset selection, we cover a broad spectrum of application domains including medical applications, intrusion detection, image and speech recognition as well as the analysis of complex systems. Additionally, the datasets cover a broad range of properties with regard to dataset size, outlier percentage and dimensionality. To our knowledge, this is the most comprehensive collection of unsupervised anomaly detection datasets for algorithm benchmarking. As already stated, we published the datasets to encourage researchers to compare their proposed algorithms with this work and hope to establish an evaluation standard in the community.

表1总结了所有数据集特征。通过我们的数据集选择,我们涵盖了广泛的应用领域,包括医疗应用、入侵检测、图像和语音识别以及复杂系统的分析。此外,数据集涵盖了关于数据集大小、异常百分比和维度的广泛属性。据我们所知,这是用于算法基准测试的无监督异常检测数据集的最全面的集合。如前所述,我们公布了数据集,以鼓励研究人员将他们提出的算法与这项工作进行比较,并希望在社区中建立一个评估标准。

Comparative Evaluation

Comparing the anomaly detection performance of unsupervised anomaly detection algorithms is not as straightforward as in the classical supervised classification case. In contrast to simply comparing an accuracy value or precision/recall, the ranking of the anomalies should be taken into account. In classification, a wrongly classified instance is for sure a mistake. This is different in unsupervised anomaly detection. For example, if a large dataset contains ten anomalies and they are ranked among the top-15 outliers, this is still a good result, even if it is not perfect. To this end, a common evaluation strategy for unsupervised anomaly detection algorithms is to rank the results according to the anomaly score and then iteratively apply a threshold from the first to the last rank. This results in N tuples (true positive rate and false positive rate), which form a single receiver operating characteristic (ROC) curve. Then, the area under the curve (AUC), the integral of the ROC, can be used as a detection performance measure. A nice interpretation of the AUC is also obtained when following the proof from [70] and transferring it to the anomaly detection domain: The AUC is then the probability that an anomaly detection algorithm assigns a randomly chosen normal instance a lower score than a randomly chosen anomalous instance. Hence, we think the AUC is a perfect evaluation method and ideal for comparison. However, the AUC only takes the ranking into account and completely neglects the relative differences between the scores. Other measures can be used to cope with this shortcoming by using more sophisticated rank comparisons. Schubert et al. [38] compare different rank correlation methodologies, for example Spearman's ρ and Kendall's τ, as alternatives to the AUC, with a focus on outlier ensembles. A second possible drawback of using the AUC might be that it is not ideal for unbalanced class problems, and methods like the area under the precision-recall curve or the Matthews correlation coefficient could possibly better emphasize small changes in detection performance. Nevertheless, AUC-based evaluation has evolved into the de facto standard in unsupervised anomaly detection, most likely due to its practical interpretability, and thus also serves as the measure of choice in our evaluation.

比较无监督异常检测算法的异常检测性能不像在经典有监督分类情况下那样简单。与简单地比较精确度值或精确率/召回率相比,应将异常的排序考虑在内。在分类中,错误分类的实例肯定是错误的。这在无监督异常检测中是不同的。例如,如果一个大数据集包含十个异常,并且它们被排在前15个异常值中,这仍然是一个好结果,即使它不是完美的。为此,无监督异常检测算法的常见评估策略是根据异常分数对结果进行排序,然后从第一名到最后一名迭代地应用阈值。这产生N个元组值(真阳性率和假阳性率),它们构成一条受试者工作特征(ROC)曲线。然后,曲线下面积(AUC),即ROC的积分,可以用作检测性能的度量。当遵循[70]中的证明并将其转换到异常检测领域时,AUC也有一个很好的解释:AUC是异常检测算法给随机选择的正常实例分配比随机选择的异常实例更低分数的概率。因此,我们认为AUC是一种完美的评估方法,也是理想的比较方法。然而,AUC只考虑了排名,完全忽略了分数之间的相对差异。通过使用更复杂的秩比较,可以用其他度量来弥补这个缺点。Schubert等人[38]比较了不同的秩相关方法,例如Spearman的ρ和Kendall的τ,作为AUC的替代方法,重点是离群点集成。使用AUC的第二个可能缺点是它对于类别不平衡的问题并不理想,而像精确率-召回率曲线下面积或Matthews相关系数这样的方法可能更好地突出检测性能的微小变化。然而,基于AUC的评估已经发展成为无监督异常检测中事实上的标准,很可能是由于其实际的可解释性,因此也作为我们评估中选用的度量。
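
In code, the threshold-sweeping procedure described above corresponds to a standard ROC AUC computation over the anomaly scores, for example with scikit-learn (shown here on made-up labels and scores):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Labels: 1 = anomaly, 0 = normal; scores: higher means "more anomalous".
# roc_auc_score internally sweeps a threshold over the ranked scores,
# which is exactly the procedure described in the text.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
scores = np.array([0.10, 0.20, 0.15, 0.30, 0.25, 0.40, 0.35, 0.90, 0.60, 0.32])
print(roc_auc_score(y_true, scores))  # ~0.90: one anomaly is ranked below two normal points
```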

Please note that the AUC, when used in a traditional classification task, typically involves altering a parameter, for example k. In unsupervised anomaly detection, the AUC is computed by varying an outlier threshold over the ordered result list. As a consequence, if a parameter has to be evaluated (for example different values of k), this yields multiple AUC values. Another important question in this context is how to evaluate k, the critical parameter for most of the nearest-neighbor and clustering-based algorithms. In most publications, researchers often fix k to a predefined setting or choose “a good k” depending on a favorable outcome. We believe that the latter is not a fair evaluation, because it somehow involves using the test data (the labels) for training. In our evaluation, we decided to evaluate many different values of k between 10 and 50 and finally report the averaged AUC as well as the standard deviation. In particular, for every k, the AUC is first computed by varying a threshold among the anomaly scores as described above. Then, the mean and standard deviation over all these AUCs are reported. This procedure basically corresponds to a random-k-picking strategy within the given interval, which is often used in practice when k is chosen arbitrarily. The lower bound of 10 was chosen because of statistical fluctuations occurring below this value. For the upper bound, we set a value of 50 such that it is still suitable for our smaller datasets. We are aware that one could argue for increasing the upper bound for the larger datasets or even lowering it for small datasets like breast-cancer. Fig 8 shows a broader evaluation of the parameter k, illustrating that our lower and upper bounds are in fact useful.

请注意,当AUC用于传统的分类任务时,通常需要改变某个参数(例如k)。在无监督异常检测中,AUC是通过在有序结果列表中改变异常阈值来计算的。因此,如果必须评估某个参数(例如不同的k),就会产生多个AUC值。这方面的另一个重要问题是如何评估k,它是大多数基于最近邻和基于聚类的算法的关键参数。在大多数出版物中,研究人员通常将k固定为预先定义的值,或者根据有利的结果选择“一个好的k”。我们认为后者不是公平的评估,因为它在某种程度上相当于使用测试数据(标签)进行训练。在我们的评估中,我们决定评估10到50之间的许多不同k值,并最终报告平均AUC和标准偏差。特别地,对于每个k,首先通过如上所述在异常分数之间改变阈值来计算AUC,然后报告所有这些AUC的平均值和标准偏差。这个过程基本上对应于在给定区间内随机选取k的策略,这在实践中任意选择k时经常出现。选择下限10是因为低于该值时会出现统计波动。对于上限,我们将其设置为50,使其仍然适用于我们较小的数据集。我们知道,有人可能会主张为较大的数据集增加上限,或者为乳腺癌等小数据集进一步降低上限。图8显示了对参数k更广泛的评估,说明我们的下限和上限实际上是有用的。
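
A minimal sketch of this evaluation protocol is shown below; `score_fn` stands for any scorer that returns one anomaly score per instance (for example the k-NN sketch given earlier) and is an assumed placeholder, not part of the published plug-in.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def averaged_auc(score_fn, X, y_true, k_values=range(10, 51)):
    """Evaluation protocol sketch: compute one AUC per value of k and report
    mean and standard deviation, mimicking a random choice of k within the
    interval. score_fn(X, k) is any scorer that returns one anomaly score per
    instance (for example the k-NN scorer sketched earlier)."""
    aucs = [roc_auc_score(y_true, score_fn(X, k)) for k in k_values]
    return float(np.mean(aucs)), float(np.std(aucs))
```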

Results of the Nearest-neighbor based Algorithms


图8。乳腺癌数据集上基于最近邻算法的AUC值。可以看出,小于10的k值往往导致较差的估计,尤其是在考虑局部异常检测算法时。请注意,AUC轴在0.5处截止。

Table 2 shows the results of the nearest-neighbor based algorithms. The AUC values are averaged for $10 \le k \le 50$ and the standard deviation is shown. For LOCI, which does not rely on k, only one result is available. As already stated, LOCI is very computationally intensive and could not be computed within a reasonable time for the larger datasets. Since aLOCI is not a deterministic algorithm, it was run 20 times and the average result was taken. For both LOCI and aLOCI, the parameter settings recommended by the authors were used [47]. When comparing aLOCI with the other algorithms, its results are significantly worse: It is often the worst performing algorithm, and the high standard deviation of its results additionally indicates less stable results. Thus, the use of LOCI and aLOCI is at least questionable in practice. In this context, recall that it is not possible on unlabeled data to determine whether a non-deterministic aLOCI outcome is good or not for a practical application. Furthermore, it can be observed from the table that the results of the two k-NN and LOF variants do not differ much and there is no clear winner or recommendation. At least for LOF, this result was not expected, since the authors claimed that the LOF-UB ensemble performs better in general [43].

表2显示了基于最近邻的算法的结果。AUC值在 $10 \le k \le 50$ 上取平均,并显示标准偏差。对于不依赖k的LOCI,只有一个结果可用。如前所述,LOCI计算量很大,无法在合理的时间内对较大的数据集进行计算。由于aLOCI不是一个确定性算法,它被运行了20次并取平均结果。对于LOCI和aLOCI,均使用了作者推荐的参数设置[47]。当将aLOCI与其他算法进行比较时,其结果明显更差:它通常是性能最差的算法,并且结果的高标准偏差还表明其结果不够稳定。因此,LOCI和aLOCI的使用至少在实践中是有问题的。在这种情况下,请注意在未标记的数据上,无法确定非确定性的aLOCI结果对于实际应用是好是坏。此外,从表中可以看出,两个k-NN和LOF变体的结果没有太大差异,也没有明确的优胜者或推荐。至少对LOF来说,这个结果是出乎意料的,因为作者声称LOF-UB集成总体上表现更好[43]。

Another very important finding from our evaluation can be inferred when comparing the two columns of the pen-global/pen-local datasets. It can be seen that the local anomaly detection algorithms perform much worse on the global anomaly detection task. The same observation can also be made on the (global) shuttle and kdd99 datasets. For the latter, Fig 9 illustrates the superiority of k-NN compared to the local algorithms. A final observation is the generally poor performance of all algorithms on the high-dimensional speech dataset. An AUC of 0.5 shows that the detection performance is as good as a random guess. When we looked into the results in more detail, we could observe that the performance for very small k values is much better (for almost all algorithms). The k values of 2, 3 and 4 show AUCs of up to 0.78 with a quick drop when k is larger. We suspect that due to the high number of dimensions, the curse of dimensionality leads to poor results for k > 5.

当比较pen-global/pen-local数据集的两列时,可以从我们的评估中推断出另一个非常重要的发现。可以看出,局部异常检测算法在全局异常检测任务上表现差得多。此外,在(全局的)shuttle和kdd99数据集上也可以观察到同样的现象。对于后者,图9说明了k-NN相比于局部算法的优越性。最后一个观察是所有算法在高维语音数据集上的总体较差性能。AUC为0.5表明检测性能与随机猜测相当。当我们更详细地查看结果时,可以观察到非常小的k值的性能要好得多(对于几乎所有的算法)。k值为2、3和4时,AUC最高可达0.78,当k值较大时迅速下降。我们怀疑由于维数很高,维数灾难导致k > 5时结果较差。

表2。最近邻算法的结果,显示了所有10个数据集的AUC以及 $10 \le k \le 50$ 下的标准偏差。由于计算复杂度,LOCI无法针对较大的数据集进行计算。

Results of the Clustering-based Algorithms

Table 3 summarizes the results for the clustering-based anomaly detection algorithms. For every algorithm, we used the parameter settings recommended by the authors as defaults: The parameters separating small and large clusters for CBLOF and uCBLOF are α = 95 and β = 5, and for LDCOF and CMGOS γ = 0.3. Additionally, for CMGOS the estimated fraction of normal instances is pn = 0.975. In order to make the clustering-based algorithm evaluation as comparable as possible, we used the same clustering result for every algorithm. For example, for k = 10 we applied k-means and stored the resulting centroids and cluster assignments. Then, all algorithms use this result as a basis for computing their scores. Since k-means is also a non-deterministic algorithm, we ran it 10 times on the same data and followed a common strategy of picking the most stable clustering result.

表3总结了基于聚类的异常检测算法的结果。对于每种算法,我们使用作者推荐的参数设置作为默认值:对于CBLOF和uCBLOF,区分小簇/大簇的参数是α = 95、β = 5,对于LDCOF和CMGOS,γ = 0.3。此外,对于CMGOS,正常实例的估计比例为pn = 0.975。为了使基于聚类的算法评估尽可能具有可比性,我们对每个算法使用相同的聚类结果。例如,对于k = 10,我们应用k-means并存储得到的质心和簇分配,然后所有算法都以这个结果为基础计算它们的分数。由于k-means也是一个非确定性算法,我们在相同的数据上运行了10次,并遵循常见策略,选择最稳定的聚类结果。
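
A sketch of the "run k-means several times and reuse one result" strategy is given below. It keeps the run with the lowest within-cluster sum of squares, which is a common proxy for a stable solution; the exact stability criterion used in the evaluation above may differ, so treat this as an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def best_of_n_kmeans(X, n_clusters, n_runs=10):
    """Run k-means several times with different seeds and keep the run with
    the lowest within-cluster sum of squares (inertia), a common proxy for a
    stable result. All clustering-based scorers then reuse the same centroids
    and cluster assignments, as described above."""
    runs = [KMeans(n_clusters=n_clusters, n_init=1, random_state=seed).fit(X)
            for seed in range(n_runs)]
    return min(runs, key=lambda km: km.inertia_)

X = np.random.RandomState(4).randn(2000, 6)
km = best_of_n_kmeans(X, n_clusters=10)
print(km.inertia_, km.cluster_centers_.shape)
```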

The results show that CBLOF performs poorly in most cases. Especially on the smaller datasets, the algorithm is even worse than random guessing. Since this behavior is very suspicious, we looked into the results in detail: AUCs from 0.10 to 0.94 occurred, but no correlation with k could be found. Our suspicion concerning these results is again the possible flaw of weighting the scores by the number of members in the cluster; especially on small datasets this influence seems disadvantageous. Simply removing the weighting yields much better results, as proven by the results of uCBLOF. Similar to our observation on the nearest-neighbor based algorithms, local methods again tend to perform worse on global anomaly detection tasks. Concerning the different robust estimations of the covariance matrix for CMGOS, two techniques seem to perform well: CMGOS-Red as well as CMGOS-MCD. Due to the much higher computational complexity of the latter, we recommend using CMGOS-Red. The high number of dimensions in the speech dataset causes CMGOS-MCD not to complete within a reasonable amount of time.

结果表明,CBLOF在大多数情况下表现不佳。尤其是在较小的数据集上,该算法甚至比随机猜测还要糟糕。由于这种行为非常可疑,我们详细研究了结果:出现了从0.10到0.94的AUC,但找不到与k的相关性。我们对这些结果的怀疑仍然是按簇内成员数量加权分数的可能缺陷;特别是在小数据集上,这种影响似乎是不利的。简单地移除加权就能得到更好的结果,uCBLOF的结果证明了这一点。与我们对基于最近邻算法的观察类似,局部方法在全局异常检测任务上的表现也更差。关于CMGOS协方差矩阵的不同稳健估计,有两种技术似乎表现良好:CMGOS-Red和CMGOS-MCD。由于后者的计算复杂度高得多,我们推荐使用CMGOS-Red。语音数据集中的高维数导致CMGOS-MCD无法在合理的时间内完成。

Special attention should be paid to the last row of the table, where the best nearest-neighbor method is listed for comparison. It can be seen that nearest-neighbor based unsupervised anomaly detection performs better in general. However, from our point of view a much more severe issue is the reliability, as indicated by the corresponding standard deviations. In practice, when the parameter k cannot be determined and needs to be fixed to a certain value, nearest-neighbor based algorithms generate much more stable results.

应特别注意表格的最后一行,其中列出了最佳最近邻法进行比较。可以看出,基于最近邻的无监督异常检测总体上表现更好。然而,从我们的角度来看,一个更严重的问题是由相应的标准差指示的可靠性。在实践中,当参数k无法确定并且需要固定为某个值时,基于最近邻的算法会产生更稳定的结果

图9。0<k<100的大kdd99数据集的AUC值。很容易看出,对于这种全局异常检测挑战,局部异常检测算法的性能很差。

表3。基于聚类的算法的结果,针对不同的初始k($10 \le k \le 50$)。最后一行显示了每个数据集上最佳最近邻方法的结果以供比较。

Results of other Algorithms

In the remainder, the four algorithms that do not belong to one of the groups above are evaluated. Table 4 shows the results for the statistical HBOS, the subspace-based rPCA algorithm as well as the results for the two one-class SVM methods. The evaluation of HBOS is performed similarly to the previous algorithms, whereas here k refers to the number of bins used.

在剩下的算法中,评估不属于上述组之一的四种算法。表4显示了统计HBOS、子空间rPCA算法以及两种一类SVM方法的结果。HBOS的评估类似于前面的算法,而这里的k指的是使用的箱数。
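
To make the role of the bin parameter concrete, the following is a minimal sketch of the histogram-based scoring idea behind HBOS with static-width bins; the published algorithm additionally supports dynamic bin widths and normalized histogram heights, which are omitted in this assumed simplification.

```python
import numpy as np

def hbos_scores(X, n_bins=20):
    """Minimal sketch of the histogram-based idea behind HBOS: each feature is
    modeled by an independent static-width histogram, and the anomaly score is
    the sum of log-inverse bin densities over all features."""
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        hist, edges = np.histogram(X[:, j], bins=n_bins, density=True)
        hist = np.maximum(hist, 1e-12)  # avoid log(0) for empty bins
        bin_idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        scores += np.log(1.0 / hist[bin_idx])
    return scores

X = np.random.RandomState(5).randn(1000, 5)
print(np.argsort(hbos_scores(X))[-5:])  # the five highest-scoring instances
```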

The constraint $10 \le k \le 50$ was also used for HBOS. rPCA has no critical parameter k to be evaluated, but as described earlier, a different number of components can be used. In total, we applied four different strategies: (1) using all components, (2) using major components only, (3) using minor components only and (4) using major and minor components while neglecting the ones in the middle. We found that the results of the four strategies are very similar (for some datasets even identical) and therefore report the averaged AUC in the table. The parameter ν for the one-class SVM as well as β for the enhanced η one-class SVM were also altered in the range $0.2 \le \nu \le 0.8$ and the average AUC was reported. Choosing these parameters seems less critical than choosing a correct k for other algorithms; a setting of β/ν = 0.5 seems to be a good choice on average. However, the parameter value should not be set too small in order to avoid an incorrect density estimation. Furthermore, a Gaussian kernel with automatic gamma tuning was used. Tuning this parameter automatically is ideal for unsupervised anomaly detection, but requires a significant amount of computation time.

约束 $10 \le k \le 50$ 也用于HBOS。rPCA没有需要评估的关键参数k,但如前所述,可以使用不同数量的成分。总之,我们应用了四种不同的策略:(1)使用所有成分,(2)仅使用主要成分,(3)仅使用次要成分,(4)使用主要和次要成分而忽略中间的成分。我们发现四种策略的结果非常相似(对于一些数据集甚至相同),因此在表中报告平均AUC。一类SVM的参数ν以及增强型η一类SVM的参数β也在 $0.2 \le \nu \le 0.8$ 范围内变化,并报告平均AUC。选择这些参数似乎没有为其他算法选择正确的k那么关键:平均而言,β/ν = 0.5 的设置似乎是一个不错的选择。但是,参数值不应设置得太小,以免密度估计不正确。此外,使用了具有自动伽马调参的高斯核。自动调整该参数对于无监督异常检测是理想的,但需要大量的计算时间。

The results of the four algorithms are very diverse. For us, the biggest surprise was the good performance of HBOS on the larger (global) datasets. For kdd99, the result is almost perfect, while on the thyroid dataset it outperformed all the other algorithms by far. The other algorithm with a rather simple (linear) model, rPCA, showed average performance. One exception is the shuttle dataset, where rPCA, together with HBOS, obtains the best results. One-class SVMs turned out not to be outstanding; their results are average. When comparing the enhanced η one-class SVM with the regular one, the latter seems to perform better.

四种算法的结果差异很大。对我们来说,最大的惊喜是HBOS在较大的(全局)数据集上的良好表现。对于kdd99,结果几乎是完美的,而在甲状腺数据集上,它远远优于所有其他算法。另一个采用相当简单(线性)模型的算法rPCA表现一般。一个例外是shuttle数据集,rPCA与HBOS一起获得最佳结果。一类支持向量机的表现并不突出,结果一般。当比较增强型η一类SVM和常规一类SVM时,后者似乎表现更好。

表4。其余无监督异常检测算法的AUC结果。rPCA采用了四种不同的成分保留策略,而HBOS则改变了分箱(bins)的数量。

Computation Time Comparison of Algorithms

Where determinable, the theoretical computational complexity of the evaluated algorithms has already been discussed. However, in practice, the actual computation times may still differ considerably from each other. For this reason, the computation times were measured and are listed in Table 5. Please note that the listed times are measured in seconds for the first nine datasets and in minutes for the last column, the large kdd99 dataset. The time was measured on a single-thread basis. For the clustering-based algorithms, the computation time depends strongly on the number of clusters. For that reason, the times for 10 and 50 clusters are listed separately for each algorithm.

如果可以确定,已经讨论了评估算法的理论计算复杂性。然而,在实践中,实际计算时间可能仍有很大差异。为此,测量了计算时间,并在表5中列出。请注意,前九个数据集的列出时间以秒为单位,最后一列(大kdd99数据集)以分钟为单位。时间是在单线程的基础上测量的。对于基于聚类的算法,计算时间很大程度上取决于聚类的数量。因此,每个算法分别列出了10个和50个聚类/簇的时间。

Except for the very demanding LOCI algorithm, it can be seen that the computation for the small datasets is sufficiently fast, such that the choice of an appropriate algorithm should focus on detection performance, not on runtime. On the contrary, for large datasets, the differences in computation time are significant. For example, for the largest dataset kdd99, the fastest algorithm HBOS took less than 4 seconds, whereas the slowest, CMGOS-MCD, took more than 6 days.

除了要求非常高的LOCI算法之外,可以看出,对于小数据集的计算足够快,因此选择合适的算法应该关注检测性能,而不是运行时间。相反,对于大型数据集,计算时间差异很大。例如,对于最大的数据集kdd99,最快的算法HBOS用时不到4秒,而最慢的CMGOS-MCD用时超过6天。

In general, it can be observed that the nearest-neighbor based algorithms have almost identical runtimes. This is due to the fact that the nearest-neighbor search is responsible for most of the computation time, whereas the (different) computation of the scores itself has almost no influence. Furthermore, it can be confirmed that the clustering-based algorithms (except for CMGOS-MCD) are faster than the nearest-neighbor based algorithms with their quadratic search complexity. Please keep in mind that the time includes the execution of ten different runs of the underlying k-means algorithm. At this point, we would like to state again that the use of CMGOS-MCD is not recommended. HBOS is by far the fastest algorithm of all, which is due to its very simple assumption of feature independence. The comparably high runtimes of the SVM-based algorithms are mainly caused by the automatic gamma tuning, which has quadratic complexity. For example, for the aloi dataset, the gamma tuning takes about 16 hours, whereas the core SVM training takes only 30 seconds for the η one-class SVM and 16 minutes for the regular one-class SVM.

一般来说,可以观察到基于最近邻的算法具有几乎相同的运行时间。这是因为最近邻搜索占用了大部分计算时间,而分数本身的(不同)计算几乎没有影响。此外,可以确认基于聚类的算法(CMGOS-MCD除外)比具有二次搜索复杂度的基于最近邻的算法更快。请记住,该时间包括底层k-means算法十次不同运行的执行。在这一点上,我们想再次声明,不建议使用CMGOS-MCD。HBOS是迄今为止最快的算法,这归功于它假设特征相互独立这一非常简单的思想。基于SVM的算法运行时间相对较高,主要是由于自动伽马调参,其具有二次复杂度。例如,对于aloi数据集,伽马调参需要大约16个小时,而核心SVM训练对于η一类SVM只需30秒,对于常规一类SVM则需16分钟。
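
The quadratic cost of automatic gamma tuning is easy to see from one common tuning heuristic, the median of all pairwise squared distances, sketched below. This is only an assumed example of such a tuning procedure; the plug-in's exact method is not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_gamma(X, max_samples=5000, seed=0):
    """One common automatic choice for the RBF gamma: the inverse of the
    median squared pairwise distance. Computing all pairwise distances is
    O(n^2), which is why such a tuning step dominates the SVM runtime on
    large datasets; sub-sampling is a typical way to keep it tractable."""
    rng = np.random.RandomState(seed)
    if len(X) > max_samples:
        X = X[rng.choice(len(X), max_samples, replace=False)]
    median_sq_dist = np.median(pdist(X, metric="sqeuclidean"))
    return 1.0 / median_sq_dist

X = np.random.RandomState(6).randn(3000, 10)
print(median_heuristic_gamma(X))
```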

表5。不同算法的计算时间比较显示出巨大的差异,尤其是对于较大的数据集。表中前九个数据集的单位为秒,最后一个数据集(kdd99)的单位为分钟。


Conclusion

A comprehensive evaluation of 19 different unsupervised anomaly detection algorithms on 10 datasets from different application domains has been performed for the first time. The algorithms have been released as an open-source extension for the RapidMiner data mining software (available at http://git.io/vnD6n). Additionally, the datasets have been made publicly available (http://dx.doi.org/10.7910/DVN/OPQMVF), and therefore a foundation for a fair and standardized comparison within the community has been introduced. Besides supporting the unsupervised anomaly detection research community, we also believe that our study and our implementation are useful for researchers from neighboring fields. Now, it is easy to apply the discussed methods to new data. The broad variety of our evaluation datasets may guide appropriate algorithm selection in new application domains.

首次对来自不同应用领域的10个数据集上的19种不同的无监督异常检测算法进行了综合评估。这些算法已经作为RapidMiner数据挖掘软件的开源扩展发布(http://git.io/vnD6n)。此外,数据集已经公开(http://dx.doi.org/10.7910/DVN/OPQMVF),因此引入了一个公平和标准化的社区比较基础。除了支持无监督异常检测研究社区,我们还相信我们的研究和实现对来自邻近领域的研究人员是有用的。现在,将所讨论的方法应用于新数据是很容易的。我们广泛多样的评估数据集可能为新应用领域的适当算法选择提供指导。

Table 6. Recommendations for algorithm selection. Qualitative judgments are given from very bad (− −) over average (o) to very good (++).

In particular, our findings include that local anomaly detection algorithms, such as LOF, COF, INFLO and LoOP, tend to perform poorly on datasets containing global anomalies by generating many false positives. The usage of these algorithms should be avoided if it is known that the task is to detect global anomalies only. On the contrary, global anomaly detection algorithms perform at least averagely on local problems. This leads to our recommendation to select a global anomaly detection algorithm if there is no further knowledge about the nature of the anomalies in the dataset to be analyzed.

特别是,我们的发现包括:局部异常检测算法,如LOF、COF、INFLO和LoOP,往往在包含全局异常的数据集上表现不佳,产生许多误报。如果已知任务只是检测全局异常,则应避免使用这些算法。相反,全局异常检测算法在局部问题上至少表现中等。由此我们建议,如果没有关于待分析数据集中异常性质的进一步知识,则选择全局异常检测算法。

As a general detection performance result, we can conclude that nearest-neighbor based algorithms perform better in most cases when compared to clustering algorithms. Also, the stability with respect to an imperfect choice of k is much higher for the nearest-neighbor based methods. The higher variance of the clustering-based algorithms is very likely due to the non-deterministic nature of the underlying k-means clustering algorithm. Despite this disadvantage, clustering-based algorithms have a lower computation time. As a conclusion, we recommend preferring nearest-neighbor based algorithms if computation time is not an issue. If a faster computation is required for large datasets, for example in a near real-time setting, clustering-based anomaly detection might be the method of choice. For small datasets, clustering-based methods should be avoided.

作为一般的检测性能结果,我们可以得出结论:与聚类算法相比,基于最近邻的算法在大多数情况下表现更好。此外,基于最近邻的方法对于k选择不完美时的稳定性要高得多。基于聚类的算法方差较高,很可能是由于底层k-means聚类算法的非确定性。尽管有这个缺点,基于聚类的算法计算时间较短。总之,如果计算时间不是问题,我们建议优先选择基于最近邻的算法。如果大型数据集需要更快的计算速度,例如在接近实时的环境中,基于聚类的异常检测可能是首选方法。对于小数据集,应避免使用基于聚类的方法。

Among the nearest-neighbor based methods, the global k-NN algorithm is a good candidate on average. Although LoOP was the best performing nearest-neighbor based algorithm on four datasets, it unfortunately fails significantly on some datasets. For global anomaly detection problems in particular, this algorithm should be avoided entirely. Besides our recommendation for k-NN, LOF is also a good candidate if it is known in advance that the anomaly detection problem to be solved involves local anomalies.

在基于最近邻的方法中,全局k-NN算法平均来说是一个很好的候选算法。虽然LoOP是四个数据集上性能最好的基于最近邻的算法,但不幸的是,它在一些数据集上明显失败。特别是对于全局异常检测问题,该算法应该完全避免。除了我们推荐的k-NN之外,如果事先知道要解决的异常检测问题涉及局部异常,LOF也是一个很好的候选对象。

Concerning the clustering-based algorithms, the simple uCBLOF algorithm also shows on average good performance for all datasets, illustrating that a more sophisticated and compute-intensive density estimation is not necessarily required. In terms of computational complexity, clustering-based algorithms are faster than their nearest-neighbor competitors. However, in practice, we advise restarting the underlying k-means algorithm multiple times in order to obtain a stable clustering outcome. This procedure unfortunately often takes away the advantage of the theoretical speedup, which on the small datasets even leads to longer runtimes compared with the nearest-neighbor based algorithms. Nevertheless, when processing speed is very important or a clustering model can be updated in a data streaming application, a clustering-based algorithm might be used. Besides our recommendation for uCBLOF, CMGOS-Reg also seems to perform reliably on most of the datasets. On the contrary, the original CBLOF algorithm should be avoided due to an algorithm design flaw. Also, the CMGOS with the subspace-based MCD density estimation should not be the first choice, since the density estimation is too slow and the detection performance is worse.

关于基于聚类的算法,简单的uCBLOF算法在所有数据集上也显示出平均而言良好的性能,说明并不一定需要更复杂、计算量更大的密度估计。就计算复杂度而言,基于聚类的算法比其最近邻竞争对手更快。然而,在实践中,我们建议多次重新启动底层的k-means算法,以获得稳定的聚类结果。不幸的是,这一过程常常抵消了理论上的加速优势,在小数据集上甚至导致比基于最近邻的算法更长的运行时间。然而,当处理速度非常重要,或者聚类模型可以在数据流应用中更新时,可以使用基于聚类的算法。除了我们对uCBLOF的推荐之外,CMGOS-Reg似乎也在大多数数据集上表现可靠。相反,由于算法设计缺陷,应避免使用原始的CBLOF算法。此外,采用基于子空间的MCD密度估计的CMGOS不应作为首选,因为其密度估计太慢且检测性能较差。

The statistical algorithm HBOS, which assumes independence of the features, surprisingly showed very good results in our evaluation. It is even the best performing algorithm on four out of our 10 datasets. Due to its very fast computation time, especially for large datasets, we highly recommend giving it a try on large-scale datasets when looking for global anomalies.

假设特征独立的统计算法HBOS在我们的评估中令人惊讶地显示出非常好的结果。它甚至是我们10个数据集中4个数据集上表现最好的算法。由于计算时间非常快,尤其是对于大型数据集,我们强烈建议在寻找全局异常时在大规模数据集上尝试一下。

One dataset with 400 dimensions was a big challenge for all of the algorithms, most likely due to the curse of dimensionality. In this context, only nearest-neighbor based algorithms with a very small k < 5 were useful at all. Since in unsupervised anomaly detection k can typically not be determined, we might conclude that unsupervised anomaly detection fails on such a high number of dimensions.

一个400维的数据集对所有算法来说都是一个巨大的挑战,很可能是由于维数灾难。在这种情况下,只有 k<5 的最近邻算法才是有用的。因为在无监督异常检测中,k通常不能确定,所以我们可以得出结论,无监督异常检测在如此高的维数上失败。

As a general summary for algorithm selection, we recommend using nearest-neighbor based methods instead of clustering-based methods, in particular k-NN for global tasks and LOF for local tasks. If computation time is essential, HBOS is a good candidate, especially for larger datasets. Special attention should be paid to the nature of the dataset when applying local algorithms, and to whether local anomalies are of interest at all in this case. We have summarized our recommendations for algorithm selection in Table 6 with respect to the anomaly detection performance (accuracy), the stability of the scoring (deterministic), the sensitivity to parameters, the computation time for larger datasets (speed) and whether the algorithm is applicable to datasets having global anomalies only. Please note that the judgments in the table assume that the general recommendations given above are followed.

作为算法选择的一般总结,我们建议使用基于最近邻的方法,特别是用于全局任务的k-NN和用于局部任务的LOF,而不是基于聚类的方法。如果计算时间至关重要,HBOS是一个很好的选择,尤其是对于较大的数据集。应用局部算法时,应特别注意数据集的性质,以及在这种情况下是否对局部异常感兴趣。我们在表6中总结了我们对算法选择的建议,涉及异常检测性能(准确性)、评分的稳定性(确定性)、对参数的敏感性、较大数据集的计算时间(速度)以及算法是否仅适用于具有全局异常的数据集。请注意,表中的判断假设遵循了上面给出的一般建议。

Acknowledgments

This research is supported by The Japan Science and Technology Agency (JST) through its “Center of Innovation Science and Technology based Radical Innovation and Entrepreneurship Program (COI Program)”.

Author Contributions

Conceived and designed the experiments: MG SU. Performed the experiments: MG. Analyzed the data: MG SU. Contributed reagents/materials/analysis tools: MG. Wrote the paper: MG SU.
