
Anomaly detection

In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well-defined notion of normal behavior.[1] Such examples may arouse suspicions of being generated by a different mechanism,[2] or appear inconsistent with the remainder of that set of data.[3]

For broader coverage of this topic, see Outlier.

Anomaly detection finds application in many domains, including cybersecurity, medicine, machine vision, statistics, neuroscience, law enforcement and financial fraud, to name only a few. Anomalies were initially sought so that they could be rejected or omitted from the data to aid statistical analysis, for example when computing the mean or standard deviation. They were also removed to improve the predictions of models such as linear regression, and more recently their removal has aided the performance of machine learning algorithms. However, in many applications anomalies themselves are the observations of greatest interest in the entire data set, and they need to be identified and separated from noise or irrelevant outliers.


Three broad categories of anomaly detection techniques exist.[1] Supervised anomaly detection techniques require a data set in which instances have been labeled as "normal" or "abnormal", and involve training a classifier. However, this approach is rarely used in anomaly detection because labeled data are generally unavailable and the classes are inherently imbalanced. Semi-supervised anomaly detection techniques assume that some portion of the data is labeled. This may be any combination of normal or anomalous data, but more often than not the techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood that a test instance was generated by the model. Unsupervised anomaly detection techniques assume the data is unlabeled and are by far the most commonly used due to their wider applicability.
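To make the semi-supervised setting described above concrete, the following minimal sketch fits a one-class model to normal-only training data and then tests new instances against it. It assumes scikit-learn's OneClassSVM and synthetic data; the nu parameter and the specific points are illustrative choices, not part of any cited method.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical "normal" training data: points clustered around the origin
rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(200, 2))

# Construct a model of normal behavior from normal data only
model = OneClassSVM(nu=0.05, gamma="scale").fit(X_normal)

# Test instances: one plausible point and one far-away point
X_test = np.array([[0.5, -0.3], [6.0, 6.0]])
print(model.predict(X_test))  # +1 = consistent with normal data, -1 = anomalous
```

The key design point is that the model never sees anomalous examples during training; anomalies are defined only by their disagreement with the learned model of normality.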

Anomaly detection has been defined in several closely related ways:

An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.[2]

Anomalies are instances or collections of data that occur very rarely in the data set and whose features differ significantly from most of the data.

An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data.[3]

An anomaly is a point or collection of points that is relatively distant from other points in the multi-dimensional space of features.

Anomalies are patterns in data that do not conform to a well-defined notion of normal behaviour.[1]

History

Intrusion detection

The concept of intrusion detection, a critical component of anomaly detection, has evolved significantly over time. Initially, it was a manual process where system administrators would monitor for unusual activities, such as a vacationing user's account being accessed or unexpected printer activity. This approach was not scalable and was soon superseded by the analysis of audit logs and system logs for signs of malicious behavior.[5]


By the late 1970s and early 1980s, the analysis of these logs was primarily used retrospectively to investigate incidents, as the volume of data made it impractical for real-time monitoring. The affordability of digital storage eventually led to audit logs being analyzed online, with specialized programs being developed to sift through the data. These programs, however, were typically run during off-peak hours due to their computational intensity.[5]


The 1990s brought the advent of real-time intrusion detection systems capable of analyzing audit data as it was generated, allowing for immediate detection of and response to attacks. This marked a significant shift towards proactive intrusion detection.[5]


As the field has continued to develop, the focus has shifted to creating solutions that can be efficiently implemented across large and complex network environments, adapting to the ever-growing variety of security threats and the dynamic nature of modern computing infrastructures.[5]

Simple statistical techniques for detecting outliers include the following (a minimal z-score sketch is given after the list):

Z-score

Tukey's range test

Grubbs's test
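As a concrete example of the first item above, here is a minimal z-score detector. The threshold of 3 standard deviations is a common convention rather than part of any single published method, and the sample data are invented for illustration.

```python
import numpy as np

def zscore_outliers(x: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Return a boolean mask marking points more than `threshold`
    standard deviations from the sample mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

x = np.array([10.1, 9.8, 10.0, 10.2, 9.9, 10.1,
              10.0, 9.7, 10.3, 9.9, 10.1, 25.0])
print(zscore_outliers(x))  # only the final value is flagged
```

Note that an extreme value inflates the sample standard deviation, so in very small samples no point can exceed the threshold; robust variants substitute the median and the median absolute deviation for the mean and standard deviation.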

Many of the methods discussed above only yield an anomaly score prediction, which often can be explained to users as the point being in a region of low data density (or relatively low density compared to the neighbors' densities). In explainable artificial intelligence, users demand methods with higher explainability. Some methods allow for more detailed explanations:

The Subspace Outlier Degree (SOD) identifies the attributes in which a sample is normal and those in which it deviates from what is expected.[31]

Correlation Outlier Probabilities (COP) compute an error vector describing how a sample point deviates from an expected location, which can be interpreted as a counterfactual explanation: the sample would be normal if it were moved to that location.[32]
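To make the notion of a density-based anomaly score concrete, the following sketch scores each point by its mean distance to its k nearest neighbours, so that points in sparse regions receive high scores. This is a deliberately simple stand-in for the idea, not the SOD or COP algorithms themselves; the synthetic data and the choice k = 5 are arbitrary.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D data: a dense cluster plus one distant point
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])

# Score each point by its mean distance to its k nearest neighbours;
# points in low-density regions receive high scores.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own neighbour
dist, _ = nn.kneighbors(X)
scores = dist[:, 1:].mean(axis=1)  # drop the zero self-distance

print(X[scores.argmax()])  # the injected outlier at (8, 8)
```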

Software

ELKI is an open-source Java data mining toolkit that contains several anomaly detection algorithms, as well as index acceleration for them.

PyOD is an open-source Python library developed specifically for anomaly detection.[51]

scikit-learn is an open-source Python library that contains some algorithms for unsupervised anomaly detection.

Wolfram Mathematica provides functionality for unsupervised anomaly detection across multiple data types.[52]
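As a brief illustration of the scikit-learn entry above, the following sketch flags outliers with an isolation forest, one of the unsupervised detectors the library ships. The synthetic data, contamination rate and random seeds are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Small synthetic sample: inliers around the origin plus two obvious outliers
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), [[4.0, 4.0], [-4.0, 3.5]]])

# fit_predict returns +1 for inliers and -1 for outliers
labels = IsolationForest(contamination=0.05, random_state=0).fit_predict(X)
print(np.where(labels == -1)[0])  # indices flagged, typically the two injected points
```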

Datasets

Anomaly detection benchmark data repository with carefully chosen data sets of the Ludwig-Maximilians-Universität München; Mirror Archived 2022-03-31 at the Wayback Machine at the University of São Paulo.

ODDS – a large collection of publicly available outlier detection datasets with ground truth in different domains.

Unsupervised Anomaly Detection Benchmark at Harvard Dataverse: datasets for unsupervised anomaly detection with ground truth.

KMASH Data Repository at Research Data Australia, with more than 12,000 anomaly detection datasets with ground truth.

See also

Change detection

Statistical process control

Novelty detection

Hierarchical temporal memory