In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well-defined notion of normal behaviour. Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data.

Anomaly detection finds application in many domains, including cyber security, medicine, machine vision, statistics, neuroscience, law enforcement and financial fraud, to name only a few. Anomalies were initially searched for so that they could be clearly rejected or omitted from the data to aid statistical analysis, for example when computing the mean or standard deviation. They were also removed to obtain better predictions from models such as linear regression, and more recently their removal has aided the performance of machine learning algorithms. However, in many applications anomalies themselves are of interest and are the most sought-after observations in the entire data set, which need to be identified and separated from noise or irrelevant outliers.

Three broad categories of anomaly detection techniques exist. Supervised anomaly detection techniques require a data set that has been labelled as "normal" and "abnormal" and involve training a classifier. However, this approach is rarely used in anomaly detection due to the general unavailability of labelled data and the inherently unbalanced nature of the classes. Semi-supervised anomaly detection techniques assume that some portion of the data is labelled. This may be any combination of the normal or anomalous data, but more often than not the techniques construct a model representing normal behaviour from a given normal training data set, and then test the likelihood of a test instance being generated by the model. Unsupervised anomaly detection techniques assume the data is unlabelled and are by far the most commonly used due to their wider and relevant application.
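The semi-supervised idea described above (fit a model to data known to be normal, then ask how likely a new instance is under that model) can be sketched in a few lines. This is only an illustration under stated assumptions: the model is a univariate Gaussian, the training values are made up, and the cut-off rule (lowest training log-likelihood minus a margin of 2.0) is a hypothetical choice, not a recommended setting.

```python
import math
import statistics


def fit_normal_model(train):
    """Estimate mean and standard deviation from data assumed to be normal."""
    mu = statistics.fmean(train)
    sigma = statistics.stdev(train)
    return mu, sigma


def log_likelihood(x, mu, sigma):
    """Log-density of x under a Gaussian with the fitted parameters."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2 * sigma ** 2))


# Hypothetical training set that contains only normal readings.
normal_train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
mu, sigma = fit_normal_model(normal_train)

# Assumed cut-off: the lowest log-likelihood seen in training, minus a margin.
threshold = min(log_likelihood(x, mu, sigma) for x in normal_train) - 2.0


def is_anomalous(x):
    """Flag x when it is less likely under the model than the cut-off allows."""
    return log_likelihood(x, mu, sigma) < threshold


print(is_anomalous(10.05))  # a reading close to the training data
print(is_anomalous(12.0))   # a reading far from the training data
```

In practice the model family and the cut-off would be chosen for the data at hand; the point here is only the two-step structure of fitting on normal data and scoring test instances.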

Definition

Many attempts have been made in the statistical and computer science communities to define an anomaly. The most prevalent ones include:
* An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.
* Anomalies are instances or collections of data that occur very rarely in the data set and whose features differ significantly from most of the data.
* An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data.
* An anomaly is a point or collection of points that is relatively distant from other points in multi-dimensional space of features.
* Anomalies are patterns in data that do not conform to a well-defined notion of normal behaviour.
* Let T be observations from a univariate Gaussian distribution and O a point from T. Then the z-score for O is greater than a pre-selected threshold if and only if O is an outlier.
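The z-score definition in the last item can be made concrete with a short sketch. It assumes the sample mean and population standard deviation stand in for the Gaussian parameters, compares the absolute z-score to the threshold so that unusually low values are flagged too, and uses a threshold of 3.0 and made-up data purely for illustration.

```python
import statistics


def z_scores(values):
    """Return the z-score of each value relative to the whole sample."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing deviates from the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Hypothetical sample: twenty values near 10 and one extreme value.
sample = [9, 11] * 10 + [30]
print(z_score_outliers(sample))  # [30]
```

Note that the extreme value itself inflates the mean and standard deviation used to score it, so in small samples a single z-score threshold can fail to flag even very distant points.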

Applications

Anomaly detection is applicable in a very large number and variety of domains, and is an important subarea of unsupervised machine learning. As such it has applications in cyber-security intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, detecting ecosystem disturbances, defect detection in images using machine vision, medical diagnosis and law enforcement.

Anomaly detection was proposed for intrusion detection systems (IDS) by Dorothy Denning in 1986. Anomaly detection for IDS is normally accomplished with thresholds and statistics, but can also be done with soft computing and inductive learning. Types of statistics proposed by 1999 included profiles of users, workstations, networks, remote hosts, groups of users, and programs based on frequencies, means, variances, covariances, and standard deviations. The counterpart of anomaly detection in intrusion detection is misuse detection.

Anomaly detection is also often used in preprocessing to remove anomalous data from the dataset. This is done for a number of reasons. Statistics of data such as the mean and standard deviation are more accurate after the removal of anomalies, and the visualisation of data can also be improved. In supervised learning, removing the anomalous data from the dataset often results in a statistically significant increase in accuracy. Anomalies are also often the most important observations in the data to be found, such as in intrusion detection or detecting abnormalities in medical images.
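The effect of preprocessing on summary statistics can be seen in a small sketch. The removal rule used here, a modified z-score based on the median and the median absolute deviation (MAD) with a cut-off of 3.5, is one possible choice made for this example and is not prescribed by the text; the readings are invented.

```python
import statistics


def remove_outliers_mad(values, cutoff=3.5):
    """Drop points whose modified z-score (median/MAD based) exceeds cutoff."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median; keep everything rather than divide by zero.
        return list(values)
    return [v for v in values if 0.6745 * abs(v - med) / mad <= cutoff]


# Hypothetical sensor readings with one faulty measurement (85.0).
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 85.0, 20.1, 19.7, 20.0]
cleaned = remove_outliers_mad(readings)

print("mean before:", statistics.fmean(readings))
print("mean after: ", statistics.fmean(cleaned))
print("stdev before:", statistics.stdev(readings))
print("stdev after: ", statistics.stdev(cleaned))
```

With the faulty reading removed, the mean and standard deviation describe the bulk of the data instead of being dominated by a single point.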

Popular techniques

Many anomaly detection techniques have been proposed in the literature. Some of the popular techniques are:
* Statistical (Z-score, Tukey's range test and Grubbs's test)
* Density-based techniques (k-nearest neighbor, local outlier factor, isolation forests, and many more variations of this concept)
* Subspace-, correlation-based and tensor-based outlier detection for high-dimensional data
* One-class support vector machines
* Replicator neural networks, autoencoders, variational autoencoders, long short-term memory neural networks
* Bayesian networks
* Hidden Markov models (HMMs)
* Minimum Covariance Determinant
* Clustering: cluster analysis-based outlier detection
* Deviations from association rules and frequent itemsets
* Fuzzy logic-based outlier detection
* Ensemble techniques, using feature bagging, score normalization and different sources of diversity

The performance of methods depends on the data set and parameters, and methods have little systematic advantage over one another when compared across many data sets and parameters.
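As an illustration of the k-nearest-neighbor family listed above, the following sketch scores each point by the distance to its k-th nearest neighbour, so that isolated points receive large scores. It is a deliberately simple brute-force variant written for this example, not the local outlier factor and not the implementation of any particular library; the points and the choice k=2 are assumptions.

```python
import math


def knn_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbour."""
    if k >= len(points):
        raise ValueError("k must be smaller than the number of points")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical 2-D data: a tight cluster and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (8.0, 8.0)]
scores = knn_scores(points, k=2)

# The point with the largest score is the strongest outlier candidate.
most_anomalous = points[scores.index(max(scores))]
print(most_anomalous)  # (8.0, 8.0)
```

The brute-force loop compares every pair of points, which is fine for a sketch but scales quadratically; toolkits typically rely on index structures to speed up the neighbour search.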

Software

* ELKI is an open-source Java data mining toolkit that contains several anomaly detection algorithms, as well as index acceleration for them.
* PyOD is an open-source Python library developed specifically for anomaly detection.
* scikit-learn is an open-source Python library that has built-in functionality for unsupervised anomaly detection.
* Wolfram Mathematica provides functionality for unsupervised anomaly detection across multiple data types (see the Mathematica documentation).

Datasets

* Anomaly detection benchmark data repository with carefully chosen data sets of the Ludwig-Maximilians-Universität München; mirror at University of São Paulo.
* ODDS: A large collection of publicly available outlier detection datasets with ground truth in different domains.
* Unsupervised Anomaly Detection Benchmark at Harvard Dataverse: Datasets for Unsupervised Anomaly Detection with ground truth.
* KMASH Data Repository at Research Data Australia, with more than 12,000 anomaly detection datasets with ground truth.

References