Discussion questions:
Compare and contrast the different techniques for anomaly detection that were presented in Chapter 9.
Discuss techniques for combining multiple anomaly detection techniques to improve the identification of anomalous objects.
Lecture Notes for Chapter 9
Introduction to Data Mining, 2nd Edition
by Tan, Steinbach, Karpatne, Kumar
11/20/2020 Introduction to Data Mining, 2nd Edition 1
What are anomalies/outliers?
– The set of data points that are
considerably different than the
remainder of the data
Natural implication is that anomalies are
relatively rare
– One in a thousand occurs often if you have lots of data
– Context is important, e.g., freezing temps in July
Can be important or a nuisance
– A 10-foot-tall 2-year-old
– Unusually high blood pressure
Ozone Depletion History
In 1985 three researchers (Farman,
Gardiner and Shanklin) were
puzzled by data gathered by the
British Antarctic Survey showing that
ozone levels for Antarctica had
dropped 10% below normal levels
Why did the Nimbus 7 satellite,
which had instruments aboard for
recording ozone levels, not record
similarly low ozone concentrations?
The ozone concentrations recorded
by the satellite were so low they
were being treated as outliers by a
computer program and discarded!
Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html
Causes of Anomalies
Data from different classes
– Measuring the weights of oranges, but a few grapefruit
are mixed in
Natural variation
– Unusually tall people
Data errors
– A 200-pound 2-year-old
Distinction Between Noise and Anomalies
Noise is erroneous, perhaps random, values or
contaminating objects
– Weight recorded incorrectly
– Grapefruit mixed in with the oranges
Noise doesn’t necessarily produce unusual values or
objects
Noise is not interesting
Anomalies may be interesting if they are not a result of
noise
Noise and anomalies are related but distinct concepts
General Issues: Number of Attributes
Many anomalies are defined in terms of a single attribute
– Height
– Shape
– Color
Can be hard to find an anomaly using all attributes
– Noisy or irrelevant attributes
– Object is only anomalous with respect to some attributes
However, an object can be anomalous overall even if it is not
anomalous in any single attribute
General Issues: Anomaly Scoring
Many anomaly detection techniques provide only a binary
categorization
– An object is an anomaly or it isn’t
– This is especially true of classification-based approaches
Other approaches assign a score to all points
– This score measures the degree to which an object is an anomaly
– This allows objects to be ranked
In the end, you often need a binary decision
– Should this credit card transaction be flagged?
– Still useful to have a score
How many anomalies are there?
Other Issues for Anomaly Detection
Find all anomalies at once or one at a time
– Swamping: normal objects are incorrectly flagged as anomalies
– Masking: the presence of some anomalies hides others
Evaluation
– How do you measure performance?
– Supervised vs. unsupervised situations
Efficiency
Context
– e.g., heights that are anomalous in general are normal for a
professional basketball team
Variants of Anomaly Detection Problems
Given a data set D, find all data points x ∈ D with
anomaly scores greater than some threshold t
Given a data set D, find all data points x ∈ D
having the top-n largest anomaly scores
Given a data set D, containing mostly normal (but
unlabeled) data points, and a test point x,
compute the anomaly score of x with respect to
D
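The three variants above reduce to simple operations on a list of scores. A minimal sketch, with made-up scores and a hand-picked threshold:

```python
# Hypothetical anomaly scores for five objects (illustrative values).
scores = {"x1": 0.2, "x2": 3.1, "x3": 0.5, "x4": 2.4, "x5": 0.1}

# Variant 1: all points with anomaly score greater than a threshold t.
t = 2.0
above_threshold = {x for x, s in scores.items() if s > t}

# Variant 2: the points with the top-n largest anomaly scores.
n = 2
top_n = sorted(scores, key=scores.get, reverse=True)[:n]

print(above_threshold, top_n)
```

The third variant (score a test point against a reference set D) is what the distance- and density-based techniques later in the chapter compute.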
Model-Based Anomaly Detection
Build a model for the data and see how well points conform to it
– Unsupervised
Anomalies are those points that don’t fit well
Anomalies are those points that distort the model
Examples:
– Statistical distribution
– Clusters
– Regression
– Geometric
– Graph
– Supervised
Anomalies are regarded as a rare class
Need to have training data
Additional Anomaly Detection Techniques
Proximity-based
– Anomalies are points far away from other points
– Can detect this graphically in some cases
Density-based
– Low density points are outliers
Pattern matching
– Create profiles or templates of atypical but important
events or objects
– Algorithms to detect these patterns are usually simple
and efficient
Visual Approaches
Boxplots or scatter plots
Limitations
– Not automatic
– Subjective
Statistical Approaches
Probabilistic definition of an outlier: An outlier is an object that
has a low probability with respect to a probability distribution
model of the data.
Usually assume a parametric model describing the distribution
of the data (e.g., normal distribution)
Apply a statistical test that depends on
– Data distribution
– Parameters of distribution (e.g., mean, variance)
– Number of expected outliers (confidence limit)
Issues
– Identifying the distribution of a data set
Heavy tailed distribution
– Number of attributes
– Is the data a mixture of distributions?
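A concrete sketch of the probabilistic definition: fit a normal distribution to a small sample and flag low-density points. The data values and the 0.05 density cutoff are illustrative choices, not part of the method itself.

```python
import math

# Fit a normal distribution to the sample; 15.0 is a planted outlier.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 15.0]
n = len(data)
mean = sum(data) / n
var = sum((x - mean) ** 2 for x in data) / (n - 1)  # sample variance
std = math.sqrt(var)

def density(x):
    """Probability density of x under the fitted normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / (std * math.sqrt(2 * math.pi))

# Probabilistic definition: an outlier has low probability under the model.
outliers = [x for x in data if density(x) < 0.05]
print(outliers)  # the planted outlier 15.0
# Note that the outlier itself inflates the estimated standard deviation:
# anomalies can distort the parameters of the fitted distribution.
```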
Normal Distributions
[Figure: probability density of a one-dimensional Gaussian and a two-dimensional Gaussian]
Grubbs' Test
Detect outliers in univariate data
Assume data comes from normal distribution
Detects one outlier at a time: remove the outlier
and repeat
– H0: There is no outlier in data
– HA: There is at least one outlier
Grubbs’ test statistic:

$$G = \frac{\max_i |X_i - \bar{X}|}{s}$$

where $\bar{X}$ and $s$ are the sample mean and standard deviation.

Reject H0 if:

$$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N-2+t^2_{\alpha/(2N),\,N-2}}}$$

where $t_{\alpha/(2N),\,N-2}$ is the critical value of the t distribution with $N-2$ degrees of freedom at significance level $\alpha/(2N)$.
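A stdlib-only sketch of computing the statistic G; the data values are illustrative, and the critical value from the rejection rule would come from a t table or a statistics library:

```python
from statistics import mean, stdev

def grubbs_statistic(data):
    """G = max |X_i - Xbar| / s, the Grubbs' test statistic."""
    m, s = mean(data), stdev(data)
    return max(abs(x - m) for x in data) / s

# Illustrative sample with one extreme value. The test is applied one
# outlier at a time: if G exceeds the critical value from the t
# distribution, drop the most extreme point and repeat on the rest.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 15.0]
G = grubbs_statistic(data)
print(round(G, 3))  # about 2.465
```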
Statistical-based – Likelihood Approach
Assume the data set D contains samples from a
mixture of two probability distributions:
– M (majority distribution)
– A (anomalous distribution)
General Approach:
– Initially, assume all the data points belong to M
– Let Lt(D) be the log likelihood of D at time t
– For each point xt that belongs to M, move it to
A
Let Lt+1 (D) be the new log likelihood.
Compute the difference, ∆ = Lt(D) – Lt+1 (D)
If ∆ > c (some threshold), then xt is declared as an anomaly
and moved permanently from M to A
Statistical-based – Likelihood Approach
Data distribution, D = (1 – λ) M + λ A
M is a probability distribution estimated from data
– Can be based on any modeling method (naïve Bayes,
maximum entropy, etc.)
A is initially assumed to be uniform distribution
Likelihood at time t:
$$L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)$$

$$LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$$
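A minimal sketch of this procedure, under simplifying assumptions: M is a normal distribution re-estimated from the points currently in M, A is uniform over the observed range, and λ and the threshold c are hand-picked. A move is kept when it changes the log likelihood by more than c, i.e. when |∆| exceeds the threshold:

```python
import math

def log_likelihood(M, A, lam, lo, hi):
    """LL of the data under D = (1 - lam) M + lam A, with M a normal
    distribution fit to the points in M and A uniform on [lo, hi]."""
    n = len(M)
    mu = sum(M) / n
    var = sum((x - mu) ** 2 for x in M) / n or 1e-12  # guard against var == 0

    def log_pm(x):
        return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

    log_pa = -math.log(hi - lo)  # uniform density over the observed range
    return (n * math.log(1 - lam) + sum(log_pm(x) for x in M)
            + len(A) * (math.log(lam) + log_pa))

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 15.0]  # 15.0 planted anomaly
lo, hi = min(data), max(data)
lam, c = 0.1, 1.0  # hand-picked mixture weight and threshold

M, A = list(data), []
for x in list(M):
    before = log_likelihood(M, A, lam, lo, hi)
    M.remove(x)
    A.append(x)
    if log_likelihood(M, A, lam, lo, hi) - before > c:
        continue  # the move improves the model enough: x stays in A
    A.remove(x)   # otherwise undo the move
    M.append(x)

print(A)  # points declared anomalous
```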
Strengths/Weaknesses of Statistical Approaches
Strengths:
– Firm mathematical foundation
– Can be very efficient
– Good results if distribution is known
Weaknesses:
– In many cases, the data distribution may not be known
– For high-dimensional data, it may be difficult to estimate the true distribution
– Anomalies can distort the parameters of the distribution
Distance-Based Approaches
Several different techniques
An object is an outlier if a specified fraction of the
objects is more than a specified distance away
(Knorr, Ng 1998)
– Some statistical definitions are special cases of this
The outlier score of an object is the distance to
its kth nearest neighbor
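The kth-nearest-neighbor score can be sketched in a few lines of plain Python; the 2-D data and the choice k = 2 are illustrative:

```python
# Outlier score of each point = distance to its k-th nearest neighbor.
def knn_score(points, k):
    scores = []
    for p in points:
        dists = sorted(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
                       for q in points if q is not p)
        scores.append(dists[k - 1])  # distance to the k-th nearest neighbor
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = knn_score(points, k=2)
print(scores)  # the isolated point (5, 5) gets by far the largest score
```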
One Nearest Neighbor – One Outlier
[Figure: outlier scores for data set D using distance to the first nearest neighbor; scores range from 0.4 to 2]
One Nearest Neighbor – Two Outliers
[Figure: outlier scores using distance to the first nearest neighbor, with a pair of outliers; scores range from 0.05 to 0.55]
Five Nearest Neighbors – Small Cluster
[Figure: outlier scores using distance to the fifth nearest neighbor, with a small cluster of outliers; scores range from 0.4 to 2]
Five Nearest Neighbors – Differing Density
[Figure: outlier scores using distance to the fifth nearest neighbor on data with regions of differing density; scores range from 0.2 to 1.8]
Strengths/Weaknesses of Distance-Based Approaches
Strengths:
– Simple
Weaknesses:
– Expensive – O(n²)
– Sensitive to parameters
– Sensitive to variations in density
– Distance becomes less meaningful in high-dimensional space
Density-Based Approaches
Density-based Outlier: The outlier score of an
object is the inverse of the density around the
object.
– Can be defined in terms of the k nearest neighbors
– One definition: Inverse of distance to kth neighbor
– Another definition: Inverse of the average distance to k
neighbors
– DBSCAN definition
If there are regions of different density, this
approach can have problems
Relative Density
Consider the density of a point relative to that of
its k nearest neighbors
Relative Density Outlier Scores
[Figure: relative density outlier scores for points labeled A, C, and D (scores 6.85, 1.33, 1.40)]
Density-based: LOF approach
For each point, compute the density of its local neighborhood
Compute the local outlier factor (LOF) of a sample p as the average of
the ratios of the density of p's nearest neighbors to the density of p
itself
Outliers are points with largest LOF value
[Figure: points p1 and p2]
In the nearest-neighbor approach, p2 is not considered an outlier,
while the LOF approach finds both p1 and p2 as outliers.
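A simplified relative-density score in the spirit of LOF, not the exact reachability-distance formulation of the full LOF algorithm: density = k / (sum of distances to the k nearest neighbors), and the score of a point is the average density of its neighbors divided by its own density. Scores near 1 mean the point is about as dense as its neighborhood; much larger scores flag outliers.

```python
def relative_density_scores(points, k):
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def neighbors(i):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: dist(points[i], points[j]))
        return order[:k]

    # Density of each point: k / (sum of distances to its k neighbors).
    density = [k / sum(dist(points[i], points[j]) for j in neighbors(i))
               for i in range(len(points))]
    # Score: average neighbor density divided by the point's own density.
    return [sum(density[j] for j in neighbors(i)) / k / density[i]
            for i in range(len(points))]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = relative_density_scores(points, k=2)
print(scores)  # the isolated point (5, 5) gets the largest score
```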
Strengths/Weaknesses of Density-Based Approaches
Strengths:
– Simple
Weaknesses:
– Expensive – O(n²)
– Sensitive to parameters
– Density becomes less meaningful in high-dimensional space
Clustering-Based Approaches
Clustering-based Outlier: An
object is a cluster-based outlier if
it does not strongly belong to any
cluster
– For prototype-based clusters, an
object is an outlier if it is not close
enough to a cluster center
– For density-based clusters, an object
is an outlier if its density is too low
– For graph-based clusters, an object
is an outlier if it is not well connected
Other issues include the impact of
outliers on the clusters and the
number of clusters
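A sketch of the prototype-based case, where the outlier score of a point is its distance to the closest cluster centroid. The clusters here are fixed by hand; in practice they would come from a clustering algorithm such as k-means.

```python
# Hand-assigned clusters (illustrative); centroids are coordinate means.
clusters = {
    "c1": [(0, 0), (0, 1), (1, 0), (1, 1)],
    "c2": [(8, 8), (8, 9), (9, 8)],
}
centroids = {name: (sum(p[0] for p in pts) / len(pts),
                    sum(p[1] for p in pts) / len(pts))
             for name, pts in clusters.items()}

def score(p):
    """Outlier score: distance from p to the closest centroid."""
    return min(((p[0] - cx) ** 2 + (p[1] - cy) ** 2) ** 0.5
               for cx, cy in centroids.values())

print(score((0.4, 0.6)))  # near the first centroid: small score
print(score((4, 4)))      # between the clusters: large score
```

A relative version, as in the next figure, divides this distance by a typical within-cluster distance, which compensates for clusters of different spread.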
Distance of Points from Closest Centroids
[Figure: outlier score = distance to the closest centroid; labeled points D, C, and A have scores 1.2, 0.17, and 4.6 (color scale 0.5–4.5)]
Relative Distance of Points from Closest Centroid
[Figure: outlier score = relative distance to the closest centroid (color scale 0.5–4)]
Strengths/Weaknesses of Clustering-Based Approaches
Strengths:
– Simple
– Many clustering techniques can be used
Weaknesses:
– Can be difficult to decide on a clustering technique
– Can be difficult to decide on the number of clusters
– Outliers can distort the clusters
Anomaly/Outlier Detection
Importance of Anomaly Detection
Causes of Anomalies
Distinction Between Noise and Anomalies
General Issues: Number of Attributes
General Issues: Anomaly Scoring
Other Issues for Anomaly Detection
Variants of Anomaly Detection Problems
Model-Based Anomaly Detection
Additional Anomaly Detection Techniques
Visual Approaches
Statistical Approaches
Normal Distributions
Grubbs’ Test
Statistical-based – Likelihood Approach
Statistical-based – Likelihood Approach (cont.)
Distance-Based Approaches
One Nearest Neighbor – One Outlier
One Nearest Neighbor – Two Outliers
Five Nearest Neighbors – Small Cluster
Five Nearest Neighbors – Differing Density
Density-Based Approaches
Relative Density
Relative Density Outlier Scores
Density-based: LOF approach
Clustering-Based Approaches
Distance of Points from Closest Centroids
Relative Distance of Points from Closest Centroid
Strengths/Weaknesses of Clustering-Based Approaches