Main Discussion

 


Compare and contrast the different techniques for anomaly detection that were presented in Chapter 9.

Discuss techniques for combining multiple anomaly detection techniques to improve the identification of anomalous objects.

  • Anomaly Detection
  • Lecture Notes for Chapter 9

    Introduction to Data Mining, 2nd Edition
    by
    Tan, Steinbach, Karpatne, Kumar

    11/20/2020 Introduction to Data Mining, 2nd Edition 1

  • Anomaly/Outlier Detection
  •  What are anomalies/outliers?
    – The set of data points that are

    considerably different than the
    remainder of the data

     Natural implication is that anomalies are
    relatively rare
    – One in a thousand occurs often if you have lots of data
    – Context is important, e.g., freezing temps in July

     Can be important or a nuisance
    – A 10-foot-tall 2-year-old
    – Unusually high blood pressure


  • Importance of Anomaly Detection
  • Ozone Depletion History
     In 1985 three researchers (Farman,

    Gardiner and Shanklin) were
    puzzled by data gathered by the
    British Antarctic Survey showing that
    ozone levels for Antarctica had
    dropped 10% below normal levels

     Why did the Nimbus 7 satellite,
    which had instruments aboard for
    recording ozone levels, not record
    similarly low ozone concentrations?

     The ozone concentrations recorded
    by the satellite were so low they
    were being treated as outliers by a
    computer program and discarded!

    Sources:

    http://exploringdata.cqu.edu.au/ozone.html
    http://www.epa.gov/ozone/science/hole/size.html


  • Causes of Anomalies
  •  Data from different classes
    – Measuring the weights of oranges, but a few grapefruit

    are mixed in

     Natural variation
    – Unusually tall people

     Data errors
    – A 200-pound 2-year-old


  • Distinction Between Noise and Anomalies
  •  Noise is erroneous, perhaps random, values or
    contaminating objects
    – Weight recorded incorrectly

    – Grapefruit mixed in with the oranges

     Noise doesn’t necessarily produce unusual values or
    objects

     Noise is not interesting
     Anomalies may be interesting if they are not a result of

    noise
     Noise and anomalies are related but distinct concepts


  • General Issues: Number of Attributes
  •  Many anomalies are defined in terms of a single attribute
    – Height
    – Shape
    – Color

     Can be hard to find an anomaly using all attributes
    – Noisy or irrelevant attributes
    – Object is only anomalous with respect to some attributes

     However, an object may not be anomalous in any one
    attribute


  • General Issues: Anomaly Scoring
  •  Many anomaly detection techniques provide only a binary
    categorization
    – An object is an anomaly or it isn’t
    – This is especially true of classification-based approaches

     Other approaches assign a score to all points
    – This score measures the degree to which an object is an anomaly
    – This allows objects to be ranked

     In the end, you often need a binary decision
    – Should this credit card transaction be flagged?
    – Still useful to have a score

     How many anomalies are there?

  • Other Issues for Anomaly Detection
  •  Find all anomalies at once or one at a time
    – Swamping
    – Masking

     Evaluation
    – How do you measure performance?
    – Supervised vs. unsupervised situations

     Efficiency

     Context
    – Professional basketball team


  • Variants of Anomaly Detection Problems
  •  Given a data set D, find all data points x ∈ D with
    anomaly scores greater than some threshold t

     Given a data set D, find all data points x ∈ D
    having the top-n largest anomaly scores

     Given a data set D, containing mostly normal (but
    unlabeled) data points, and a test point x,
    compute the anomaly score of x with respect to

    D


  • Model-Based Anomaly Detection
  •  Build a model for the data and see how well points fit it
    – Unsupervised

     Anomalies are those points that don’t fit well
     Anomalies are those points that distort the model
     Examples:

    – Statistical distribution
    – Clusters
    – Regression
    – Geometric
    – Graph

    – Supervised
     Anomalies are regarded as a rare class
     Need to have training data


  • Additional Anomaly Detection Techniques
  •  Proximity-based
    – Anomalies are points far away from other points
    – Can detect this graphically in some cases

     Density-based
    – Low density points are outliers

     Pattern matching
    – Create profiles or templates of atypical but important

    events or objects
    – Algorithms to detect these patterns are usually simple

    and efficient


  • Visual Approaches
  •  Boxplots or scatter plots

     Limitations
    – Not automatic
    – Subjective


  • Statistical Approaches
  • Probabilistic definition of an outlier: An outlier is an object that
    has a low probability with respect to a probability distribution
    model of the data.

     Usually assume a parametric model describing the distribution
    of the data (e.g., normal distribution)

     Apply a statistical test that depends on
    – Data distribution
    – Parameters of distribution (e.g., mean, variance)
    – Number of expected outliers (confidence limit)

     Issues
    – Identifying the distribution of a data set

     Heavy tailed distribution
    – Number of attributes
    – Is the data a mixture of distributions?

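The probabilistic definition above can be made concrete with a short sketch (illustrative only; the function name, cutoff, and sample values are invented here, and a normal distribution is assumed as the parametric model): points far from the mean in standard-deviation units have low probability under the fitted Gaussian.

```python
import numpy as np

def gaussian_outliers(data, z_cutoff=3.0):
    """Flag points with low probability under a fitted normal model,
    operationalized as |z-score| > z_cutoff."""
    data = np.asarray(data, dtype=float)
    mean, std = data.mean(), data.std(ddof=1)
    z = np.abs(data - mean) / std
    return z > z_cutoff

# Six typical measurements and one suspicious one.
values = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0])
mask = gaussian_outliers(values, z_cutoff=2.0)  # only 25.0 is flagged
```

Note that the outlier itself inflates the estimated mean and standard deviation, which is exactly the "anomalies can distort the parameters of the distribution" weakness listed for statistical approaches.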

  • Normal Distributions
  • One-dimensional Gaussian

    Two-dimensional Gaussian

    [Figure: probability density of a one-dimensional Gaussian over x in about -4 to 4, and a two-dimensional Gaussian over x and y, with density values up to about 0.1]

  • Grubbs’ Test
  •  Detect outliers in univariate data
     Assume data comes from normal distribution
     Detects one outlier at a time, remove the outlier,

    and repeat
    – H0: There is no outlier in data
    – HA: There is at least one outlier

     Grubbs’ test statistic:

        G = \frac{\max_i |X_i - \bar{X}|}{s}

     Reject H0 if:

        G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N-2+t^2_{\alpha/(2N),\,N-2}}}
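The statistic is straightforward to compute; a minimal sketch (the function and sample values are illustrative, and the critical-value comparison, which needs t-distribution quantiles, is left as a comment):

```python
import numpy as np

def grubbs_statistic(data):
    """Grubbs' test statistic G = max_i |X_i - mean| / s,
    with s the sample standard deviation (ddof=1)."""
    data = np.asarray(data, dtype=float)
    return np.abs(data - data.mean()).max() / data.std(ddof=1)

# One-at-a-time use: compute G and compare it against the critical value
# ((N-1)/sqrt(N)) * sqrt(t^2 / (N-2+t^2)) with t = t_{alpha/(2N), N-2};
# if G exceeds it, drop the most extreme point and repeat on the rest.
sample = np.array([10.0, 10.2, 9.9, 10.1, 9.8, 14.0])
G = grubbs_statistic(sample)  # about 2.03 here
```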

  • Statistical-based – Likelihood Approach
  •  Assume the data set D contains samples from a
    mixture of two probability distributions:
    – M (majority distribution)
    – A (anomalous distribution)

     General Approach:
    – Initially, assume all the data points belong to M
    – Let Lt(D) be the log likelihood of D at time t
    – For each point xt that belongs to M, move it to

    A

     Let Lt+1 (D) be the new log likelihood.
     Compute the difference, ∆ = Lt(D) – Lt+1 (D)
     If ∆ > c (some threshold), then xt is declared as an anomaly
    and moved permanently from M to A


    Statistical-based – Likelihood Approach

     Data distribution, D = (1 – λ) M + λ A
     M is a probability distribution estimated from data

    – Can be based on any modeling method (naïve Bayes,
    maximum entropy, etc)

     A is initially assumed to be uniform distribution
     Likelihood at time t:

        L_t(D) = \prod_{i=1}^{N} P_D(x_i)
               = \Big( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \Big)
                 \Big( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \Big)

        LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i)
                + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)
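A toy version of the procedure can be sketched as follows. Assumptions not in the slides: M is refit as a univariate Gaussian over its current members, A is uniform over the observed range, λ = 0.05, and the test here declares an anomaly when moving the candidate point to A increases the log likelihood by more than the threshold c.

```python
import numpy as np

def likelihood_anomalies(data, lam=0.05, c=1.0):
    """Declare x_i an anomaly if moving it from M to A raises the
    log likelihood LL(D) by more than the threshold c."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    p_a = 1.0 / (data.max() - data.min())  # uniform density for A

    def log_likelihood(in_m):
        m = data[in_m]
        mu, sigma = m.mean(), m.std(ddof=1)
        log_pm = -0.5 * np.log(2 * np.pi * sigma**2) - (m - mu)**2 / (2 * sigma**2)
        a_count = n - len(m)
        return (len(m) * np.log(1 - lam) + log_pm.sum()
                + a_count * (np.log(lam) + np.log(p_a)))

    in_m = np.ones(n, dtype=bool)
    for i in range(n):
        ll_before = log_likelihood(in_m)
        in_m[i] = False                       # tentatively move x_i to A
        if log_likelihood(in_m) - ll_before > c:
            continue                          # keep x_i in A permanently
        in_m[i] = True                        # otherwise move it back to M
    return ~in_m

data = np.array([5.0, 5.1, 4.9, 5.2, 5.0, 4.8, 50.0])
anoms = likelihood_anomalies(data)  # only the 50.0 is moved to A
```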

    Strengths/Weaknesses of Statistical Approaches

     Firm mathematical foundation

     Can be very efficient

     Good results if distribution is known

     In many cases, data distribution may not be known

     For high dimensional data, it may be difficult to estimate
    the true distribution

     Anomalies can distort the parameters of the distribution


  • Distance-Based Approaches
  •  Several different techniques

     An object is an outlier if a specified fraction of the
    objects is more than a specified distance away
    (Knorr, Ng 1998)
    – Some statistical definitions are special cases of this

     The outlier score of an object is the distance to
    its kth nearest neighbor

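The kth-nearest-neighbor definition is easy to sketch by brute force (the function and points are illustrative; the full distance matrix reflects the O(n²) cost noted in the strengths/weaknesses):

```python
import numpy as np

def knn_outlier_scores(points, k=1):
    """Outlier score of each point = distance to its k-th nearest neighbor."""
    pts = np.asarray(points, dtype=float)
    d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    return np.sort(d, axis=1)[:, k - 1]

pts = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
scores = knn_outlier_scores(pts, k=1)      # the isolated point scores highest
```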

  • One Nearest Neighbor – One Outlier
  • [Figure: 2-D data set D with one outlier; points shaded by outlier score (distance to nearest neighbor) on a scale from 0.4 to 2]

  • One Nearest Neighbor – Two Outliers
  • [Figure: 2-D data set D with two outliers; nearest-neighbor outlier scores on a scale from 0.05 to 0.55]

  • Five Nearest Neighbors – Small Cluster
  • [Figure: 2-D data set D with a small cluster; outlier score = distance to fifth nearest neighbor, on a scale from 0.4 to 2]

  • Five Nearest Neighbors – Differing Density
  • [Figure: 2-D data set D with clusters of differing density; outlier score = distance to fifth nearest neighbor, on a scale from 0.2 to 1.8]

    Strengths/Weaknesses of Distance-Based Approaches

     Simple

     Expensive – O(n2)

     Sensitive to parameters

     Sensitive to variations in density

     Distance becomes less meaningful in high-
    dimensional space


  • Density-Based Approaches
  •  Density-based Outlier: The outlier score of an
    object is the inverse of the density around the
    object.
    – Can be defined in terms of the k nearest neighbors
    – One definition: Inverse of distance to kth neighbor
    – Another definition: Inverse of the average distance to k

    neighbors
    – DBSCAN definition

     If there are regions of different density, this
    approach can have problems


  • Relative Density
  •  Consider the density of a point relative to that of
    its k nearest neighbors

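One way to operationalize this idea (an illustrative sketch; it uses the inverse of the average distance to the k nearest neighbors as the density, one of the definitions listed above):

```python
import numpy as np

def relative_density_scores(points, k=2):
    """Score = (average density of a point's k nearest neighbors) /
    (density of the point), with density = 1 / avg distance to k neighbors.
    Scores well above 1 mark points much sparser than their neighborhood."""
    pts = np.asarray(points, dtype=float)
    d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]                      # k nearest neighbors
    density = 1.0 / np.take_along_axis(d, nn, axis=1).mean(axis=1)
    return density[nn].mean(axis=1) / density

pts = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5]]
scores = relative_density_scores(pts, k=2)   # cluster points score about 1
```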

  • Relative Density Outlier Scores
  • [Figure: relative density outlier scores for a 2-D data set on a color scale from 1 to 6, with annotated scores 6.85, 1.33, and 1.40 at points labeled A, C, and D]

  • Density-based: LOF approach
  •  For each point, compute the density of its local neighborhood
     Compute the local outlier factor (LOF) of a sample p as the
    average of the ratios of the densities of p's nearest neighbors
    to the density of p

     Outliers are points with largest LOF value

    [Figure: points p1 and p2 plotted against clusters of differing density]

    In the NN approach, p2 is not considered an outlier, while the
    LOF approach identifies both p1 and p2 as outliers.
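If scikit-learn is available, its LocalOutlierFactor class implements this approach directly (the data here is illustrative): fit_predict labels outliers -1 and inliers 1, and negative_outlier_factor_ holds the negated LOF scores.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor  # assumes scikit-learn is installed

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],    # dense unit-square cluster
              [10, 10]])                         # isolated point
lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(X)                      # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_           # larger LOF = more anomalous
```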

    Strengths/Weaknesses of Density-Based Approaches

     Simple
     Expensive – O(n2)
     Sensitive to parameters

     Density becomes less meaningful in high-
    dimensional space


  • Clustering-Based Approaches
  •  Clustering-based Outlier: An
    object is a cluster-based outlier if
    it does not strongly belong to any
    cluster
    – For prototype-based clusters, an

    object is an outlier if it is not close
    enough to a cluster center

    – For density-based clusters, an object
    is an outlier if its density is too low

    – For graph-based clusters, an object
    is an outlier if it is not well connected

     Other issues include the impact of
    outliers on the clusters and the
    number of clusters

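The prototype-based variant can be sketched with a small hand-rolled k-means (Lloyd's algorithm with deterministic first-k seeding; the function and data are illustrative): score each point by its distance to the closest centroid.

```python
import numpy as np

def kmeans_outlier_scores(points, k=2, iters=10):
    """Cluster with basic Lloyd iterations, then score each point
    by its Euclidean distance to the closest centroid."""
    pts = np.asarray(points, dtype=float)
    centroids = pts[:k].copy()                    # simple deterministic seeding
    for _ in range(iters):
        d = np.linalg.norm(pts[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):                        # recompute centroids
            if (labels == j).any():
                centroids[j] = pts[labels == j].mean(axis=0)
    d = np.linalg.norm(pts[:, None] - centroids[None], axis=2)
    return d.min(axis=1)                          # distance to closest centroid

pts = [[0, 0], [0, 1], [1, 0], [1, 1],
       [8, 8], [8, 9], [9, 8], [9, 9],
       [4, 20]]
scores = kmeans_outlier_scores(pts, k=2)          # the stray point scores highest
```

In this example the stray point also pulls its cluster's centroid toward itself, illustrating the issue that outliers can distort the clusters.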

  • Distance of Points from Closest Centroids
  • [Figure: distance of each point from its closest centroid as outlier score, on a scale from 0.5 to 4.5, with annotated scores 1.2, 0.17, and 4.6 at points labeled D, C, and A]

  • Relative Distance of Points from Closest Centroid
  • [Figure: relative distance of points from their closest centroid as outlier score, on a scale from 0.5 to 4]

    Strengths/Weaknesses of Clustering-Based Approaches
     Simple

     Many clustering techniques can be used

     Can be difficult to decide on a clustering
    technique

     Can be difficult to decide on number of clusters

     Outliers can distort the clusters


