Data Mining Project
Project: Need it in 20 hours
You have been asked by management (manufacturing, healthcare, retail, financial, and etc. ) to create a demo using a data analytic or BI tool. It is your responsibility to download and produce outputs using one of the tools. You will need to focus your results on the data set you select.
Ensure to address at least one topic covered in Chapters 15 with the outputs. The paper should include the following as Header sections.
Introduction
History of Tool [Discuss the benefits and limitations]
Review of the Data [What are you reviewing?]
Exploring the Data with the tool
Classifications Basic Concepts and Decision Trees
Classifications Alternative Techniques
Summary of Results
References (minimum of 2 – each must be cited at least once (of course)).
Ensure you adhere to APA7
Types of Data Analytic Tools
https://www.octoparse.com/blog/top30bigdatatoolsfordataanalysis/
Excel with Solver, but has limitations
R Studio
Tableau Public has a free trial
Microsoft Power BI
Search for others with trial options
Examples of Dataset
https://www.forbes.com/sites/bernardmarr/2016/02/12/bigdata35brilliantandfreedatasourcesfor2016/#4b3e96f1b54d
Discussion : Need in 12 hours:
This week, I want you to post the particular data analytics tool that you used for the midterm project. Provide the benefits and limitations of the tool.
Replies and Explanation: Need in 2 days
You need to make the initial post and also respond to two of your peers. You’ll also need to go back and do the midterm project and explain why no one told me it wasn’t available :o
Please include the Data Analytic Vendor Name and URL. It will also be helpful to list the pricing and trial options.
Data Mining
Classification: Alternative Techniques
Lecture Notes for Chapter 5
Introduction to Data Mining
by
Tan, Steinbach, Kumar
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 *
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
RuleBased Classifier
Classify records by using a collection of “if…then…” rules
Rule: (Condition) y
where
Condition is a conjunctions of attributes
y is the class label
LHS: rule antecedent or condition
RHS: rule consequent
Examples of classification rules:
(Blood Type=Warm) (Lay Eggs=Yes) Birds
(Taxable Income < 50K) (Refund=Yes) Evade=No
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Rulebased Classifier (Example)
R1: (Give Birth = no) (Can Fly = yes) Birds
R2: (Give Birth = no) (Live in Water = yes) Fishes
R3: (Give Birth = yes) (Blood Type = warm) Mammals
R4: (Give Birth = no) (Can Fly = no) Reptiles
R5: (Live in Water = sometimes) Amphibians
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Application of RuleBased Classifier
A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule
R1: (Give Birth = no) (Can Fly = yes) Birds
R2: (Give Birth = no) (Live in Water = yes) Fishes
R3: (Give Birth = yes) (Blood Type = warm) Mammals
R4: (Give Birth = no) (Can Fly = no) Reptiles
R5: (Live in Water = sometimes) Amphibians
The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Rule Coverage and Accuracy
Coverage of a rule:
Fraction of records that satisfy the antecedent of a rule
Accuracy of a rule:
Fraction of records that satisfy both the antecedent and consequent of a rule
(Status=Single) No
Coverage = 40%, Accuracy = 50%
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Tid
Refund
Marital
Status
Taxable
Income
Class
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced
95K
Yes
6
No
Married
60K
No
7
Yes
Divorced
220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
10
How does Rulebased Classifier Work?
R1: (Give Birth = no) (Can Fly = yes) Birds
R2: (Give Birth = no) (Live in Water = yes) Fishes
R3: (Give Birth = yes) (Blood Type = warm) Mammals
R4: (Give Birth = no) (Can Fly = no) Reptiles
R5: (Live in Water = sometimes) Amphibians
A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Characteristics of RuleBased Classifier
Mutually exclusive rules
Classifier contains mutually exclusive rules if the rules are independent of each other
Every record is covered by at most one rule
Exhaustive rules
Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
Each record is covered by at least one rule
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
From Decision Trees To Rules
Rules are mutually exclusive and exhaustive
Rule set contains as much information as the tree
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Rules Can Be Simplified
Initial Rule: (Refund=No) (Status=Married) No
Simplified Rule: (Status=Married) No
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Tid
Refund
Marital
Status
Taxable
Income
Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced
95K
Yes
6
No
Married
60K
No
7
Yes
Divorced
220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
10
Effect of Rule Simplification
Rules are no longer mutually exclusive
A record may trigger more than one rule
Solution?
Ordered rule set
Unordered rule set – use voting schemes
Rules are no longer exhaustive
A record may not trigger any rules
Solution?
Use a default class
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Ordered Rule Set
Rules are rank ordered according to their priority
An ordered rule set is known as a decision list
When a test record is presented to the classifier
It is assigned to the class label of the highest ranked rule it has triggered
If none of the rules fired, it is assigned to the default class
R1: (Give Birth = no) (Can Fly = yes) Birds
R2: (Give Birth = no) (Live in Water = yes) Fishes
R3: (Give Birth = yes) (Blood Type = warm) Mammals
R4: (Give Birth = no) (Can Fly = no) Reptiles
R5: (Live in Water = sometimes) Amphibians
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Rule Ordering Schemes
Rulebased ordering
Individual rules are ranked based on their quality
Classbased ordering
Rules that belong to the same class appear together
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Rulebased Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
�
Classbased Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
�
Building Classification Rules
Direct Method:
Extract rules directly from data
e.g.: RIPPER, CN2, Holte’s 1R
Indirect Method:
Extract rules from other classification models (e.g.
decision trees, neural networks, etc).
e.g: C4.5rules
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Direct Method: Sequential Covering
Start from an empty rule
Grow a rule using the LearnOneRule function
Remove training records covered by the rule
Repeat Step (2) and (3) until stopping criterion is met
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Example of Sequential Covering
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Example of Sequential Covering…
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Aspects of Sequential Covering
Rule Growing
Instance Elimination
Rule Evaluation
Stopping Criterion
Rule Pruning
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Rule Growing
Two common strategies
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(a) Generaltospecific�
Refund= No�
Status = Single�
Status = Divorced�
Status = Married�
Income �> 80K �
…�
{ }�
Yes: 0
No: 3�
Yes: 3
No: 4�
Yes: 3
No: 4�
Yes: 2
No: 1�
Yes: 1
No: 0�
Yes: 3
No: 1�
Refund=No,�Status=Single,�Income=85K
(Class=Yes)�
Refund=No,�Status=Single,�Income=90K
(Class=Yes)�
Refund=No,�Status = Single�(Class = Yes)�
(b) Specifictogeneral�
Rule Growing (Examples)
CN2 Algorithm:
Start from an empty conjunct: {}
Add conjuncts that minimizes the entropy measure: {A}, {A,B}, …
Determine the rule consequent by taking majority class of instances covered by the rule
RIPPER Algorithm:
Start from an empty rule: {} => class
Add conjuncts that maximizes FOIL’s information gain measure:
R0: {} => class (initial rule)
R1: {A} => class (rule after adding conjunct)
Gain(R0, R1) = t [ log (p1/(p1+n1)) – log (p0/(p0 + n0)) ]
where t: number of positive instances covered by both R0 and R1
p0: number of positive instances covered by R0
n0: number of negative instances covered by R0
p1: number of positive instances covered by R1
n1: number of negative instances covered by R1
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Instance Elimination
Why do we need to eliminate instances?
Otherwise, the next rule is identical to previous rule
Why do we remove positive instances?
Ensure that the next rule is different
Why do we remove negative instances?
Prevent underestimating accuracy of rule
Compare rules R2 and R3 in the diagram
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
�
class = +�
class = �
+�
+�
+�
+�
+�
+�
+�
+�
+�
+�
+�
+�
+�
+�
+�
+�
+�
+�
+�
+�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
+�
�
�
�
�
�
�
�
+�
+�
+�
+�
+�
+�
+�
R1�
R3�
R2�
+�
Rule Evaluation
Metrics:
Accuracy
Laplace
Mestimate
n : Number of instances covered by rule
nc : Number of instances covered by rule
k : Number of classes
p : Prior probability
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Stopping Criterion and Rule Pruning
Stopping criterion
Compute the gain
If gain is not significant, discard the new rule
Rule Pruning
Similar to postpruning of decision trees
Reduced Error Pruning:
Remove one of the conjuncts in the rule
Compare error rate on validation set before and after pruning
If error improves, prune the conjunct
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Summary of Direct Method
Grow a single rule
Remove Instances from rule
Prune the rule (if necessary)
Add rule to Current Rule Set
Repeat
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Direct Method: RIPPER
For 2class problem, choose one of the classes as positive class, and the other as negative class
Learn rules for positive class
Negative class will be default class
For multiclass problem
Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
Learn the rule set for smallest class first, treat the rest as negative class
Repeat with next smallest class as positive class
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Direct Method: RIPPER
Growing a rule:
Start from empty rule
Add conjuncts as long as they improve FOIL’s information gain
Stop when rule no longer covers negative examples
Prune the rule immediately using incremental reduced error pruning
Measure for pruning: v = (pn)/(p+n)
p: number of positive examples covered by the rule in
the validation set
n: number of negative examples covered by the rule in
the validation set
Pruning method: delete any final sequence of conditions that maximizes v
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Direct Method: RIPPER
Building a Rule Set:
Use sequential covering algorithm
Finds the best rule that covers the current set of positive examples
Eliminate both positive and negative examples covered by the rule
Each time a rule is added to the rule set, compute the new description length
stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Direct Method: RIPPER
Optimize the rule set:
For each rule r in the rule set R
Consider 2 alternative rules:
Replacement rule (r*): grow new rule from scratch
Revised rule(r’): add conjuncts to extend the rule r
Compare the rule set for r against the rule set for r*
and r’
Choose rule set that minimizes MDL principle
Repeat rule generation and rule optimization for the remaining positive examples
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Indirect Methods
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
P�
Rule Set
r1: (P=No,Q=No) ==> –
r2: (P=No,Q=Yes) ==> +
r3: (P=Yes,R=No) ==> +
r4: (P=Yes,R=Yes,Q=No) ==> –
r5: (P=Yes,R=Yes,Q=Yes) ==> +�
Q�
R�
Q�
�
+�
+�
�
+�
No�
No�
No�
No�
Yes�
Yes�
Yes�
Yes�
Indirect Method: C4.5rules
Extract rules from an unpruned decision tree
For each rule, r: A y,
consider an alternative rule r’: A’ y where A’ is obtained by removing one of the conjuncts in A
Compare the pessimistic error rate for r against all r’s
Prune if one of the r’s has lower pessimistic error rate
Repeat until we can no longer improve generalization error
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Indirect Method: C4.5rules
Instead of ordering the rules, order subsets of rules (class ordering)
Each subset is a collection of rules with the same rule consequent (class)
Compute description length of each subset
Description length = L(error) + g L(model)
g is a parameter that takes into account the presence of redundant attributes in a rule set
(default value = 0.5)
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Example
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
animals2
Name Give Birth Lay Eggs Can Fly Live in Water Have Legs Class
human yes no no no yes mammals
python no yes no no no reptiles
salmon no yes no yes no fishes
whale yes no no yes no mammals
frog no yes no sometimes yes amphibians
komodo no yes no no yes reptiles
bat yes no yes no yes mammals
pigeon no yes yes no yes birds
cat yes no no no yes mammals
leopard shark yes no no yes no fishes
turtle no yes no sometimes yes reptiles
penguin no yes no sometimes yes birds
porcupine yes no no no yes mammals
eel no yes no yes no fishes
salamander no yes no sometimes yes amphibians
gila monster no yes no no yes reptiles
platypus no yes no no yes mammals
owl no yes yes no yes birds
dolphin yes no no yes no mammals
eagle no yes yes no yes birds
C4.5 versus C4.5rules versus RIPPER
C4.5rules:
(Give Birth=No, Can Fly=Yes) Birds
(Give Birth=No, Live in Water=Yes) Fishes
(Give Birth=Yes) Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) Reptiles
( ) Amphibians
RIPPER:
(Live in Water=Yes) Fishes
(Have Legs=No) Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No)
Reptiles
(Can Fly=Yes,Give Birth=No) Birds
() Mammals
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
C4.5 versus C4.5rules versus RIPPER
C4.5 and C4.5rules:
RIPPER:
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Sheet1
Amphibians Fishes Reptiles Birds Mammals
ACTUAL Amphibians 0 0 0 0 2
CLASS Fishes 0 3 0 0 0
Reptiles 0 0 3 0 1
Birds 0 0 1 2 1
Mammals 0 2 1 0 4
C4.5 Amphibians Fishes Reptiles Birds Mammals
Amphibians 2 0 0 0 0
Fishes 0 2 0 0 1
Reptiles 1 0 3 0 0
Birds 1 0 0 3 0
Mammals 0 0 1 0 6
C4.5rules Amphibians Fishes Reptiles Birds Mammals
Amphibians 2 0 0 0 0
Fishes 0 2 0 0 1
Reptiles 1 0 3 0 0
Birds 1 0 0 3 0
Mammals 0 0 1 0 6
Sheet2
Sheet3
Sheet1
RIPPER: Amphibians Fishes Reptiles Birds Mammals
Amphibians 0 0 0 0 2
Fishes 0 3 0 0 0
Reptiles 0 0 3 0 1
Birds 0 0 1 2 1
Mammals 0 2 1 0 4
PREDICTED CLASS
Amphibians Fishes Reptiles Birds Mammals
ACTUAL Amphibians 2 0 0 0 0
CLASS Fishes 0 2 0 0 1
Reptiles 1 0 3 0 0
Birds 1 0 0 3 0
Mammals 0 0 1 0 6
C4.5rules Amphibians Fishes Reptiles Birds Mammals
Amphibians 2 0 0 0 0
Fishes 0 2 0 0 1
Reptiles 1 0 3 0 0
Birds 1 0 0 3 0
Mammals 0 0 1 0 6
Sheet2
Sheet3
Advantages of RuleBased Classifiers
As highly expressive as decision trees
Easy to interpret
Easy to generate
Can classify new instances rapidly
Performance comparable to decision trees
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
InstanceBased Classifiers
Store the training records
Use training records to
predict the class label of
unseen cases
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Instance Based Classifiers
Examples:
Rotelearner
Memorizes entire training data and performs classification only if attributes of record match one of the training examples exactly
Nearest neighbor
Uses k “closest” points (nearest neighbors) for performing classification
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Nearest Neighbor Classifiers
Basic idea:
If it walks like a duck, quacks like a duck, then it’s probably a duck
Training Records
Test Record
Compute Distance
Choose k of the “nearest” records
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
NearestNeighbor Classifiers
Requires three things
The set of stored records
Distance Metric to compute distance between records
The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
Compute distance to other training records
Identify k nearest neighbors
Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Unknown record�
�
Definition of Nearest Neighbor
Knearest neighbors of a record x are data points that have the k smallest distance to x
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
1 nearestneighbor
Voronoi Diagram
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Nearest Neighbor Classification
Compute distance between two points:
Euclidean distance
Determine the class from nearest neighbor list
take the majority vote of class labels among the knearest neighbors
Weigh the vote according to distance
weight factor, w = 1/d2
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Nearest Neighbor Classification…
Choosing the value of k:
If k is too small, sensitive to noise points
If k is too large, neighborhood may include points from other classes
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
�
X�
Nearest Neighbor Classification…
Scaling issues
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
Example:
height of a person may vary from 1.5m to 1.8m
weight of a person may vary from 90lb to 300lb
income of a person may vary from $10K to $1M
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Nearest Neighbor Classification…
Problem with Euclidean measure:
High dimensional data
curse of dimensionality
Can produce counterintuitive results
1 1 1 1 1 1 1 1 1 1 1 0
0 1 1 1 1 1 1 1 1 1 1 1
1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1
vs
d = 1.4142
d = 1.4142
Solution: Normalize the vectors to unit length
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Nearest neighbor Classification…
kNN classifiers are lazy learners
It does not build models explicitly
Unlike eager learners such as decision tree induction and rulebased systems
Classifying unknown records are relatively expensive
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Example: PEBLS
PEBLS: Parallel ExamplarBased Learning System (Cost & Salzberg)
Works with both continuous and nominal features
For nominal features, distance between two nominal values is computed using modified value difference metric (MVDM)
Each record is assigned a weight factor
Number of nearest neighbor, k = 1
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Example: PEBLS
Distance between nominal attribute values:
d(Single,Married)
=  2/4 – 0/4  +  2/4 – 4/4  = 1
d(Single,Divorced)
=  2/4 – 1/2  +  2/4 – 1/2  = 0
d(Married,Divorced)
=  0/4 – 1/2  +  4/4 – 1/2  = 1
d(Refund=Yes,Refund=No)
=  0/3 – 3/7  +  3/3 – 4/7  = 6/7
Class Marital Status
Single Married Divorced
Yes 2 0 1
No 2 4 1
Class Refund
Yes No
Yes 0 3
No 3 4
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Tid
Refund
Marital
Status
Taxable
Income
Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced
95K
Yes
6
No
Married
60K
No
7
Yes
Divorced
220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
10
Example: PEBLS
Distance between record X and record Y:
where:
wX 1 if X makes accurate prediction most of the time
wX > 1 if X is not reliable for making predictions
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Tid
Refund
Marital
Status
Taxable
Income
Cheat
X
Yes
Single
125K
No
Y
No
Married
100K
No
10
Bayes Classifier
A probabilistic framework for solving classification problems
Conditional Probability:
Bayes theorem:
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Example of Bayes Theorem
Given:
A doctor knows that meningitis causes stiff neck 50% of the time
Prior probability of any patient having meningitis is 1/50,000
Prior probability of any patient having stiff neck is 1/20
If a patient has stiff neck, what’s the probability he/she has meningitis?
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Bayesian Classifiers
Consider each attribute and class label as random variables
Given a record with attributes (A1, A2,…,An)
Goal is to predict class C
Specifically, we want to find the value of C that maximizes P(C A1, A2,…,An )
Can we estimate P(C A1, A2,…,An ) directly from data?
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Bayesian Classifiers
Approach:
compute the posterior probability P(C  A1, A2, …, An) for all values of C using the Bayes theorem
Choose value of C that maximizes
P(C  A1, A2, …, An)
Equivalent to choosing value of C that maximizes
P(A1, A2, …, AnC) P(C)
How to estimate P(A1, A2, …, An  C )?
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Naïve Bayes Classifier
Assume independence among attributes Ai when class is given:
P(A1, A2, …, An C) = P(A1 Cj) P(A2 Cj)… P(An Cj)
Can estimate P(Ai Cj) for all Ai and Cj.
New point is classified to Cj if P(Cj) P(Ai Cj) is maximal.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
How to Estimate Probabilities from Data?
Class: P(C) = Nc/N
e.g., P(No) = 7/10,
P(Yes) = 3/10
For discrete attributes:
P(Ai  Ck) = Aik/ Nc
where Aik is number of instances having attribute Ai and belongs to class Ck
Examples:
P(Status=MarriedNo) = 4/7
P(Refund=YesYes)=0
k
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
How to Estimate Probabilities from Data?
For continuous attributes:
Discretize the range into bins
one ordinal attribute per bin
violates independence assumption
Twoway split: (A < v) or (A > v)
choose only one of the two splits as new attribute
Probability density estimation:
Assume attribute follows a normal distribution
Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
Once probability distribution is known, can use it to estimate the conditional probability P(Aic)
k
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
How to Estimate Probabilities from Data?
Normal distribution:
One for each (Ai,ci) pair
For (Income, Class=No):
If Class=No
sample mean = 110
sample variance = 2975
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Example of Naïve Bayes Classifier
P(XClass=No) = P(Refund=NoClass=No)
P(Married Class=No)
P(Income=120K Class=No)
= 4/7 4/7 0.0072 = 0.0024
P(XClass=Yes) = P(Refund=No Class=Yes)
P(Married Class=Yes)
P(Income=120K Class=Yes)
= 1 0 1.2 109 = 0
Since P(XNo)P(No) > P(XYes)P(Yes)
Therefore P(NoX) > P(YesX)
=> Class = No
Given a Test Record:
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Naïve Bayes Classifier
If one of the conditional probability is zero, then the entire expression becomes zero
Probability estimation:
c: number of classes
p: prior probability
m: parameter
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Example of Naïve Bayes Classifier
A: attributes
M: mammals
N: nonmammals
P(AM)P(M) > P(AN)P(N)
=> Mammals
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
animals2
Name Give Birth Lay Eggs Can Fly Live in Water Have Legs Class
human yes no no no yes mammals
python no yes no no no reptiles
salmon no yes no yes no fishes
whale yes no no yes no mammals
frog no yes no sometimes yes amphibians
komodo no yes no no yes reptiles
bat yes no yes no yes mammals
pigeon no yes yes no yes birds
cat yes no no no yes mammals
leopard shark yes no no yes no fishes
turtle no yes no sometimes yes reptiles
penguin no yes no sometimes yes birds
porcupine yes no no no yes mammals
eel no yes no yes no fishes
salamander no yes no sometimes yes amphibians
gila monster no yes no no yes reptiles
platypus no yes no no yes mammals
owl no yes yes no yes birds
dolphin yes no no yes no mammals
eagle no yes yes no yes birds
Name Give Birth Lay Eggs Can Fly Live in Water Have Legs Class
human yes no no no yes mammals
python no yes no no no nonmammals
salmon no yes no yes no nonmammals
whale yes no no yes no mammals
frog no yes no sometimes yes nonmammals
komodo no yes no no yes nonmammals
bat yes no yes no yes mammals
pigeon no yes yes no yes nonmammals
cat yes no no no yes mammals
leopard shark yes no no yes no nonmammals
turtle no yes no sometimes yes nonmammals
penguin no yes no sometimes yes nonmammals
porcupine yes no no no yes mammals
eel no yes no yes no nonmammals
salamander no yes no sometimes yes nonmammals
gila monster no yes no no yes nonmammals
platypus no yes no no yes mammals
owl no yes yes no yes nonmammals
dolphin yes no no yes no mammals
eagle no yes yes no yes nonmammals
animals2
Name Give Birth Lay Eggs Can Fly Live in Water Have Legs Class
human yes no no no yes mammals
python no yes no no no reptiles
salmon no yes no yes no fishes
whale yes no no yes no mammals
frog no yes no sometimes yes amphibians
komodo no yes no no yes reptiles
bat yes no yes no yes mammals
pigeon no yes yes no yes birds
cat yes no no no yes mammals
leopard shark yes no no yes no fishes
turtle no yes no sometimes yes reptiles
penguin no yes no sometimes yes birds
porcupine yes no no no yes mammals
eel no yes no yes no fishes
salamander no yes no sometimes yes amphibians
gila monster no yes no no yes reptiles
platypus no yes no no yes mammals
owl no yes yes no yes birds
dolphin yes no no yes no mammals
eagle no yes yes no yes birds
Name Give Birth Lay Eggs Can Fly Live in Water Have Legs Class
human yes no no no yes mammals
python no yes no no no nonmammals
salmon no yes no yes no nonmammals
whale yes no no yes no mammals
frog no yes no sometimes yes nonmammals
komodo no yes no no yes nonmammals
bat yes no yes no yes mammals
pigeon no yes yes no yes nonmammals
cat yes no no no yes mammals
leopard shark yes no no yes no nonmammals
turtle no yes no sometimes yes nonmammals
penguin no yes no sometimes yes nonmammals
porcupine yes no no no yes mammals
eel no yes no yes no nonmammals
salamander no yes no sometimes yes nonmammals
gila monster no yes no no yes nonmammals
platypus no yes no no yes mammals
owl no yes yes no yes nonmammals
dolphin yes no no yes no mammals
eagle no yes yes no yes nonmammals
Name Give Birth Lay Eggs Can Fly Live in Water Have Legs Class
human yes no no yes no ?
Naïve Bayes (Summary)
Robust to isolated noise points
Handle missing values by ignoring the instance during probability estimate calculations
Robust to irrelevant attributes
Independence assumption may not hold for some attributes
Use other techniques such as Bayesian Belief Networks (BBN)
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Artificial Neural Networks (ANN)
Output Y is 1 if at least two of the three inputs are equal to 1.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
X1�
X2�
X3�
Y�
Black box�
Output �
Input �
Artificial Neural Networks (ANN)
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
S�
X1�
X2�
X3�
Y�
Black box�
0.3�
0.3�
0.3�
t=0.4�
Output node�
�
Input nodes�
�
�
�
Artificial Neural Networks (ANN)
Model is an assembly of interconnected nodes and weighted links
Output node sums up each of its input value according to the weights of its links
Compare output node against some threshold t
Perceptron Model
or
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
S�
X1�
X2�
X3�
Y�
Black box�
w1�
w2�
w3�
t�
Output node�
�
Input nodes�
�
�
�
General Structure of ANN
Training ANN means learning the weights of the neurons
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
�
Activation function �g(Si )�
Si�
Oi�
I1�
I2�
I3�
wi1�
wi2�
wi3�
Oi�
Neuron i�
Input�
Output�
threshold, t�
Input Layer�
Hidden Layer�
Output Layer�
x1�
x2�
x3�
x4�
x5�
y�
Algorithm for learning ANN
Initialize the weights (w0, w1, …, wk)
Adjust the weights in such a way that the output of ANN is consistent with class labels of training examples
Objective function:
Find the weights wi’s that minimize the above objective function
e.g., backpropagation algorithm (see lecture notes)
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Support Vector Machines
One Possible Solution
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
B1�
Support Vector Machines
Another possible solution
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
B2�
Support Vector Machines
Other possible solutions
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
B2�
Support Vector Machines
Which one is better? B1 or B2?
How do you define better?
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
B1�
B2�
Support Vector Machines
Find hyperplane maximizes the margin => B1 is better than B2
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
B1�
B2�
b11�
b12�
b21�
b22�
margin�
Support Vector Machines
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
B1�
b11�
b12�
Support Vector Machines
We want to maximize:
Which is equivalent to minimizing:
But subjected to the following constraints:
This is a constrained optimization problem
Numerical approaches to solve it (e.g., quadratic programming)
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Support Vector Machines
What if the problem is not linearly separable?
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Support Vector Machines
What if the problem is not linearly separable?
Introduce slack variables
Need to minimize:
Subject to:
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Nonlinear Support Vector Machines
What if decision boundary is not linear?
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Nonlinear Support Vector Machines
Transform data into higher dimensional space
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Ensemble Methods
Construct a set of classifiers from the training data
Predict class label of previously unseen records by aggregating predictions made by multiple classifiers
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
General Idea
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Original Training data�
….�
D1�
D2�
Dt1�
Dt�
D�
Step 1: �Create Multiple Data Sets�
�
C1�
�
C2�
�
Ct 1�
�
Ct�
Step 2: �Build Multiple Classifiers�
C*�
Step 3: �Combine Classifiers�
Why does it work?
Suppose there are 25 base classifiers
Each classifier has error rate, = 0.35
Assume classifiers are independent
Probability that the ensemble classifier makes a wrong prediction:
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Examples of Ensemble Methods
How to generate an ensemble of classifiers?
Bagging
Boosting
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Bagging
Sampling with replacement
Build classifier on each bootstrap sample
Each sample has probability (1 – 1/n)n of being selected
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Boosting
An iterative procedure to adaptively change distribution of training data by focusing more on previously misclassified records
Initially, all N records are assigned equal weights
Unlike bagging, weights may change at the end of boosting round
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Boosting
Records that are wrongly classified will have their weights increased
Records that are classified correctly will have their weights decreased
Example 4 is hard to classify
Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Example: AdaBoost
Base classifiers: C1, C2, …, CT
Error rate:
Importance of a classifier:
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Example: AdaBoost
Weight update:
If any intermediate rounds produce error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated
Classification:
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Illustrating AdaBoost
Data points for training
Initial weights for each data point
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Boosting Round 1�
+�
+�
+�
�
�
�
�
�
�
�
B1�
0.0094�
0.0094�
0.4623�
a = 1.9459�
Illustrating AdaBoost
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Boosting Round 1�
+�
+�
+�
�
�
�
�
�
�
�
Boosting Round 2�
0.3037�
�
�
�
�
�
�
�
�
+�
+�
Boosting Round 3�
+�
+�
+�
+�
+�
+�
+�
+�
+�
+�
Overall�
+�
+�
+�
�
�
�
�
�
+�
+�
0.0009�
0.0422�
0.0276�
0.1819�
0.0038�
B1�
0.0094�
0.0094�
0.4623�
B2�
B3�
a = 1.9459�
a = 2.9323�
a = 3.8744�
YES
YES
NO
NO
NO
NO
NO
NO
Yes
No
{Married}
{Single,
Divorced}
< 80K
> 80K
Taxable
Income
Marital
Status
Refund
Classification Rules
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Tid
Refund
Marital
Status
Taxable
Income
Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced
95K
Yes
6
No
Married
60K
No
7
Yes
Divorced
220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Singl
e
90K
Yes
10
NameGive BirthLay EggsCan FlyLive in WaterHave LegsClass
humanyesnononoyesmammals
pythonnoyesnononoreptiles
salmonnoyesnoyesnofishes
whaleyesnonoyesnomammals
frognoyesnosometimesyesamphibians
komodonoyesnonoyesreptiles
batyesnoyesnoyesmammals
pigeonnoyesyesnoyesbirds
catyesnononoyesmammals
leopard sharkyesnonoyesnofishes
turtlenoyesnosometimesyesreptiles
penguinnoyesnosometimesyesbirds
porcupineyesnononoyesmammals
eelnoyesnoyesnofishes
salamandernoyesnosometimesyesamphibians
gila monsternoyesnonoyesreptiles
platypusnoyesnonoyesmammals
owlnoyesyesnoyesbirds
dolphinyesnonoyesnomammals
eaglenoyesyesnoyesbirds
Give
Birth?
Live In
Water?
Can
Fly?
Mammals
Fishes
Amphibians
Birds
Reptiles
Yes
No
Yes
Sometimes
No
Yes
No
Name
Blood Type
Give Birth
Can Fly
Live in Water
Class
lemur
warm
yes
no
no
?
turtle
cold
no
no
sometimes
?
dogfish shark
cold
yes
no
yes
?
Atr1
………
AtrN
Class
A
B
B
C
A
C
B
Set of Stored Cases
k
n
n
c
+
+
=
1
(ii) Step 1
(iii) Step 2
R1
(iv) Step 3
R1
R2
PREDICTED CLASS
AmphibiansFishesReptilesBirdsMammals
ACTUALAmphibians00002
CLASSFishes03000
Reptiles00301
Birds00121
Mammals02104
PREDICTED CLASS
AmphibiansFishesReptilesBirdsMammals
ACTUALAmphibians20000
CLASSFishes02001
Reptiles10300
Birds10030
Mammals00106
Status =
Single
Status =
Divorced
Status =
Married
Income
> 80K
…
Yes: 3
No: 4
{ }
Yes: 0
No: 3
Refund=
No
Yes: 3
No: 4
Yes: 2
No: 1
Yes: 1
No: 0
Yes: 3
No: 1
(a) Generaltospecific
Name
Blood Type
Give Birth
Can Fly
Live in Water
Class
turtle
cold
no
no
sometimes
?
Tid Refund Marital
Status
Taxable
Income
Class
1 Yes
Single
125K
No
2 No Married 100K
No
3 No
Single
70K
No
4 Yes Married 120K
No
5 No Divorced 95K
Yes
6 No Married 60K
No
7 Yes Divorced 220K
No
8 No
Single
85K
Yes
9 No Married 75K
No
10 No
Single
90K
Yes
10
k
n
kp
n
c
+
+
=
Name
Blood Type
Give Birth
Can Fly
Live in Water
Class
human
warm
yes
no
no
mammals
python
cold
no
no
no
reptiles
salmon
cold
no
no
yes
fishes
whale
warm
yes
no
yes
mammals
frog
cold
no
no
sometimes
amphibians
komodo
cold
no
no
no
reptiles
bat
warm
yes
yes
no
mammals
pigeon
warm
no
yes
no
birds
cat
warm
yes
no
no
mammals
leopard shark
cold
yes
no
yes
fishes
turtle
cold
no
no
sometimes
reptiles
penguin
warm
no
no
sometimes
birds
porcupine
warm
yes
no
no
mammals
eel
cold
no
no
yes
fishes
salamander
cold
no
no
sometimes
amphibians
gila monster
cold
no
no
no
reptiles
platypus
warm
no
no
no
mammals
owl
warm
no
yes
no
birds
dolphin
warm
yes
no
yes
mammals
eagle
warm
no
yes
no
birds
Name
Blood Type
Give Birth
Can Fly
Live in Water
Class
hawk
warm
no
yes
no
?
grizzly bear
warm
yes
no
no
?
Refund=No,
Status=Single,
Income=85K
(Class=Yes)
Refund=No,
Status=Single,
Income=90K
(Class=Yes)
Refund=No,
Status = Single
(Class = Yes)
(b) Specifictogeneral
Rulebased Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Classbased Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income>80K) ==> Yes
(i) Original Data
n
n
c
=
class = +
class = –
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
–
–
–
–
—
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
+
+
+
+
+
+
+
R1
R3R2
+
+
Rule Set
r1: (P=No,Q=No) ==> –
r2: (P=No,Q=Yes) ==> +
r3: (P=Yes,R=No) ==> +
r4: (P=Yes,R=Yes,Q=No) ==> –
r5: (P=Yes,R=Yes,Q=Yes) ==> +
P
QR
Q
++
+
NoNo
No
Yes
Yes
Yes
NoYes
Atr1
………
AtrN
Unseen Case
Unknown record
X
X
X
(a) 1nearest neighbor
(b) 2nearest neighbor
(c) 3nearest neighbor
å
–
=
i
i
i
q
p
q
p
d
2
)
(
)
,
(
X
å
–
=
i
i
i
n
n
n
n
V
V
d
2
2
1
1
2
1
)
,
(
Tid
Refund
Marital
Status
Taxable
Income
Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced
95K
Yes
6
No
Married
60K
No
7
Yes
Divorced
220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
10
å
=
=
D
d
i
i
i
Y
X
Y
X
d
w
w
Y
X
1
2
)
,
(
)
,
(
Tid Refund Marital
Status
Taxable
Income
Cheat
X Yes Single 125K
No
Y No Married 100K
No
10
correctly
predicts
X
times
of
Number
prediction
for
used
is
X
times
of
Number
=
X
w
)
(
)
(
)

(
)

(
A
P
C
P
C
A
P
A
C
P
=
)
(
)
,
(
)

(
)
(
)
,
(
)

(
C
P
C
A
P
C
A
P
A
P
C
A
P
A
C
P
=
=
0002
.
0
20
/
1
50000
/
1
5
.
0
)
(
)
(
)

(
)

(
=
´
=
=
S
P
M
P
M
S
P
S
M
P
)
(
)
(
)

(
)

(
2
1
2
1
2
1
n
n
n
A
A
A
P
C
P
C
A
A
A
P
A
A
A
C
P
K
K
K
=
Tid
Refund
Marital
Status
Taxable
Income
Evade
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced
95K
Yes
6
No
Married
60K
No
7
Yes
Divorced
220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Singl
e
90K
Yes
10
categorical
categorical
continuous
class
2
2
2
)
(
2
2
1
)

(
ij
ij
i
A
ij
j
i
e
c
A
P
s
m
ps
–
–
=
0072
.
0
)
54
.
54
(
2
1
)

120
(
)
2975
(
2
)
110
120
(
2
=
=
=
–
–
e
No
Income
P
p
P(Refund=YesNo) = 3/7
P(Refund=NoNo) = 4/7
P(Refund=YesYes) = 0
P(Refund=NoYes) = 1
P(Marital Status=SingleNo) = 2/7
P(Marital Status=DivorcedNo)=1/7
P(Marital Status=MarriedNo) = 4/7
P(Marital Status=SingleYes) = 2/7
P(Marital Status=DivorcedYes)=1/7
P(Marital Status=MarriedYes) = 0
For taxable income:
If class=No:
sample mean=110
sample variance=2975
If class=Yes:
sample mean=90
sample variance=25
naive Bayes Classifier:
120K)
Income
Married,
No,
Refund
(
=
=
=
X
m
N
mp
N
C
A
P
c
N
N
C
A
P
N
N
C
A
P
c
ic
i
c
ic
i
c
ic
i
+
+
=
+
+
=
=
)

(
:
estimate
–
m
1
)

(
:
Laplace
)

(
:
Original
NameGive BirthCan FlyLive in WaterHave LegsClass
humanyesnonoyesmammals
pythonnonononononmammals
salmonnonoyesnononmammals
whaleyesnoyesnomammals
frognonosometimesyesnonmammals
komodonononoyesnonmammals
batyesyesnoyesmammals
pigeonnoyesnoyesnonmammals
catyesnonoyesmammals
leopard sharkyesnoyesnononmammals
turtlenonosometimesyesnonmammals
penguinnonosometimesyesnonmammals
porcupineyesnonoyesmammals
eelnonoyesnononmammals
salamandernonosometimesyesnonmammals
gila monsternononoyesnonmammals
platypusnononoyesmammals
owlnoyesnoyesnonmammals
dolphinyesnoyesnomammals
eaglenoyesnoyesnonmammals
Give BirthCan FlyLive in WaterHave LegsClass
yesnoyesno?
0027
.
0
20
13
004
.
0
)
(
)

(
021
.
0
20
7
06
.
0
)
(
)

(
0042
.
0
13
4
13
3
13
10
13
1
)

(
06
.
0
7
2
7
2
7
6
7
6
)

(
=
´
=
=
´
=
=
´
´
´
=
=
´
´
´
=
N
P
N
A
P
M
P
M
A
P
N
A
P
M
A
P
X
1
X
2
X
3
Y
1000
1011
1101
1111
0010
0100
0111
0000
X
1
X
2
X
3
Y
Black box
Output
Input
X
1
X
2
X
3
Y
1000
1011
1101
1111
0010
0100
0111
0000
X
1
X
2
X
3
Y
Black box
0.3
0.3
0.3
t=0.4
Output
node
Input
nodes
î
í
ì
=
>
–
+
+
=
otherwise
0
true
is
if
1
)
(
where
)
0
4
.
0
3
.
0
3
.
0
3
.
0
(
3
2
1
z
z
I
X
X
X
I
Y
X
1
X
2
X
3
Y
Black box
w
1
t
Output
node
Input
nodes
w
2
w
3
)
(
t
X
w
I
Y
i
i
i
–
=
å
)
(
t
X
w
sign
Y
i
i
i
–
=
å
Activation
function
g(S
i
)
S
i
O
i
I
1
I
2
I
3
w
i1
w
i2
w
i3
O
i
Neuron iInputOutput
threshold, t
Input
Layer
Hidden
Layer
Output
Layer
x
1
x
2
x
3
x
4
x
5
y
[
]
2
)
,
(
å
–
=
i
i
i
i
X
w
f
Y
E
B
1
B
2
B
1
B
2
B
1
B
2
b
11
b
12
b
21
b
22
margin
B
1
b
11
b
12
0
=
+
·
b
x
w
r
r
1
–
=
+
·
b
x
w
r
r
1
+
=
+
·
b
x
w
r
r
î
í
ì
–
£
+
·
–
³
+
·
=
1
b
x
w
if
1
1
b
x
w
if
1
)
(
r
r
r
r
r
x
f
2


2
Margin
w
r
=
î
í
ì
–
£
+
·
–
³
+
·
=
1
b
x
w
if
1
1
b
x
w
if
1
)
(
i
i
r
r
r
r
r
i
x
f
2


)
(
2
w
w
L
r
=
î
í
ì
+
–
£
+
·
–
³
+
·
=
i
i
i
i
1
b
x
w
if
1
–
1
b
x
w
if
1
)
(
x
x
r
r
r
r
r
i
x
f
÷
ø
ö
ç
è
æ
+
=
å
=
N
i
k
i
C
w
w
L
1
2
2


)
(
x
r
Original
Training data
….
D
1
D
2
D
t1
D
t
D
Step 1:
Create Multiple
Data Sets
C
1
C
2
C
t 1
C
t
Step 2:
Build Multiple
Classifiers
C
*
Step 3:
Combine
Classifiers
å
=
–
=
–
÷
÷
ø
ö
ç
ç
è
æ
25
13
25
06
.
0
)
1
(
25
i
i
i
i
e
e
Original Data
1
2
3
4
5
6
7
8
9
10
Bagging (Round 1)
7
8
10
8
2
5
10
10
5
9
Bagging (Round 2)
1
4
9
1
2
3
2
7
3
2
Bagging (Round 3)
1
8
5
10
5
5
9
6
3
7
Original Data
1
2
3
4
5
6
7
8
9
10
Boosting (Round 1)
7
3
2
8
7
9
4
10
6
3
Boosting (Round 2)
5
4
9
4
2
5
1
7
4
2
Boosting (Round 3)
4
4
8
10
4
5
4
6
3
4
(
)
å
=
¹
=
N
j
j
j
i
j
i
y
x
C
w
N
1
)
(
1
d
e
÷
÷
ø
ö
ç
ç
è
æ
–
=
i
i
i
e
e
a
1
ln
2
1
factor
ion
normalizat
the
is
where
)
(
if
exp
)
(
if
exp
)
(
)
1
(
j
i
i
j
i
i
j
j
j
i
j
i
Z
y
x
C
y
x
C
Z
w
w
j
j
ï
î
ï
í
ì
¹
=
=
–
+
a
a
(
)
å
=
=
=
T
j
j
j
y
y
x
C
x
C
1
)
(
max
arg
)
(
*
d
a
Boosting
Round 1
+++
——
0.00940.00940.4623
B1
= 1.9459
Original
Data
+++
—–
++
0.1
0.10.1
Boosting
Round 1
+++
——
Boosting
Round 2
——–
++
Boosting
Round 3
++++++++++
Overall
+++
—–
++
0.00940.00940.4623
0.3037
0.00090.0422
0.0276
0.1819
0.0038
B1
B2
B3
= 1.9459
= 2.9323
= 3.8744
Data Mining: Introduction
Lecture Notes for Chapter 1
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Lots of data is being collected
and warehoused
Web data, ecommerce
purchases at department/
grocery stores
Bank/Credit Card
transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Commercial Viewpoint
Why Mine Data? Scientific Viewpoint
Data collected and stored at
enormous speeds (GB/hour)
remote sensors on a satellite
telescopes scanning the skies
microarrays generating gene
expression data
scientific simulations
generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
in classifying and segmenting data
in Hypothesis Formation
Mining Large Data Sets – Motivation
There is often information “hidden” in the data that is
not readily evident
Human analysts may take weeks to discover useful information
Much of the data is never analyzed at all
The Data Gap
Total new disk (TB) since 1995
Number of analysts
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
disks
Units Capacity PBs
1995 89,054 104.8
1996 105,686 183.9
1997 129,281 343.63
1998 143,649 724.36
1999 165,857 1394.6
2000 187,835 2553.7
2001 212,800 4641
2002 239,138 8119
2003 268,227 13027
1995 104.8
1996 183.9
1997 343.63
1998 724.36
1999 1394.6
2000 2553.7
2001 4641
2002 8119
2003 13027
disks
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
chart data gap
1995 1995
1996 1996
1997 1997
1998 1998
1999 1999
26535
105700
27229
227400
27245
425330
27309
891970
25953
1727000
chart data gap 2
1995 1995
1996 1996
1997 1997
1998 1998
1999 1999
26535
105700
27229
333100
27245
758430
27309
1650400
25953
3377400
data gap
Ph.D. Petabytes Terabytes Total TBs PBs
1995 105.7 105700 105700 105.7
1996 227.4 227400 333100 333.1
1997 425.33 425330 758430 758.43
1998 891.97 891970 1650400 1650.4
1999 1727 1727000 3377400 3377.4
2000 5792 5792000 9169400 9169.4
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
Science and engineering Ph.D.s, total 22,868 24,023 24,675 25,443 26,205 26,535 27,229 27,245 27,309 25,953
105700 333100 758430 1650400 3377400
105700 333100 758430 1650400 3377400
Sheet3
What is Data Mining?
Many Definitions
Nontrivial extraction of implicit, previously unknown and potentially useful information from data
Exploration & analysis, by automatic or
semiautomatic means, of
large quantities of data
in order to discover
meaningful patterns
What is (not) Data Mining?
What is Data Mining?
Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area)
Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com,)
What is not Data Mining?
Look up phone number in phone directory
Query a Web search engine for information about “Amazon”
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional Techniques
may be unsuitable due to
Enormity of data
High dimensionality
of data
Heterogeneous,
distributed nature
of data
Origins of Data Mining
Machine Learning/
Pattern
Recognition
Statistics/
AI
Data Mining
Database systems
Data Mining Tasks
Prediction Methods
Use some variables to predict unknown or future values of other variables.
Description Methods
Find humaninterpretable patterns that describe the data.
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks…
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Classification: Definition
Given a collection of records (training set )
Each record contains a set of attributes, one of the attributes is the class.
Find a model for class attribute as a function of the values of other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
Classification Example
categorical
categorical
continuous
class
Training
Set
Learn
Classifier
Test
Set
Model
Tid
Refund
Marital
Status
Taxable
Income
Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced
95K
Yes
6
No
Married
60K
No
7
Yes
Divorced
220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
10
Refund
Marital
Status
Taxable
Income
Cheat
No
Single
75K
?
Yes
Married
50K
?
No
Married
150K
?
Yes
Divorced
90K
?
No
Single
40K
?
No
Married
80K
?
10
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cellphone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
Collect various demographic, lifestyle, and companyinteraction related information about all such customers.
Type of business, where they stay, how much they earn, etc.
Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on its accountholder as attributes.
When does a customer buy, what does he buy, how often he pays on time, etc
Label past transactions as fraud or fair transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.
Classification: Application 3
Customer Attrition/Churn:
Goal: To predict whether a customer is likely to be lost to a competitor.
Approach:
Use detailed record of transactions with each of the past and present customers, to find attributes.
How often the customer calls, where he calls, what timeofthe day he calls most, his financial status, marital status, etc.
Label the customers as loyal or disloyal.
Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 4
Sky Survey Cataloging
Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
3000 images with 23,040 x 23,040 pixels per image.
Approach:
Segment the image.
Measure image attributes (features) – 40 of them per object.
Model the class based on these features.
Success Story: Could find 16 new high redshift quasars, some of the farthest objects that are difficult to find!
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies
Early
Intermediate
Late
Data Size:
72 million stars, 20 million galaxies
Object Catalog: 9 GB
Image Database: 150 GB
Class:
Stages of Formation
Attributes:
Image features,
Characteristics of light waves received, etc.
Courtesy: http://aps.umn.edu
Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problemspecific Measures.
Illustrating Clustering
Euclidean Distance Based Clustering in 3D space.
Intracluster distances
are minimized
Intercluster distances
are maximized
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
Clustering: Application 2
Document Clustering:
Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
Illustrating Document Clustering
Clustering Points: 3204 Articles of Los Angeles Times.
Similarity Measure: How many words are common in these documents (after some word filtering).
Category
Total Articles
Correctly Placed
Financial
555
364
Foreign
341
260
National
273
36
Metro
943
746
Sports
738
573
Entertainment
354
278
Clustering of S&P 500 Stock Data
Observe Stock Movements every day.
Clustering points: Stock{UP/DOWN}
Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
We used association rules to quantify a similarity measure.
Discovered Clusters
Industry Group
1
AppliedMatlDOWN,BayNetworkDown,3COMDOWN,
CabletronSysDOWN,CISCODOWN,HPDOWN,
DSCCommDOWN,INTELDOWN,LSILogicDOWN,
MicronTechDOWN,TexasInstDown,TellabsIncDown,
NatlSemiconductDOWN,OraclDOWN,SGIDOWN,
SunDOWN
Technology1DOWN
2
AppleCompDOWN,AutodeskDOWN,DECDOWN,
ADVMicroDeviceDOWN,AndrewCorpDOWN,
ComputerAssocDOWN,CircuitCityDOWN,
CompaqDOWN, EMCCorpDOWN, GenInstDOWN,
MotorolaDOWN,MicrosoftDOWN,ScientificAtlDOWN
Technology2DOWN
3
FannieMaeDOWN,FedHomeLoanDOWN,
MBNACorpDOWN,MorganStanleyDOWN
FinancialDOWN
4
BakerHughesUP,DresserIndsUP,HalliburtonHLDUP,
LouisianaLandUP,PhillipsPetroUP,UnocalUP,
SchlumbergerUP
OilUP
Association Rule Discovery: Definition
Given a set of records each of which contain some number of items from a given collection;
Produce dependency rules which will predict occurrence of an item based on occurrences of other items.
Rules Discovered:
{Milk} –> {Coke}
{Diaper, Milk} –> {Beer}
TID
Items
1
Bread, Coke, Milk
2
Beer, Bread
3
Beer, Coke, Diaper, Milk
4
Beer, Bread, Diaper, Milk
5
Coke, Diaper, Milk
Association Rule Discovery: Application 1
Marketing and Sales Promotion:
Let the rule discovered be
{Bagels, … } –> {Potato Chips}
Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!
Association Rule Discovery: Application 2
Supermarket shelf management.
Goal: To identify items that are bought together by sufficiently many customers.
Approach: Process the pointofsale data collected with barcode scanners to find dependencies among items.
A classic rule —
If a customer buys diaper and milk, then he is very likely to buy beer.
So, don’t be surprised if you find sixpacks stacked next to diapers!
Association Rule Discovery: Application 3
Inventory Management:
Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with right parts to reduce on number of visits to consumer households.
Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the cooccurrence patterns.
Sequential Pattern Discovery: Definition
Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
Rules are formed by first disovering patterns. Event occurrences in the patterns are governed by timing constraints.
(A B) (C) (D E)
<= ms
<= xg
>ng
<= ws
(A B) (C) (D E)
Sequential Pattern Discovery: Examples
In telecommunications alarm logs,
(Inverter_Problem Excessive_Line_Current)
(Rectifier_Alarm) > (Fire_Alarm)
In pointofsale transaction sequences,
Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) –> (Perl_for_dummies,Tcl_Tk)
Athletic Apparel Store:
(Shoes) (Racket, Racketball) –> (Sports_Jacket)
Regression
Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Greatly studied in statistics, neural network fields.
Examples:
Predicting sales amounts of new product based on advetising expenditure.
Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.
Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
Credit Card Fraud Detection
Network Intrusion
Detection
Typical network traffic at University level may reach over 100 million connections per day
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data
0
500,000
1,000,000
1,500,000
2,000,000
2,500,000
3,000,000
3,500,000
4,000,000
19951996199719981999
Tid
Refund
Marital
Status
Taxable
Income
Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced
95K
Yes
6
No
Married
60K
No
7
Yes
Divorced
220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
10
Refund
Marital
Status
Taxable
Income
Cheat
No
Single
75K
?
Yes
Married
50K
?
No
Married
150K
?
Yes
Divorced
90K
?
No
Single
40K
?
No
Married
80K
?
10
Category
Total
Articles
Correctly
Placed
Financial
555
364
Foreign
341
260
National
273
36
Metro
943
746
Sports
738
573
Entertainment
354
278
Discovered Clusters
Industry Group
1
AppliedMatlDOWN,BayNetworkDown,3COMDOWN,
CabletronSysDOWN,CISCODOWN,HPDOWN,
DSCCommDOWN,INTELDOWN,LSILogicDOWN,
MicronTechDOWN,TexasInstDown,TellabsIncDown,
NatlSemiconductDOWN,OraclDOWN,SGIDOWN,
SunDOWN
Technology1DOWN
2
AppleComp
DOWN,AutodeskDOWN,DECDOWN,
ADVMicroDevice
DOWN,AndrewCorpDOWN,
ComputerAssocDOWN,CircuitCityDOWN,
CompaqDOWN, EMCCorpDOWN,
GenInstDOWN,
MotorolaDOWN,MicrosoftDOWN,ScientificAtlDOWN
Technology2DOWN
3
FannieMae
DOWN,FedHomeLoanDOWN,
MBNACorp
DOWN,MorganStanleyDOWN
FinancialDOWN
4
BakerHughesUP,DresserIndsUP,HalliburtonHLDUP,
LouisianaLandUP,PhillipsPetroUP,UnocalUP,
SchlumbergerUP
OilUP
TID
Items
1
Bread, Coke, Milk
2
Beer, Bread
3
Beer, Coke, Diaper, Milk
4
Beer, Bread, Diaper, Milk
5
Coke, Diaper, Milk
Data Mining: Data
Lecture Notes for Chapter 2
Introduction to Data Mining
by
Tan, Steinbach, Kumar
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
What is Data?
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
Examples: eye color of a person, temperature, etc.
Attribute is also known as variable, field, characteristic, or feature
A collection of attributes describe an object
Object is also known as record, point, case, sample, entity, or instance
Attributes
Objects
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Attribute Values
Attribute values are numbers or symbols assigned to an attribute
Distinction between attributes and attribute values
Same attribute can be mapped to different attribute values
Example: height can be measured in feet or meters
Different attributes can be mapped to the same set of values
Example: Attribute values for ID and age are integers
But properties of attribute values can be different
ID has no limit but age has a maximum and minimum value
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Measurement of Length
The way you measure an attribute is somewhat may not match the attributes properties.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Types of Attributes
There are different types of attributes
Nominal
Examples: ID numbers, eye color, zip codes
Ordinal
Examples: rankings (e.g., taste of potato chips on a scale from 110), grades, height in {tall, medium, short}
Interval
Examples: calendar dates, temperatures in Celsius or Fahrenheit.
Ratio
Examples: temperature in Kelvin, length, time, counts
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
Distinctness: =
Order: < >
Addition: + –
Multiplication: * /
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Attribute Type
Description
Examples
Operations
Nominal
The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, )
zip codes, employee ID numbers, eye color, sex: {male, female}
mode, entropy, contingency correlation, 2 test
Ordinal
The values of an ordinal attribute provide enough information to order objects. (<, >)
hardness of minerals, {good, better, best},
grades, street numbers
median, percentiles, rank correlation, run tests, sign tests
Interval
For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists.
(+, – )
calendar dates, temperature in Celsius or Fahrenheit
mean, standard deviation, Pearson’s correlation, t and F tests
Ratio
For ratio variables, both differences and ratios are meaningful. (*, /)
temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
geometric mean, harmonic mean, percent variation
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Attribute Level
Transformation
Comments
Nominal
Any permutation of values
If all employee ID numbers were reassigned, would it make any difference?
Ordinal
An order preserving change of values, i.e.,
new_value = f(old_value)
where f is a monotonic function.
An attribute encompassing the notion of good, better best can be represented equally well by the values {1, 2, 3} or by { 0.5, 1, 10}.
Interval
new_value =a * old_value + b where a and b are constants
Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
Ratio
new_value = a * old_value
Length can be measured in meters or feet.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Discrete and Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection of documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a finite number of digits.
Continuous attributes are typically represented as floatingpoint variables.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Types of data sets
Record
Data Matrix
Document Data
Transaction Data
Graph
World Wide Web
Molecular Structures
Ordered
Spatial Data
Temporal Data
Sequential Data
Genetic Sequence Data
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Important Characteristics of Structured Data
Dimensionality
Curse of Dimensionality
Sparsity
Only presence counts
Resolution
Patterns depend on the scale
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Tid
Refund
Marital
Status
Taxable
Income
Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced
95K
Yes
6
No
Married
60K
No
7
Yes
Divorced
220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
10
Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute
Such data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Document Data
Each document becomes a `term’ vector,
each term is a component (attribute) of the vector,
the value of each component is the number of times the corresponding term occurs in the document.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Document 1�
season�
timeout�
lost�
win�
game�
score�
ball�
play�
coach�
team�
Document 2�
Document 3�
3�
0�
5�
0�
2�
6�
0�
2�
0�
2�
0�
0�
7�
0�
2�
1�
0�
0�
3�
0�
0�
1�
0�
0�
1�
2�
2�
0�
3�
0�
Transaction Data
A special type of record data, where
each record (transaction) involves a set of items.
For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
TID
Items
1
Bread, Coke, Milk
2
Beer, Bread
3
Beer, Coke, Diaper, Milk
4
Beer, Bread, Diaper, Milk
5
Coke, Diaper, Milk
Graph Data
Examples: Generic graph and HTML Links
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Chemical Data
Benzene Molecule: C6H6
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Ordered Data
Sequences of transactions
An element of the sequence
Items/Events
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Ordered Data
Genomic sequence data
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Ordered Data
SpatioTemporal Data
Average Monthly Temperature of land and ocean
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
Noise and outliers
missing values
duplicate data
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Noise
Noise refers to modification of original values
Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen
Two Sine Waves
Two Sine Waves + Noise
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Outliers
Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Missing Values
Reasons for missing values
Information is not collected
(e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Handling missing values
Eliminate Data Objects
Estimate Missing Values
Ignore the Missing Value During Analysis
Replace with all possible values (weighted by their probabilities)
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Duplicate Data
Data set may include data objects that are duplicates, or almost duplicates of one another
Major issue when merging data from heterogeous sources
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object)
Purpose
Data reduction
Reduce the number of attributes or objects
Change of scale
Cities aggregated into regions, states, countries, etc
More “stable” data
Aggregated data tends to have less variability
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Aggregation
Standard Deviation of Average Monthly Precipitation
Standard Deviation of Average Yearly Precipitation
Variation of Precipitation in Australia
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Sampling
Sampling is the main technique employed for data selection.
It is often used for both the preliminary investigation of the data and the final data analysis.
Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Sampling …
The key principle for effective sampling is the following:
using a sample will work almost as well as using the entire data sets, if the sample is representative
A sample is representative if it has approximately the same property (of interest) as the original set of data
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Types of Sampling
Simple Random Sampling
There is an equal probability of selecting any particular item
Sampling without replacement
As each item is selected, it is removed from the population
Sampling with replacement
Objects are not removed from the population as they are selected for the sample.
In sampling with replacement, the same object can be picked up more than once
Stratified sampling
Split the data into several partitions; then draw random samples from each partition
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Sample Size
8000 points 2000 Points 500 Points
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Sample Size
What sample size is necessary to get at least one object from each of 10 groups.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse in the space that it occupies
Definitions of density and distance between points, which is critical for clustering and outlier detection, become less meaningful
Randomly generate 500 points
Compute difference between max and min distance between any pair of points
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Dimensionality Reduction
Purpose:
Avoid curse of dimensionality
Reduce amount of time and memory required by data mining algorithms
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce noise
Techniques
Principle Component Analysis
Singular Value Decomposition
Others: supervised and nonlinear techniques
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Dimensionality Reduction: PCA
Goal is to find a projection that captures the largest amount of variation in data
x2
x1
e
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Dimensionality Reduction: PCA
Find the eigenvectors of the covariance matrix
The eigenvectors define the new space
x2
x1
e
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Dimensionality Reduction: ISOMAP
Construct a neighbourhood graph
For each pair of points in the graph, compute the shortest path distances – geodesic distances
By: Tenenbaum, de Silva, Langford (2000)
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Dimensionality Reduction: PCA
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
duplicate much or all of the information contained in one or more other attributes
Example: purchase price of a product and the amount of sales tax paid
Irrelevant features
contain no information that is useful for the data mining task at hand
Example: students’ ID is often irrelevant to the task of predicting students’ GPA
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Feature Subset Selection
Techniques:
Bruteforce approch:
Try all possible feature subsets as input to data mining algorithm
Embedded approaches:
Feature selection occurs naturally as part of the data mining algorithm
Filter approaches:
Features are selected before data mining algorithm is run
Wrapper approaches:
Use the data mining algorithm as a black box to find best subset of attributes
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Feature Creation
Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
Three general methodologies:
Feature Extraction
domainspecific
Mapping Data to New Space
Feature Construction
combining features
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Mapping Data to a New Space
Two Sine Waves
Two Sine Waves + Noise
Frequency
Fourier transform
Wavelet transform
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Discretization Using Class Labels
Entropy based approach
3 categories for both x and y
5 categories for both x and y
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Discretization Without Using Class Labels
Data
Equal interval width
Equal frequency
Kmeans
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Attribute Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
Simple functions: xk, log(x), ex, x
Standardization and Normalization
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects are.
Is higher when objects are more alike.
Often falls in the range [0,1]
Dissimilarity
Numerical measure of how different are two data objects
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Proximity refers to a similarity or dissimilarity
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Euclidean Distance
Euclidean Distance
Where n is the number of dimensions (attributes) and pk and qk are, respectively, the kth attributes (components) or data objects p and q.
Standardization is necessary, if scales differ.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Euclidean Distance
Distance Matrix
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Sheet1
point x y
0 2
p2 2 0
p3 3 1
p4 5 1
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
p1
Sheet2
Sheet3
Sheet1
point x y
0 2
p2 2 0
p3 3 1
p4 5 1
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
p1
Sheet2
Sheet3
Minkowski Distance
Minkowski Distance is a generalization of Euclidean Distance
Where r is a parameter, n is the number of dimensions (attributes) and pk and qk are, respectively, the kth attributes (components) or data objects p and q.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Minkowski Distance: Examples
r = 1. City block (Manhattan, taxicab, L1 norm) distance.
A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors
r = 2. Euclidean distance
r . “supremum” (Lmax norm, L norm) distance.
This is the maximum difference between any component of the vectors
Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Minkowski Distance
Distance Matrix
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Sheet1
point x y
0 2
p2 2 0
p3 3 1
p4 5 1
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
L1 p1 p2 p3 p4
p1 0 4 4 6
p2 4 0 2 4
p3 4 2 0 2
p4 6 4 2 0
L2 p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
p1 p2 p3 p4
p1 0 2 3 5
p2 2 0 1 3
p3 3 1 0 2
p4 5 3 2 0
p1
Sheet2
Sheet3
Sheet1
point x y
0 2
p2 2 0
p3 3 1
p4 5 1
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
L1 p1 p2 p3 p4
p1 0 4 4 6
p2 4 0 2 4
p3 4 2 0 2
p4 6 4 2 0
L2 p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
p1 p2 p3 p4
p1 0 2 3 5
p2 2 0 1 3
p3 3 1 0 2
p4 5 3 2 0
p1
Sheet2
Sheet3
Sheet1
point x y
0 2
p2 2 0
p3 3 1
p4 5 1
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
L1 p1 p2 p3 p4
p1 0 4 4 6
p2 4 0 2 4
p3 4 2 0 2
p4 6 4 2 0
L2 p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
p1 p2 p3 p4
p1 0 2 3 5
p2 2 0 1 3
p3 3 1 0 2
p4 5 3 2 0
p1
Sheet2
Sheet3
Sheet1
point x y
0 2
p2 2 0
p3 3 1
p4 5 1
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
L1 p1 p2 p3 p4
p1 0 4 4 6
p2 4 0 2 4
p3 4 2 0 2
p4 6 4 2 0
L2 p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
p1 p2 p3 p4
p1 0 2 3 5
p2 2 0 1 3
p3 3 1 0 2
p4 5 3 2 0
p1
Sheet2
Sheet3
Mahalanobis Distance
For red points, the Euclidean distance is 14.7, Mahalanobis distance is 6.
is the covariance matrix of the input data X
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Mahalanobis Distance
Covariance Matrix:
B
A
C
A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)
Mahal(A,B) = 5
Mahal(A,C) = 4
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Common Properties of a Distance
Distances, such as the Euclidean distance, have some well known properties.
d(p, q) 0 for all p and q and d(p, q) = 0 only if
p = q. (Positive definiteness)
d(p, q) = d(q, p) for all p and q. (Symmetry)
d(p, r) d(p, q) + d(q, r) for all points p, q, and r.
(Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.
A distance that satisfies these properties is a metric
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Common Properties of a Similarity
Similarities, also have some well known properties.
s(p, q) = 1 (or maximum similarity) only if p = q.
s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data objects), p and q.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Similarity Between Binary Vectors
Common situation is that objects, p and q, have only binary attributes
Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of notbothzero attributes values
= (M11) / (M01 + M10 + M11)
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Cosine Similarity
If d1 and d2 are two document vectors, then
cos( d1, d2 ) = (d1 d2) / d1 d2 ,
where indicates vector dot product and  d  is the length of vector d.
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
d1 = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)0.5 = (42) 0.5 = 6.481
d2 = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2) 0.5 = (6) 0.5 = 2.245
cos( d1, d2 ) = .3150
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Extended Jaccard Coefficient (Tanimoto)
Variation of Jaccard for continuous or count attributes
Reduces to Jaccard for binary attributes
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Correlation
Correlation measures the linear relationship between objects
To compute correlation, we standardize data objects, p and q, and then take their dot product
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Visually Evaluating Correlation
Scatter plots showing the similarity from –1 to 1.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
General Approach for Combining Similarities
Sometimes attributes are of many different types, but an overall similarity is needed.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Using Weights to Combine Similarities
May not want to treat all attributes the same.
Use weights wk which are between 0 and 1 and sum to 1.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Density
Densitybased clustering require a notion of density
Examples:
Euclidean density
Euclidean density = number of points per unit volume
Probability density
Graphbased density
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Euclidean Density – Cellbased
Simplest approach is to divide region into a number of rectangular cells of equal volume and define density as # of points the cell contains
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Euclidean Density – Centerbased
Euclidean density is the number of points within a specified radius of the point
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
1.1
2.2
16.22
6.25
12.65
1.2
2.7
15.22
5.27
10.23
Thickness
Load
Distance
Projection
of y load
Projection
of x Load
1.1
2.2
16.22
6.25
12.65
1.2
2.7
15.22
5.27
10.23
Thickness
Load
Distance
Projection
of y load
Projection
of x Load
Tid
Refund
Marital
Status
Taxable
Income
Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced
95K
Yes
6
No
Married
60K
No
7
Yes
Divorced
220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Singl
e
90K
Yes
10
Document 1
s
e
a
s
o
n
t
i
m
e
o
u
t
l
o
s
t
w
i
n
g
a
m
e
s
c
o
r
e
b
a
l
l
p
l
a
y
c
o
a
c
h
t
e
a
m
Document 2
Document 3
3050260202
0
0
702100300
100122030
TID
Items
1
Bread, Coke, Milk
2
Beer, Bread
3
Beer, Coke, Diaper, Milk
4
Beer, Bread, Diaper, Milk
5
Coke, Diaper, Milk
5
2
1
2
5
Data Mining
Graph Partitioning
Parallel Solution of Sparse Linear System of Equations
NBody Computation and Dense Linear System Solvers
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
1
2
3
5
5
7
8
15
10
4
A
B
C
D
E
Dimensions = 10
Dimensions = 40
Dimensions = 80
Dimensions = 120
Dimensions = 160
Dimensions = 206
å
=
–
=
n
k
k
k
q
p
dist
1
2
)
(
0
1
2
3
0
1
2
3
4
5
6
p1
p2
p3
p4
pointxy
p102
p220
p331
p451
p1p2p3p4
p102.8283.1625.099
p22.82801.4143.162
p33.1621.41402
p45.0993.16220
r
n
k
r
k
k
q
p
dist
1
1
)


(
å
=
–
=
pointxy
p102
p220
p331
p451
L1p1p2p3p4
p10446
p24024
p34202
p46420
L2p1p2p3p4
p102.8283.1625.099
p22.82801.4143.162
p33.1621.41402
p45.0993.16220
L
p1p2p3p4
p10235
p22013
p33102
p45320
T
q
p
q
p
q
p
s
mahalanobi
)
(
)
(
)
,
(
1
–
å
–
=
–
å
=
–
–
–
=
S
n
i
k
ik
j
ij
k
j
X
X
X
X
n
1
,
)
)(
(
1
1
ú
û
ù
ê
ë
é
=
S
3
.
0
2
.
0
2
.
0
3
.
0
)
(
/
))
(
(
p
std
p
mean
p
p
k
k
–
=
¢
)
(
/
))
(
(
q
std
q
mean
q
q
k
k
–
=
¢
q
p
q
p
n
correlatio
¢
·
¢
=
)
,
(
Data
Mining: Exploring Data
Lecture Notes for Chapter 3
Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler
Look for accompanying R
code on the course web site.
© Tan,Steinbach, Kumar Introduction to Data Mining 8/05/2005 ‹#›
Techniques Used In Data Exploration
In EDA, as originally defined by Tukey
– The focus was on visualization
– Clustering and anomaly detection were viewed as
exploratory techniques
– In data mining, clustering and anomaly detection are
major areas of interest, and not thought of as just
exploratory
In our discussion of data exploration, we focus on
– Summary statistics
–
Visualization
– Online Analytical Processing (OLAP)
What is data exploration?
• Key motivations of data exploration include
– Helping to select the right tool for preprocessing or analysis
– Making use of humans’ abilities to recognize patterns
• People can recognize patterns not captured by data analysis
tools
• Related to the area of Exploratory Data Analysis (EDA)
– Created by statistician John Tukey
– Seminal book is “Exploratory Data Analysis” by Tukey
– A nice online introduction can be found in Chapter 1 of the NIST
Engineering Statistics Handbook
http://www.itl.nist.gov/div898/handbook/index.htm
A preliminary exploration of the data to
better understand its characteristics.
Iris Sample Data Set
• Many of the exploratory data techniques are illustrated
with the Iris Plant data set.
– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician R.A. Fisher
– Three flower types (classes):
•
Setosa
•
Virginica
• Versicolour
– Four (nonclass) attributes
• Sepal width and length
• Petal width and length Virginica. Robert H. Mohlenbrock. USDA
NRCS. 1995. Northeast wetland flora: Field
office guide to plant species. Northeast
National Technical Center, Chester, PA.
Courtesy of USDA NRCS Wetland Science
Institute.
Summary Statistics
• Summary statistics are numbers that summarize
properties of the data
– Summarized properties include location and
spread for continuous data
• Examples: location – mean
spread – standard deviation
Most summary statistics can be calculated in a
single pass through the data
Frequency and Mode
• The frequency of an attribute value is the
percentage of time the value occurs in the
data set
– For example, given the attribute ‘gender’ and a
representative population of people, the gender
‘female’ occurs about 50% of the time.
• The mode of an attribute is the most frequent attribute
value
• The notions of frequency and mode are typically used
with categorical data
Measures of Location: Mean and
Median
• The mean is the most common measure of the
location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also
commonly used.
Measures of Spread: Range and Variance
• Range is the difference between the max and min
• The variance or standard deviation is the most
common measure of the spread of a set of points.
• However, this is also sensitive to outliers, so that
other measures are often used.
Percentiles
x p
x p
Median – 50% of the
cases has a smaller value & 50%
are larger
Multivariate Summary Statistics
Object x
1
x
2
1 12 15
2 2 4
… … …
m 18 4
• Covariance between
features i and j
• Correlation
si is the variance of
feature i
sij=
1
m−1∑k=1
m
( xki− x̄i)(xkj− x̄ j)
r ij=
sij
si s j
Topics
• Exploratory Data Analysis
• Summary Statistics
• Visualization
Visualization
Visualization is the conversion of data into a visual
or tabular format so that the characteristics of the
data and the relationships among data items or
attributes can be analyzed or reported.
• Visualization of data is one of the most powerful and
appealing techniques for data exploration.
– Humans have a well developed ability to analyze
large amounts of information that is presented
visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns
Example: Sea Surface Temperature
• The following shows the Sea Surface Temperature
(SST) for July 1982
– Tens of thousands of data points are
summarized in a single figure
Representation
• Is the mapping of information to a visual format
• Data objects, their attributes, and the relationships
among data objects are translated into graphical
elements such as points, lines, shapes, and colors.
• Example:
– Objects are often represented as points
– Their attribute values can be represented as the
position of the points or the characteristics of the
points, e.g., color, size, and shape
– If position is used, then the relationships of points,
i.e., whether they form groups or a point is an
outlier, is easily perceived.
Arrangement
• Is the placement of visual elements within a
display
• Can make a large difference in how easy it is to
understand the data
• Example:
Selection
• Is the elimination or the deemphasis of certain objects and
attributes
• Selection may involve the choosing a subset of attributes
– Dimensionality reduction is often used to reduce the
number of dimensions to two or three
– Alternatively, pairs of attributes can be considered
• Selection may also involve choosing a subset of objects
– A region of the screen can only show so many points
– Can sample, but want to preserve points in sparse areas
Histograms
• Usually shows the distribution of values of a single variable
• Divide the values into bins and show a bar plot of the number of objects in each bin.
• The height of each bar indicates the number of objects
• Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
Empirical Cumulative Distribution
Function (ECDF)
• Probability Density
Function (
): describes
the relative likelihood for
this random variable to take
on a given value
• Cumulative Distribution
Function (CDF): Shows
the distribution of data as
the fraction of points that
are less than this value.
CDF
Example: ECDF
TwoDimensional Histograms
• Show the joint distribution of the values of two attributes
• Example: petal width and petal length
– What does this tell us?
Box Plots
• Invented by J. Tukey
• Another way of displaying the distribution of data
• Following figure shows the basic part of a box plot
outlier
25th percentile
– 1.5
IQR
25th percentile
75th percentile
50th percentile
75th percentile
+ 1.5 IQR
IQR
Q1 Q3
IQR
Median
Q3 + 1.5 × IQRQ1 − 1.5 × IQR
−0.6745 σ 0.6745 σ 2.698 σ−2.698 σ
50%24.65% 24.65%
68.27% 15.73%15.73%
−4 σ −3 σ −2 σ −1 σ 0 σ 1 σ 3 σ2 σ 4 σ
−4 σ −3 σ −2 σ −1 σ 0 σ 1 σ 3 σ2 σ 4 σ
−4 σ −3 σ −2 σ −1 σ 0 σ 1 σ 3 σ2 σ 4 σ
Example of Box Plots
• Box plots can be used to compare attributes
Scatter Plots
– Attributes values determine the position
– Twodimensional scatter plots most common,
but can have threedimensional scatter plots
Often additional attributes can be displayed by
using the size, shape, and color of the markers
that represent the objects
– It is useful to have arrays of scatter plots can
compactly summarize the relationships of
several pairs of attributes
• See example on the next slide
Scatter Plot Array of Iris Attributes
Contour Plots
– Useful when a continuous attribute is measured
on a spatial grid
– They partition the plane into regions of similar
values
– The contour lines that form the boundaries of
these regions connect points with equal values
– The most common example is contour maps of
elevation
– Can also display temperature, rainfall, air
pressure, etc.
An example for Sea Surface Temperature (SST) is
provided on the next slide
Contour Plot Example: SST Dec, 1998
Celsius
Matrix Plots
– Can plot a data matrix
– Can be useful when objects are sorted
according to class
– Typically, the attributes are normalized to
prevent one attribute from dominating the plot
– Plots of similarity or distance matrices can also
be useful for visualizing the relationships
between objects
Visualization of the Iris Data Matrix
standard
deviation
Deviation form feature mean
Visualization of the Iris Correlation Matrix
Parallel Coordinates
– Used to plot the attribute values of high
dimensional data
– Instead of using perpendicular axes, use a set of
parallel axes
– The attribute values of each object are plotted as a
point on each corresponding coordinate axis and
the points are connected by a line
– Thus, each object is represented as a line
– Often, the lines representing a distinct class of
objects group together, at least for some attributes
– Ordering of attributes is important in seeing such
groupings
Parallel Coordinates Plots for Iris Data
Reordered features
Other Visualization Techniques
Star Plots
– Similar approach to parallel coordinates, but axes radiate
from a central point
– The line connecting the values of an object is a polygon
Chernoff Faces
– Approach created by Herman Chernoff
– This approach associates each attribute with a characteristic
of a face
– The values of each attribute determine the appearance of the
corresponding facial characteristic
– Each object becomes a separate face
– Relies on human’s ability to distinguish faces
Star Plots for Iris Data
Setosa
Versicolor
Virginica
Chernoff Faces for Iris Data
Setosa
Versicolor
Virginica
OLAP
• OnLine Analytical Processing (OLAP) was
proposed by E. F. Codd, the father of the
relational database.
• Relational databases put data into tables, while
OLAP uses a multidimensional array
representation.
– Such representations of data previously existed in
statistics and other fields
• There are a number of data analysis and data
exploration operations that are easier with such
a data representation.
Creating a Multidimensional Array
• Two key steps in converting tabular data into a
multidimensional array.
– First, identify which attributes are to be the dimensions
and which attribute is to be the target attribute whose
values appear as entries in the multidimensional array.
• The attributes used as dimensions must have discrete values
• The target value is typically a count or continuous value, e.g.,
the cost of an item
• Can have no target variable at all except the count of objects
that have the same set of attribute values
– Second, find the value of each entry in the
multidimensional array by summing the values (of the
target attribute) or count of all objects that have the
attribute values corresponding to that entry.
Example: Iris data
• We show how the attributes, petal length, petal
width, and species type can be converted to a
multidimensional array
– First, we discretized the petal width and length to
have categorical values: low, medium, and high
– We get the following table – note the count
attribute
Example: Iris data (continued)
• Each unique tuple of petal width, petal length,
and species type identifies one element of the
array.
• This element is assigned the corresponding
count value.
• The figure illustrates
the result.
• All nonspecified
tuples are 0.
Example: Iris data (continued)
• Slices of the multidimensional array are shown
by the following crosstabulations
• What do these tables tell us?
OLAP Operations: Data Cube
• The key operation of a OLAP is the formation of a
data cube
• A data cube is a multidimensional representation
of data, together with all possible aggregates.
• By all possible aggregates, we mean the aggregates
that result by selecting a proper subset of the
dimensions and summing over all remaining
dimensions.
• For example, if we choose the species type
dimension of the Iris data and sum over all other
dimensions, the result will be a onedimensional
entry with three entries, each of which gives the
number of flowers of each type.
Data Cube Example
• Consider a data set that records the sales of
products at a number of company stores at
various dates.
• This data can be represented
as a 3 dimensional array
• There are 3 twodimensional
aggregates (3 choose 2 ),
3 onedimensional aggregates,
and 1 zerodimensional
aggregate (the overall total)
Data Cube Example (continued)
• The following figure table shows one of the two
dimensional aggregates, along with two of the
onedimensional aggregates, and the overall
total
OLAP Operations: Slicing and Dicing
• Slicing is selecting a group of cells from the
entire multidimensional array by specifying a
specific value for one or more dimensions.
• Dicing involves selecting a subset of cells by
specifying a range of attribute values.
– This is equivalent to defining a subarray from the
complete array.
• In practice, both operations can also be
accompanied by aggregation over some
dimensions.
OLAP Operations: Rollup and Drilldown
• Attribute values often have a hierarchical
structure.
– Each date is associated with a year, month, and week.
– A location is associated with a continent, country, state
(province, etc.), and city.
– Products can be divided into various categories, such as
clothing, electronics, and furniture.
• Note that these categories often nest and form a
tree or lattice
– A year contains months which contains day
– A country contains a state which contains a city
OLAP Operations: Rollup and Drilldown
• This hierarchical structure gives rise to the roll
up and drilldown operations.
– For sales data, we can aggregate (roll up) the sales
across all the dates in a month.
– Conversely, given a view of the data where the time
dimension is broken into months, we could split the
monthly sales totals (drill down) into daily sales
totals.
– Likewise, we can drill down or roll up on the
location or product ID attributes.