Project: Need it in 20 hours
You have been asked by management (manufacturing, healthcare, retail, financial, etc.) to create a demo using a data analytics or BI tool. It is your responsibility to download one of the tools and produce outputs with it. Focus your results on the data set you select.
Be sure to address at least one topic covered in Chapters 1-5 with the outputs. The paper should include the following header sections:
Introduction
History of Tool [Discuss the benefits and limitations]
Review of the Data [What are you reviewing?]
Exploring the Data with the tool
Classification: Basic Concepts and Decision Trees
Classification: Alternative Techniques
Summary of Results
References (minimum of 2; each must be cited at least once, of course).
Ensure you adhere to APA 7.
Types of Data Analytic Tools
https://www.octoparse.com/blog/top-30-big-data-tools-for-data-analysis/
Excel (with Solver), though it has limitations
RStudio
Tableau Public (free to use)
Microsoft Power BI
Search for others that offer trial options
Examples of Dataset
https://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-2016/#4b3e96f1b54d
Discussion: Need in 12 hours
This week, I want you to post the particular data analytics tool that you used for the midterm project. Provide the benefits and limitations of the tool.
Replies and Explanation: Need in 2 days
You need to make the initial post and also respond to two of your peers. You’ll also need to go back and do the midterm project and explain why no one told me it wasn’t available :o|
Please include the Data Analytic Vendor Name and URL. It will also be helpful to list the pricing and trial options.
Data Mining
Classification: Alternative Techniques
Lecture Notes for Chapter 5
Introduction to Data Mining
by
Tan, Steinbach, Kumar
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
Rule-Based Classifier
Classify records by using a collection of “if…then…” rules
Rule: (Condition) → y
where
Condition is a conjunction of attribute tests
y is the class label
LHS: rule antecedent or condition
RHS: rule consequent
Examples of classification rules:
(Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
(Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
Rule-based Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier
A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
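To make the mechanics concrete, here is a minimal Python sketch (not part of the original slides) of an ordered rule-based classifier built from R1-R5. The dictionary encoding of records and the attribute names are assumptions made for this illustration.

# Each rule is (condition dict, class label); conditions are conjunctions of attribute tests.
rules = [
    ({"gives_birth": "no",  "can_fly": "yes"},        "Birds"),       # R1
    ({"gives_birth": "no",  "lives_in_water": "yes"}, "Fishes"),      # R2
    ({"gives_birth": "yes", "blood_type": "warm"},    "Mammals"),     # R3
    ({"gives_birth": "no",  "can_fly": "no"},          "Reptiles"),    # R4
    ({"lives_in_water": "sometimes"},                  "Amphibians"),  # R5
]

def covers(condition, record):
    # A rule covers a record if every attribute test in its condition is satisfied.
    return all(record.get(attr) == value for attr, value in condition.items())

def classify(record, default="unknown"):
    # First-match (ordered rule set): return the class of the first rule that fires.
    for condition, label in rules:
        if covers(condition, record):
            return label
    return default

hawk = {"blood_type": "warm", "gives_birth": "no", "can_fly": "yes", "lives_in_water": "no"}
grizzly = {"blood_type": "warm", "gives_birth": "yes", "can_fly": "no", "lives_in_water": "no"}
print(classify(hawk))      # Birds (covered by R1)
print(classify(grizzly))   # Mammals (covered by R3)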
Rule Coverage and Accuracy
Coverage of a rule:
Fraction of records that satisfy the antecedent of a rule
Accuracy of a rule:
Fraction of records that satisfy both the antecedent and consequent of a rule
Example: (Status=Single) → No
Coverage = 40%, Accuracy = 50% (see the ten-record training table below)
Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
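As a minimal sketch (not from the slides), the coverage and accuracy of the rule (Status=Single) → No can be computed directly from the table above; the tuple layout below is an assumption made for the example.

# Coverage and accuracy of the rule (Marital Status = Single) -> Class = No.
records = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),   ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),  ("No", "Single", 90, "Yes"),
]

covered = [r for r in records if r[1] == "Single"]   # records satisfying the antecedent
correct = [r for r in covered if r[3] == "No"]       # records satisfying antecedent and consequent

coverage = len(covered) / len(records)   # 4/10 = 0.40
accuracy = len(correct) / len(covered)   # 2/4  = 0.50
print(coverage, accuracy)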
How does Rule-based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
Characteristics of Rule-Based Classifier
Mutually exclusive rules
Classifier contains mutually exclusive rules if the rules are independent of each other
Every record is covered by at most one rule
Exhaustive rules
Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
Each record is covered by at least one rule
From Decision Trees To Rules
Rules are mutually exclusive and exhaustive
Rule set contains as much information as the tree
Rules Can Be Simplified
Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No
(Same ten-record training table as above, with the class attribute labeled Cheat.)
Effect of Rule Simplification
Rules are no longer mutually exclusive
A record may trigger more than one rule
Solution?
Ordered rule set
Unordered rule set – use voting schemes
Rules are no longer exhaustive
A record may not trigger any rules
Solution?
Use a default class
Ordered Rule Set
Rules are rank ordered according to their priority
An ordered rule set is known as a decision list
When a test record is presented to the classifier
It is assigned to the class label of the highest ranked rule it has triggered
If none of the rules fired, it is assigned to the default class
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Rule Ordering Schemes
Rule-based ordering
Individual rules are ranked based on their quality
Class-based ordering
Rules that belong to the same class appear together
Rule-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Class-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
Building Classification Rules
Direct Method:
Extract rules directly from data
e.g.: RIPPER, CN2, Holte’s 1R
Indirect Method:
Extract rules from other classification models (e.g., decision trees, neural networks, etc.)
e.g.: C4.5rules
Direct Method: Sequential Covering
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat steps (2) and (3) until the stopping criterion is met (see the sketch below)
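A minimal Python sketch of the sequential covering control loop, assuming a learn_one_rule function that returns a rule and the indices of the training records it covers; the internals of rule growing are outside this sketch.

def sequential_covering(records, learn_one_rule, min_coverage=1):
    # records: the training set; learn_one_rule(records) -> (rule, covered_indices).
    rule_set = []
    remaining = list(records)                    # start with all training records
    while remaining:
        rule, covered = learn_one_rule(remaining)
        if len(covered) < min_coverage:          # stopping criterion: rule adds too little
            break
        rule_set.append(rule)                    # keep the rule
        covered = set(covered)
        remaining = [r for i, r in enumerate(remaining) if i not in covered]
    return rule_set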
Example of Sequential Covering
[Figure: sequential covering — (i) the original two-class data; in Steps 1-3, rules R1 and R2 are successively grown and the records they cover are removed]
Aspects of Sequential Covering
Rule Growing
Instance Elimination
Rule Evaluation
Stopping Criterion
Rule Pruning
Rule Growing
Two common strategies
[Figure: two rule-growing strategies — (a) general-to-specific, starting from the empty rule { } and adding conjuncts such as Refund=No, Status=Single, Status=Divorced, Status=Married, Income>80K, tracking the Yes/No counts each candidate covers; (b) specific-to-general, starting from specific rules such as (Refund=No, Status=Single, Income=85K) → Class=Yes and (Refund=No, Status=Single, Income=90K) → Class=Yes and generalizing them to (Refund=No, Status=Single) → Class=Yes]
Rule Growing (Examples)
CN2 Algorithm:
Start from an empty conjunct: {}
Add the conjunct that minimizes the entropy measure: {A}, {A,B}, …
Determine the rule consequent by taking the majority class of instances covered by the rule
RIPPER Algorithm:
Start from an empty rule: {} => class
Add the conjunct that maximizes FOIL's information gain measure:
R0: {} => class (initial rule)
R1: {A} => class (rule after adding conjunct)
Gain(R0, R1) = t × [ log (p1/(p1+n1)) − log (p0/(p0+n0)) ]
where t: number of positive instances covered by both R0 and R1
p0: number of positive instances covered by R0
n0: number of negative instances covered by R0
p1: number of positive instances covered by R1
n1: number of negative instances covered by R1
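A small Python sketch of FOIL's information gain, using base-2 logs and taking t = p1 (in the usual specialization step, the positives covered by R1 are a subset of those covered by R0); both choices are assumptions made for this illustration.

import math

def foil_gain(p0, n0, p1, n1):
    # Gain when extending rule R0 (covers p0 pos, n0 neg) to rule R1 (covers p1 pos, n1 neg).
    t = p1   # positives covered by both rules, assuming R1 specializes R0
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Example: R0 covers 10 pos / 10 neg; adding a conjunct leaves 6 pos / 1 neg.
print(foil_gain(10, 10, 6, 1))   # positive gain -> the conjunct is worth adding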
Instance Elimination
Why do we need to eliminate instances?
Otherwise, the next rule is identical to previous rule
Why do we remove positive instances?
Ensure that the next rule is different
Why do we remove negative instances?
Prevent underestimating accuracy of rule
Compare rules R2 and R3 in the diagram
[Figure: two-dimensional training data with positive (+) and negative (-) instances and three rules R1, R2, R3; R1 covers mostly positive instances, while R2 and R3 overlap the region already covered by R1, illustrating why covered instances must be removed]
Rule Evaluation
Metrics:
Accuracy = nc / n
Laplace = (nc + 1) / (n + k)
M-estimate = (nc + k·p) / (n + k)
n : number of instances covered by the rule
nc : number of instances of the rule's class covered by the rule
k : number of classes
p : prior probability of the rule's class
Stopping Criterion and Rule Pruning
Stopping criterion
Compute the gain
If gain is not significant, discard the new rule
Rule Pruning
Similar to post-pruning of decision trees
Reduced Error Pruning:
Remove one of the conjuncts in the rule
Compare error rate on validation set before and after pruning
If error improves, prune the conjunct
Summary of Direct Method
Grow a single rule
Remove Instances from rule
Prune the rule (if necessary)
Add rule to Current Rule Set
Repeat
Direct Method: RIPPER
For 2-class problem, choose one of the classes as positive class, and the other as negative class
Learn rules for positive class
Negative class will be default class
For multi-class problem
Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
Learn the rule set for smallest class first, treat the rest as negative class
Repeat with next smallest class as positive class
Direct Method: RIPPER
Growing a rule:
Start from empty rule
Add conjuncts as long as they improve FOIL’s information gain
Stop when rule no longer covers negative examples
Prune the rule immediately using incremental reduced error pruning
Measure for pruning: v = (p-n)/(p+n)
p: number of positive examples covered by the rule in the validation set
n: number of negative examples covered by the rule in the validation set
Pruning method: delete any final sequence of conditions that maximizes v
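A greedy Python sketch of this pruning step (an assumption-laden simplification, not RIPPER's exact implementation): conjuncts are dropped from the end of the rule while the validation-set metric v does not get worse. validation_counts is assumed to return the (p, n) counts for a given conjunct list.

def prune_metric(p, n):
    # v = (p - n) / (p + n), measured on the validation set.
    return (p - n) / (p + n)

def prune_rule(conjuncts, validation_counts):
    # validation_counts(conjuncts) -> (p, n) for the rule made of these conjuncts.
    best = list(conjuncts)
    best_v = prune_metric(*validation_counts(best))
    while len(best) > 1:
        candidate = best[:-1]                         # drop the final condition
        v = prune_metric(*validation_counts(candidate))
        if v >= best_v:                               # keep pruning while v does not worsen
            best, best_v = candidate, v
        else:
            break
    return best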
Direct Method: RIPPER
Building a Rule Set:
Use sequential covering algorithm
Finds the best rule that covers the current set of positive examples
Eliminate both positive and negative examples covered by the rule
Each time a rule is added to the rule set, compute the new description length
stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far
Direct Method: RIPPER
Optimize the rule set:
For each rule r in the rule set R
Consider 2 alternative rules:
Replacement rule (r*): grow new rule from scratch
Revised rule (r'): add conjuncts to extend the rule r
Compare the rule set for r against the rule sets for r* and r'
Choose the rule set that minimizes the description length (MDL principle)
Repeat rule generation and rule optimization for the remaining positive examples
Indirect Methods
[Figure: decision tree on attributes P, Q, and R from which the following rule set is extracted]
Rule Set
r1: (P=No, Q=No) ==> –
r2: (P=No, Q=Yes) ==> +
r3: (P=Yes, R=No) ==> +
r4: (P=Yes, R=Yes, Q=No) ==> –
r5: (P=Yes, R=Yes, Q=Yes) ==> +
Indirect Method: C4.5rules
Extract rules from an unpruned decision tree
For each rule r: A → y,
consider an alternative rule r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A
Compare the pessimistic error rate for r against all r’s
Prune if one of the r’s has lower pessimistic error rate
Repeat until we can no longer improve generalization error
Indirect Method: C4.5rules
Instead of ordering the rules, order subsets of rules (class ordering)
Each subset is a collection of rules with the same rule consequent (class)
Compute description length of each subset
Description length = L(error) + g L(model)
g is a parameter that takes into account the presence of redundant attributes in a rule set
(default value = 0.5)
Example
Name Give Birth Lay Eggs Can Fly Live in Water Have Legs Class
human yes no no no yes mammals
python no yes no no no reptiles
salmon no yes no yes no fishes
whale yes no no yes no mammals
frog no yes no sometimes yes amphibians
komodo no yes no no yes reptiles
bat yes no yes no yes mammals
pigeon no yes yes no yes birds
cat yes no no no yes mammals
leopard shark yes no no yes no fishes
turtle no yes no sometimes yes reptiles
penguin no yes no sometimes yes birds
porcupine yes no no no yes mammals
eel no yes no yes no fishes
salamander no yes no sometimes yes amphibians
gila monster no yes no no yes reptiles
platypus no yes no no yes mammals
owl no yes yes no yes birds
dolphin yes no no yes no mammals
eagle no yes yes no yes birds
C4.5 versus C4.5rules versus RIPPER
C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians
RIPPER:
(Live in Water=Yes) → Fishes
(Have Legs=No) → Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No) → Reptiles
(Can Fly=Yes, Give Birth=No) → Birds
( ) → Mammals
C4.5 versus C4.5rules versus RIPPER
C4.5 and C4.5rules (confusion matrix; predicted class in columns, actual class in rows):
             Amphibians  Fishes  Reptiles  Birds  Mammals
Amphibians        2         0        0       0       0
Fishes            0         2        0       0       1
Reptiles          1         0        3       0       0
Birds             1         0        0       3       0
Mammals           0         0        1       0       6
RIPPER (confusion matrix; predicted class in columns, actual class in rows):
             Amphibians  Fishes  Reptiles  Birds  Mammals
Amphibians        0         0        0       0       2
Fishes            0         3        0       0       0
Reptiles          0         0        3       0       1
Birds             0         0        1       2       1
Mammals           0         2        1       0       4
Advantages of Rule-Based Classifiers
As highly expressive as decision trees
Easy to interpret
Easy to generate
Can classify new instances rapidly
Performance comparable to decision trees
Instance-Based Classifiers
Store the training records
Use training records to predict the class label of unseen cases
Instance Based Classifiers
Examples:
Rote-learner
Memorizes entire training data and performs classification only if attributes of record match one of the training examples exactly
Nearest neighbor
Uses k “closest” points (nearest neighbors) for performing classification
Nearest Neighbor Classifiers
Basic idea:
If it walks like a duck, quacks like a duck, then it’s probably a duck
[Figure: nearest-neighbor classification — compute the distance between the test record and the stored training records, then choose the k "nearest" records]
Nearest-Neighbor Classifiers
Requires three things
The set of stored records
Distance Metric to compute distance between records
The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
Compute distance to other training records
Identify k nearest neighbors
Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
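A minimal Python sketch of these three steps (Euclidean distance, k nearest neighbors, majority vote); the toy training data are invented for the example.

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, test_point, k=3):
    # train: list of (feature_vector, class_label) pairs.
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)   # majority vote among the k nearest
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((3.0, 3.5), "-"), ((3.2, 3.0), "-")]
print(knn_classify(train, (1.1, 1.0), k=3))   # "+"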
[Figure: an unknown record and its neighborhood among the training records in a two-dimensional attribute space]
Definition of Nearest Neighbor
K-nearest neighbors of a record x are data points that have the k smallest distance to x
1-nearest neighbor: the decision regions form the Voronoi diagram of the training records
Nearest Neighbor Classification
Compute distance between two points:
Euclidean distance: d(p, q) = sqrt( Σi (pi − qi)² )
Determine the class from the nearest-neighbor list
Take the majority vote of class labels among the k nearest neighbors
Weigh the vote according to distance, e.g. weight factor w = 1/d²
Nearest Neighbor Classification…
Choosing the value of k:
If k is too small, sensitive to noise points
If k is too large, neighborhood may include points from other classes
Nearest Neighbor Classification…
Scaling issues
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
Example:
height of a person may vary from 1.5m to 1.8m
weight of a person may vary from 90lb to 300lb
income of a person may vary from $10K to $1M
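A short Python sketch of one common fix, min-max scaling each attribute to [0, 1]; the example values follow the ranges quoted above, and the scaling choice (rather than z-scores) is an assumption for the illustration.

def min_max_scale(values):
    # Rescale one attribute to [0, 1] so that attributes on large scales (e.g. income
    # in dollars) do not dominate the Euclidean distance.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights = [1.5, 1.6, 1.7, 1.8]                      # metres
incomes = [10_000, 250_000, 600_000, 1_000_000]     # dollars
print(min_max_scale(heights))
print(min_max_scale(incomes))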
Nearest Neighbor Classification…
Problem with Euclidean measure:
High dimensional data
curse of dimensionality
Can produce counter-intuitive results
Example (12-dimensional binary vectors):
1 1 1 1 1 1 1 1 1 1 1 0   vs   0 1 1 1 1 1 1 1 1 1 1 1   →  d = 1.4142
1 0 0 0 0 0 0 0 0 0 0 0   vs   0 0 0 0 0 0 0 0 0 0 0 1   →  d = 1.4142
The two pairs are equally far apart, even though the first pair is nearly identical and the second pair has almost nothing in common.
Solution: Normalize the vectors to unit length
Nearest neighbor Classification…
k-NN classifiers are lazy learners
It does not build models explicitly
Unlike eager learners such as decision tree induction and rule-based systems
Classifying unknown records are relatively expensive
Example: PEBLS
PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
Works with both continuous and nominal features
For nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM)
Each record is assigned a weight factor
Number of nearest neighbors: k = 1
Example: PEBLS
Distance between nominal attribute values:
d(Single,Married)
= | 2/4 – 0/4 | + | 2/4 – 4/4 | = 1
d(Single,Divorced)
= | 2/4 – 1/2 | + | 2/4 – 1/2 | = 0
d(Married,Divorced)
= | 0/4 – 1/2 | + | 4/4 – 1/2 | = 1
d(Refund=Yes,Refund=No)
= | 0/3 – 3/7 | + | 3/3 – 4/7 | = 6/7
Class   Marital Status
        Single  Married  Divorced
Yes     2       0        1
No      2       4        1

Class   Refund
        Yes     No
Yes     0       3
No      3       4
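A small Python sketch of the MVDM computation from class-count tables like the ones above; the dictionary layout is an assumption made for the example.

def mvdm(counts, v1, v2):
    # Modified value difference metric between two values of a nominal attribute.
    # counts[v] maps class label -> number of records with attribute value v.
    classes = set(counts[v1]) | set(counts[v2])
    n1 = sum(counts[v1].values())
    n2 = sum(counts[v2].values())
    return sum(abs(counts[v1].get(c, 0) / n1 - counts[v2].get(c, 0) / n2) for c in classes)

marital = {
    "Single":   {"Yes": 2, "No": 2},
    "Married":  {"Yes": 0, "No": 4},
    "Divorced": {"Yes": 1, "No": 1},
}
print(mvdm(marital, "Single", "Married"))    # |2/4 - 0/4| + |2/4 - 4/4| = 1.0
print(mvdm(marital, "Single", "Divorced"))   # |2/4 - 1/2| + |2/4 - 1/2| = 0.0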
(The counts above are taken from the ten-record training table shown earlier.)
Example: PEBLS
Distance between record X and record Y:
Δ(X, Y) = wX · wY · Σ_{i=1}^{d} d(Xi, Yi)²
where wX = (number of times X is used for prediction) / (number of times X predicts correctly)
wX ≈ 1 if X makes accurate predictions most of the time
wX > 1 if X is not reliable for making predictions
Tid  Refund  Marital Status  Taxable Income  Cheat
X    Yes     Single          125K            No
Y    No      Married         100K            No
Bayes Classifier
A probabilistic framework for solving classification problems
Conditional Probability: P(C | A) = P(A, C) / P(A),  P(A | C) = P(A, C) / P(C)
Bayes theorem: P(C | A) = P(A | C) P(C) / P(A)
Example of Bayes Theorem
Given:
A doctor knows that meningitis causes stiff neck 50% of the time
Prior probability of any patient having meningitis is 1/50,000
Prior probability of any patient having stiff neck is 1/20
If a patient has a stiff neck, what is the probability that he/she has meningitis?
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
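The same calculation as a two-line Python check:

p_s_given_m = 0.5        # P(stiff neck | meningitis)
p_m = 1 / 50_000         # prior P(meningitis)
p_s = 1 / 20             # prior P(stiff neck)
print(p_s_given_m * p_m / p_s)   # 0.0002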
Bayesian Classifiers
Consider each attribute and class label as random variables
Given a record with attributes (A1, A2,…,An)
Goal is to predict class C
Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An )
Can we estimate P(C| A1, A2,…,An ) directly from data?
Bayesian Classifiers
Approach:
compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem
Choose value of C that maximizes
P(C | A1, A2, …, An)
Equivalent to choosing value of C that maximizes
P(A1, A2, …, An|C) P(C)
How to estimate P(A1, A2, …, An | C )?
Naïve Bayes Classifier
Assume independence among attributes Ai when class is given:
P(A1, A2, …, An |C) = P(A1| Cj) P(A2| Cj)… P(An| Cj)
Can estimate P(Ai| Cj) for all Ai and Cj.
New point is classified to Cj if P(Cj) Πi P(Ai | Cj) is maximal (see the sketch below).
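A minimal Python sketch of naive Bayes for categorical attributes, using raw maximum-likelihood counts (no smoothing; see the Laplace/m-estimate corrections later). The data structures are assumptions made for the example.

from collections import defaultdict

def train_nb(records):
    # records: list of (attribute_dict, class_label).
    class_counts = defaultdict(int)
    value_counts = defaultdict(int)            # (class, attribute, value) -> count
    for attrs, label in records:
        class_counts[label] += 1
        for a, v in attrs.items():
            value_counts[(label, a, v)] += 1
    return class_counts, value_counts

def predict_nb(class_counts, value_counts, attrs):
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, nc in class_counts.items():
        score = nc / n                                          # P(C)
        for a, v in attrs.items():
            score *= value_counts[(c, a, v)] / nc               # P(Ai | C), assumed independent
        if score > best_score:
            best, best_score = c, score
    return best

Note that a single zero count drives the whole product to zero, which is exactly the problem the smoothing formulas below are meant to fix.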
How to Estimate Probabilities from Data?
Class: P(C) = Nc/N
e.g., P(No) = 7/10,
P(Yes) = 3/10
For discrete attributes:
P(Ai | Ck) = |Aik|/ Nc
where |Aik| is number of instances having attribute Ai and belongs to class Ck
Examples:
P(Status=Married|No) = 4/7
P(Refund=Yes|Yes)=0
How to Estimate Probabilities from Data?
For continuous attributes:
Discretize the range into bins
one ordinal attribute per bin
violates independence assumption
Two-way split: (A < v) or (A > v)
choose only one of the two splits as new attribute
Probability density estimation:
Assume attribute follows a normal distribution
Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
Once probability distribution is known, can use it to estimate the conditional probability P(Ai|c)
How to Estimate Probabilities from Data?
Normal distribution:
P(Ai | cj) = (1 / sqrt(2π σij²)) · exp( −(Ai − μij)² / (2 σij²) )
One distribution for each (Ai, cj) pair
For (Income, Class=No):
If Class=No: sample mean = 110, sample variance = 2975
P(Income=120 | No) = (1 / sqrt(2π × 2975)) · exp( −(120 − 110)² / (2 × 2975) ) = 0.0072
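The same density evaluation as a short Python check:

import math

def gaussian_density(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# For (Income, Class=No): sample mean 110, sample variance 2975 (income in K).
print(gaussian_density(120, 110, 2975))   # ~0.0072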
Example of Naïve Bayes Classifier
Given a test record: X = (Refund=No, Marital Status=Married, Taxable Income=120K)
P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No)
              = 4/7 × 4/7 × 0.0072 = 0.0024
P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes)
               = 1 × 0 × 1.2×10⁻⁹ = 0
Since P(X|No)P(No) > P(X|Yes)P(Yes), it follows that P(No|X) > P(Yes|X)
=> Class = No
Naïve Bayes Classifier
If one of the conditional probabilities is zero, then the entire expression becomes zero
Probability estimation:
Original:   P(Ai | C) = Nic / Nc
Laplace:    P(Ai | C) = (Nic + 1) / (Nc + c)
m-estimate: P(Ai | C) = (Nic + m·p) / (Nc + m)
c: number of classes
p: prior probability
m: parameter
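A tiny Python sketch of the two corrections; the example values of p and m are assumptions chosen only for illustration.

def laplace(n_ic, n_c, num_classes):
    # Laplace correction: (N_ic + 1) / (N_c + c)
    return (n_ic + 1) / (n_c + num_classes)

def m_estimate(n_ic, n_c, p, m):
    # m-estimate: (N_ic + m*p) / (N_c + m), with prior p and pseudo-count m.
    return (n_ic + m * p) / (n_c + m)

# P(Refund=Yes | Evade=Yes) is 0/3 without smoothing; the corrections keep it non-zero.
print(laplace(0, 3, 2))           # 0.2
print(m_estimate(0, 3, 0.5, 2))   # 0.2  (assumed p=0.5, m=2)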
Example of Naïve Bayes Classifier
A: attributes of the test record (Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no)
M: mammals, N: non-mammals
P(A|M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A|N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A|M)P(M) = 0.06 × 7/20 = 0.021
P(A|N)P(N) = 0.0042 × 13/20 = 0.0027
Since P(A|M)P(M) > P(A|N)P(N)
=> Mammals
Name Give Birth Lay Eggs Can Fly Live in Water Have Legs Class
human yes no no no yes mammals
python no yes no no no reptiles
salmon no yes no yes no fishes
whale yes no no yes no mammals
frog no yes no sometimes yes amphibians
komodo no yes no no yes reptiles
bat yes no yes no yes mammals
pigeon no yes yes no yes birds
cat yes no no no yes mammals
leopard shark yes no no yes no fishes
turtle no yes no sometimes yes reptiles
penguin no yes no sometimes yes birds
porcupine yes no no no yes mammals
eel no yes no yes no fishes
salamander no yes no sometimes yes amphibians
gila monster no yes no no yes reptiles
platypus no yes no no yes mammals
owl no yes yes no yes birds
dolphin yes no no yes no mammals
eagle no yes yes no yes birds
Name Give Birth Lay Eggs Can Fly Live in Water Have Legs Class
human yes no no no yes mammals
python no yes no no no non-mammals
salmon no yes no yes no non-mammals
whale yes no no yes no mammals
frog no yes no sometimes yes non-mammals
komodo no yes no no yes non-mammals
bat yes no yes no yes mammals
pigeon no yes yes no yes non-mammals
cat yes no no no yes mammals
leopard shark yes no no yes no non-mammals
turtle no yes no sometimes yes non-mammals
penguin no yes no sometimes yes non-mammals
porcupine yes no no no yes mammals
eel no yes no yes no non-mammals
salamander no yes no sometimes yes non-mammals
gila monster no yes no no yes non-mammals
platypus no yes no no yes mammals
owl no yes yes no yes non-mammals
dolphin yes no no yes no mammals
eagle no yes yes no yes non-mammals
Name Give Birth Lay Eggs Can Fly Live in Water Have Legs Class
human yes no no yes no ?
Naïve Bayes (Summary)
Robust to isolated noise points
Handle missing values by ignoring the instance during probability estimate calculations
Robust to irrelevant attributes
Independence assumption may not hold for some attributes
Use other techniques such as Bayesian Belief Networks (BBN)
Artificial Neural Networks (ANN)
Output Y is 1 if at least two of the three inputs are equal to 1.
[Figure: black-box model with inputs X1, X2, X3 and output Y]
X1 X2 X3 | Y
 1  0  0 | 0
 1  0  1 | 1
 1  1  0 | 1
 1  1  1 | 1
 0  0  1 | 0
 0  1  0 | 0
 0  1  1 | 1
 0  0  0 | 0
Artificial Neural Networks (ANN)
[Figure: perceptron for the same function — inputs X1, X2, X3 with weights 0.3 each feeding a summing output node with threshold t = 0.4]
Artificial Neural Networks (ANN)
Model is an assembly of inter-connected nodes and weighted links
Output node sums up its input values according to the weights of its links
Compare the output node's sum against some threshold t
Perceptron Model:
Y = I( Σi wi Xi − t > 0 ),  or equivalently  Y = sign( Σi wi Xi − t )
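A minimal Python sketch of the perceptron output for the example network above (weights 0.3, threshold 0.4, which implements "output 1 when at least two of the three inputs are 1"):

def perceptron_output(inputs, weights, t):
    # Y = 1 if the weighted sum exceeds the threshold t, else 0.
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s - t > 0 else 0

for x in [(1, 0, 0), (1, 1, 0), (1, 1, 1), (0, 1, 1), (0, 0, 0)]:
    print(x, perceptron_output(x, (0.3, 0.3, 0.3), 0.4))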
General Structure of ANN
Training ANN means learning the weights of the neurons
[Figure: multilayer ANN with input layer (x1–x5), hidden layer, and output layer (y); each neuron i forms the weighted sum Si of its inputs Ij via weights wij and applies an activation function g(Si) to produce its output Oi]
Algorithm for learning ANN
Initialize the weights (w0, w1, …, wk)
Adjust the weights in such a way that the output of ANN is consistent with class labels of training examples
Objective function: E = Σi [ Yi − f(w, Xi) ]²
Find the weights wi that minimize the above objective function
e.g., backpropagation algorithm (see lecture notes)
Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines
One Possible Solution
Support Vector Machines
Another possible solution
Support Vector Machines
Other possible solutions
Support Vector Machines
Which one is better? B1 or B2?
How do you define better?
Support Vector Machines
Find hyperplane maximizes the margin => B1 is better than B2
[Figure: decision boundaries B1 and B2 with their respective margin hyperplanes (b11, b12) and (b21, b22); B1's margin is wider]
Support Vector Machines
We want to maximize:  Margin = 2 / ‖w‖²
Which is equivalent to minimizing:  L(w) = ‖w‖² / 2
But subject to the following constraints:
f(xi) = 1 if w·xi + b ≥ 1, and f(xi) = −1 if w·xi + b ≤ −1
This is a constrained optimization problem
Numerical approaches to solve it (e.g., quadratic programming)
Support Vector Machines
What if the problem is not linearly separable?
Support Vector Machines
What if the problem is not linearly separable?
Introduce slack variables ξi
Need to minimize:  L(w) = ‖w‖²/2 + C ( Σi ξi )^k
Subject to:
f(xi) = 1 if w·xi + b ≥ 1 − ξi, and f(xi) = −1 if w·xi + b ≤ −1 + ξi
Nonlinear Support Vector Machines
What if decision boundary is not linear?
Nonlinear Support Vector Machines
Transform data into higher dimensional space
Ensemble Methods
Construct a set of classifiers from the training data
Predict class label of previously unseen records by aggregating predictions made by multiple classifiers
General Idea
[Figure: Step 1 — create multiple data sets D1 … Dt from the original training data D; Step 2 — build a classifier Ci on each data set; Step 3 — combine the classifiers into an ensemble C*]
Why does it work?
Suppose there are 25 base classifiers
Each classifier has error rate ε = 0.35
Assume the classifiers are independent
Probability that the ensemble (majority-vote) classifier makes a wrong prediction:
Σ_{i=13}^{25} C(25, i) ε^i (1 − ε)^{25−i} = 0.06
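The same binomial tail as a short Python check:

from math import comb

eps = 0.35   # error rate of each of the 25 independent base classifiers
p_wrong = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i) for i in range(13, 26))
print(round(p_wrong, 2))   # ~0.06: the majority vote is wrong only if at least 13 classifiers err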
Examples of Ensemble Methods
How to generate an ensemble of classifiers?
Bagging
Boosting
Bagging
Sampling with replacement (each bootstrap sample has the same size as the original data)
Build a classifier on each bootstrap sample
A given record is left out of a bootstrap sample with probability (1 − 1/n)^n, i.e., it is selected with probability 1 − (1 − 1/n)^n ≈ 0.632 for large n
Boosting
An iterative procedure to adaptively change distribution of training data by focusing more on previously misclassified records
Initially, all N records are assigned equal weights
Unlike bagging, weights may change at the end of each boosting round
Boosting
Records that are wrongly classified will have their weights increased
Records that are classified correctly will have their weights decreased
Example 4 is hard to classify
Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds
Example: AdaBoost
Base classifiers: C1, C2, …, CT
Error rate of classifier Ci:  εi = (1/N) Σj wj δ( Ci(xj) ≠ yj )
Importance of a classifier:  αi = ½ ln( (1 − εi) / εi )
Example: AdaBoost
Weight update:  wi^(j+1) = (wi^(j) / Zj) × exp(−αj) if Cj(xi) = yi, and (wi^(j) / Zj) × exp(+αj) if Cj(xi) ≠ yi, where Zj is a normalization factor
If any intermediate round produces an error rate higher than 50%, the weights are reverted to 1/N and the resampling procedure is repeated
Classification:  C*(x) = arg max_y Σj αj δ( Cj(x) = y )
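A minimal Python sketch of one boosting round using the update above; the example predictions and labels are invented for the illustration (they are not the slide's data set).

import math

def adaboost_round(weights, predictions, labels):
    # One AdaBoost round: given current record weights (summing to 1), the base
    # classifier's predictions and the true labels, return (alpha, updated weights).
    eps = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    alpha = 0.5 * math.log((1 - eps) / eps)
    new_w = [w * math.exp(-alpha if p == y else alpha)
             for w, p, y in zip(weights, predictions, labels)]
    z = sum(new_w)                                  # normalization factor Z
    return alpha, [w / z for w in new_w]

w = [0.1] * 10
preds  = ['+', '+', '+', '-', '-', '-', '-', '-', '-', '-']   # base classifier output
labels = ['+', '+', '+', '-', '-', '-', '-', '-', '+', '+']   # true labels (2 mistakes)
alpha, w = adaboost_round(w, preds, labels)
print(round(alpha, 3), [round(x, 3) for x in w])   # misclassified records gain weight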
Illustrating AdaBoost
Data points for training
Initial weights for each data point
[Figure: AdaBoost on a one-dimensional example — each record starts with weight 0.1; boosting rounds 1–3 fit decision stumps B1, B2, B3 with α = 1.9459, 2.9323, and 3.8744; record weights are redistributed after each round (e.g., 0.0094 for correctly classified records vs 0.4623 for misclassified ones after round 1), and the weighted combination of the three stumps classifies all ten records correctly]
[Figure: decision tree splitting on Refund, Marital Status ({Married} vs {Single, Divorced}) and Taxable Income (< 80K vs ≥ 80K), whose leaves yield the four classification rules listed earlier, together with the ten-record training table shown above]
[Figure: decision tree induced on the vertebrate data, splitting on Give Birth?, Live in Water?, and Can Fly? to reach the classes Mammals, Fishes, Amphibians, Birds, and Reptiles]
Test records for the rule-based classifier:
Name            Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur           warm        yes         no       no             ?
turtle          cold        no          no       sometimes      ?
dogfish shark   cold        yes         no       yes            ?
Training data for the rule-based classifier example:
Name            Blood Type  Give Birth  Can Fly  Live in Water  Class
human           warm        yes         no       no             mammals
python          cold        no          no       no             reptiles
salmon          cold        no          no       yes            fishes
whale           warm        yes         no       yes            mammals
frog            cold        no          no       sometimes      amphibians
komodo          cold        no          no       no             reptiles
bat             warm        yes         yes      no             mammals
pigeon          warm        no          yes      no             birds
cat             warm        yes         no       no             mammals
leopard shark   cold        yes         no       yes            fishes
turtle          cold        no          no       sometimes      reptiles
penguin         warm        no          no       sometimes      birds
porcupine       warm        yes         no       no             mammals
eel             cold        no          no       yes            fishes
salamander      cold        no          no       sometimes      amphibians
gila monster    cold        no          no       no             reptiles
platypus        warm        no          no       no             mammals
owl             warm        no          yes      no             birds
dolphin         warm        yes         no       yes            mammals
eagle           warm        no          yes      no             birds
Test records:
hawk            warm        no          yes      no             ?
grizzly bear    warm        yes         no       no             ?
[Figure: (a) 1-nearest, (b) 2-nearest, and (c) 3-nearest neighbors of an unknown record X]
Conditional probabilities for the naive Bayes example (from the ten-record training table):
P(Refund=Yes|No) = 3/7,  P(Refund=No|No) = 4/7
P(Refund=Yes|Yes) = 0,   P(Refund=No|Yes) = 1
P(Marital Status=Single|No) = 2/7,  P(Marital Status=Divorced|No) = 1/7,  P(Marital Status=Married|No) = 4/7
P(Marital Status=Single|Yes) = 2/3,  P(Marital Status=Divorced|Yes) = 1/3,  P(Marital Status=Married|Yes) = 0
For taxable income:
If class=No: sample mean = 110, sample variance = 2975
If class=Yes: sample mean = 90, sample variance = 25
Test record: X = (Refund=No, Marital Status=Married, Taxable Income=120K)
Example of the records (by index) drawn in each bagging and boosting round:
Original data:       1  2  3  4  5  6  7  8  9  10
Bagging (Round 1):   7  8  10 8  2  5  10 10 5  9
Bagging (Round 2):   1  4  9  1  2  3  2  7  3  2
Bagging (Round 3):   1  8  5  10 5  5  9  6  3  7
Boosting (Round 1):  7  3  2  8  7  9  4  10 6  3
Boosting (Round 2):  5  4  9  4  2  5  1  7  4  2
Boosting (Round 3):  4  4  8  10 4  5  4  6  3  4
Data Mining: Introduction
Lecture Notes for Chapter 1
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
Provide better, customized services for an edge (e.g., in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
Remote sensors on a satellite
Telescopes scanning the skies
Microarrays generating gene expression data
Scientific simulations generating terabytes of data
Traditional techniques are infeasible for raw data
Data mining may help scientists
in classifying and segmenting data
in hypothesis formation
Mining Large Data Sets – Motivation
There is often information "hidden" in the data that is not readily evident
Human analysts may take weeks to discover useful information
Much of the data is never analyzed at all
The Data Gap
Total new disk (TB) since 1995
Number of analysts
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
[Chart data: total new disk storage shipped grew from roughly 105 PB in 1995 to roughly 13,000 PB in 2003, while the number of science and engineering Ph.D.s (a proxy for the number of analysts) stayed roughly constant at about 22,000–27,000 per year — the widening "data gap"]
What is Data Mining?
Many Definitions
Non-trivial extraction of implicit, previously unknown and potentially useful information from data
Exploration & analysis, by automatic or
semi-automatic means, of
large quantities of data
in order to discover
meaningful patterns
What is (not) Data Mining?
What is Data Mining?
Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly… in the Boston area)
Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest vs. Amazon.com)
What is not Data Mining?
Look up phone number in phone directory
Query a Web search engine for information about “Amazon”
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional techniques may be unsuitable due to
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
Origins of Data Mining
[Figure: data mining at the intersection of machine learning/pattern recognition, statistics/AI, and database systems]
Data Mining Tasks
Prediction Methods
Use some variables to predict unknown or future values of other variables.
Description Methods
Find human-interpretable patterns that describe the data.
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks…
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes, one of the attributes is the class.
Find a model for class attribute as a function of the values of other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
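A minimal Python sketch of the training/test protocol described above, using the tax-evasion toy records from the lecture notes and a trivial "majority class" stand-in for a real classifier (the split and the stand-in model are assumptions made for the example).

import random

# Toy records: (refund, marital_status, income_in_K, cheat)
records = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),   ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),  ("No", "Single", 90, "Yes"),
]

random.seed(0)
random.shuffle(records)
train, test = records[:7], records[7:]          # hold out part of the data as a test set

# Stand-in "model": predict the majority class seen in the training set.
majority = max({r[3] for r in train}, key=lambda c: sum(r[3] == c for r in train))
accuracy = sum(r[3] == majority for r in test) / len(test)
print(majority, accuracy)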
Classification Example
[Figure: a training set with categorical, categorical, continuous attributes and a class label is used to learn a classifier model, which is then applied to a test set]
Training set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Test set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction related information about all such customers.
Type of business, where they stay, how much they earn, etc.
Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on its account-holder as attributes.
When does a customer buy, what does he buy, how often he pays on time, etc
Label past transactions as fraud or fair transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.
Classification: Application 3
Customer Attrition/Churn:
Goal: To predict whether a customer is likely to be lost to a competitor.
Approach:
Use detailed record of transactions with each of the past and present customers, to find attributes.
How often the customer calls, where he calls, what time-of-the day he calls most, his financial status, marital status, etc.
Label the customers as loyal or disloyal.
Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 4
Sky Survey Cataloging
Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
3000 images with 23,040 x 23,040 pixels per image.
Approach:
Segment the image.
Measure image attributes (features) – 40 of them per object.
Model the class based on these features.
Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies
Early
Intermediate
Late
Data Size:
72 million stars, 20 million galaxies
Object Catalog: 9 GB
Image Database: 150 GB
Class:
Stages of Formation
Attributes:
Image features,
Characteristics of light waves received, etc.
Courtesy: http://aps.umn.edu
Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
Illustrating Clustering
Euclidean distance based clustering in 3-D space:
Intracluster distances are minimized; intercluster distances are maximized
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
Clustering: Application 2
Document Clustering:
Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
Illustrating Document Clustering
Clustering Points: 3204 Articles of Los Angeles Times.
Similarity Measure: How many words are common in these documents (after some word filtering).
Category        Total Articles  Correctly Placed
Financial       555             364
Foreign         341             260
National        273             36
Metro           943             746
Sports          738             573
Entertainment   354             278
Clustering of S&P 500 Stock Data
Observe Stock Movements every day.
Clustering points: Stock-{UP/DOWN}
Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
We used association rules to quantify a similarity measure.
Discovered Clusters
Industry Group
1
Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN,
Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,
DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN,
Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,
Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN,
Sun-DOWN
Technology1-DOWN
2
Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN,
ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,
Computer-Assoc-DOWN,Circuit-City-DOWN,
Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,
Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN
Technology2-DOWN
3
Fannie-Mae-DOWN,Fed-Home-Loan-DOWN,
MBNA-Corp-DOWN,Morgan-Stanley-DOWN
Financial-DOWN
4
Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,
Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,
Schlumberger-UP
Oil-UP
Association Rule Discovery: Definition
Given a set of records each of which contain some number of items from a given collection;
Produce dependency rules which will predict occurrence of an item based on occurrences of other items.
Rules Discovered:
{Milk} –> {Coke}
{Diaper, Milk} –> {Beer}
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Association Rule Discovery: Application 1
Marketing and Sales Promotion:
Let the rule discovered be
{Bagels, … } –> {Potato Chips}
Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!
Association Rule Discovery: Application 2
Supermarket shelf management.
Goal: To identify items that are bought together by sufficiently many customers.
Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
A classic rule —
If a customer buys diaper and milk, then he is very likely to buy beer.
So, don’t be surprised if you find six-packs stacked next to diapers!
Association Rule Discovery: Application 3
Inventory Management:
Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with right parts to reduce on number of visits to consumer households.
Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
Sequential Pattern Discovery: Definition
Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
[Figure: sequential pattern (A B) (C) (D E) annotated with timing constraints — maximum span ms, maximum gap xg, minimum gap ng, and window size ws]
Sequential Pattern Discovery: Examples
In telecommunications alarm logs,
(Inverter_Problem Excessive_Line_Current)
(Rectifier_Alarm) --> (Fire_Alarm)
In point-of-sale transaction sequences,
Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) –> (Perl_for_dummies,Tcl_Tk)
Athletic Apparel Store:
(Shoes) (Racket, Racketball) –> (Sports_Jacket)
Regression
Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Greatly studied in statistics, neural network fields.
Examples:
Predicting sales amounts of a new product based on advertising expenditure.
Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.
Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
Credit Card Fraud Detection
Network Intrusion
Detection
Typical network traffic at University level may reach over 100 million connections per day
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data
Data Mining: Data
Lecture Notes for Chapter 2
Introduction to Data Mining
by
Tan, Steinbach, Kumar
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
What is Data?
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
Examples: eye color of a person, temperature, etc.
Attribute is also known as variable, field, characteristic, or feature
A collection of attributes describe an object
Object is also known as record, point, case, sample, entity, or instance
Attributes
Objects
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Attribute Values
Attribute values are numbers or symbols assigned to an attribute
Distinction between attributes and attribute values
Same attribute can be mapped to different attribute values
Example: height can be measured in feet or meters
Different attributes can be mapped to the same set of values
Example: Attribute values for ID and age are integers
But properties of attribute values can be different
ID has no limit but age has a maximum and minimum value
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Measurement of Length
The way you measure an attribute may not match the attribute's properties.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Types of Attributes
There are different types of attributes
Nominal
Examples: ID numbers, eye color, zip codes
Ordinal
Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
Interval
Examples: calendar dates, temperatures in Celsius or Fahrenheit.
Ratio
Examples: temperature in Kelvin, length, time, counts
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
Distinctness: =, ≠
Order: <, >
Addition: +, -
Multiplication: *, /
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Nominal
Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
Operations: mode, entropy, contingency correlation, χ² test

Ordinal
Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
Examples: hardness of minerals, {good, better, best}, grades, street numbers
Operations: median, percentiles, rank correlation, run tests, sign tests

Interval
Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
Examples: calendar dates, temperature in Celsius or Fahrenheit
Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio
Description: For ratio variables, both differences and ratios are meaningful. (*, /)
Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
Operations: geometric mean, harmonic mean, percent variation
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Nominal
Transformation: Any permutation of values
Comment: If all employee ID numbers were reassigned, would it make any difference?

Ordinal
Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
Comment: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval
Transformation: new_value = a * old_value + b, where a and b are constants.
Comment: The Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Ratio
Transformation: new_value = a * old_value
Comment: Length can be measured in meters or feet.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Discrete and Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection of documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a finite number of digits.
Continuous attributes are typically represented as floating-point variables.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Types of data sets
Record
Data Matrix
Document Data
Transaction Data
Graph
World Wide Web
Molecular Structures
Ordered
Spatial Data
Temporal Data
Sequential Data
Genetic Sequence Data
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Important Characteristics of Structured Data
Dimensionality
Curse of Dimensionality
Sparsity
Only presence counts
Resolution
Patterns depend on the scale
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
Such a data set can be represented by an m-by-n matrix, where there are m rows, one for each object, and n columns, one for each attribute
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Document Data
Each document becomes a 'term' vector,
each term is a component (attribute) of the vector,
the value of each component is the number of times the corresponding term occurs in the document.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
[Table: term-frequency vectors for Documents 1-3 over the terms season, timeout, lost, win, game, score, ball, play, coach, and team.]
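As a rough illustration of the term-vector idea (not from the slides), a few lines of Python that turn short documents into count vectors over a fixed vocabulary; the documents here are invented:

```python
from collections import Counter

terms = ["season", "timeout", "lost", "win", "game",
         "score", "ball", "play", "coach", "team"]

# Toy documents, made up only to show the mechanics.
docs = [
    "the team lost the game after a timeout late in the season",
    "the coach wants the team to play ball and win the game",
]

# Each document becomes a vector whose components are term counts.
vectors = [[Counter(doc.split())[t] for t in terms] for doc in docs]
for v in vectors:
    print(v)
```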
Transaction Data
A special type of record data, where
each record (transaction) involves a set of items.
For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
TID
Items
1
Bread, Coke, Milk
2
Beer, Bread
3
Beer, Coke, Diaper, Milk
4
Beer, Bread, Diaper, Milk
5
Coke, Diaper, Milk
Graph Data
Examples: Generic graph and HTML Links
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Chemical Data
Benzene Molecule: C6H6
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Ordered Data
Sequences of transactions
An element of the sequence
Items/Events
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Ordered Data
Genomic sequence data
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Ordered Data
Spatio-Temporal Data
Average Monthly Temperature of land and ocean
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
Noise and outliers
missing values
duplicate data
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Noise
Noise refers to modification of original values
Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen
Two Sine Waves
Two Sine Waves + Noise
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Outliers
Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Missing Values
Reasons for missing values
Information is not collected
(e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Handling missing values
Eliminate Data Objects
Estimate Missing Values
Ignore the Missing Value During Analysis
Replace with all possible values (weighted by their probabilities)
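A small pandas sketch, with invented values, showing three of these strategies on a toy table:

```python
import numpy as np
import pandas as pd

# Toy data with one missing Age value -- invented for illustration.
df = pd.DataFrame({"Age": [23.0, np.nan, 45.0, 31.0],
                   "Income": [30, 42, 80, 55]})

dropped = df.dropna()                            # eliminate data objects with missing values
imputed = df.fillna({"Age": df["Age"].mean()})   # estimate the missing value (mean imputation)
mean_age = df["Age"].mean()                      # most pandas statistics simply ignore NaN

print(dropped)
print(imputed)
print(mean_age)
```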
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Duplicate Data
Data set may include data objects that are duplicates, or almost duplicates of one another
Major issue when merging data from heterogeneous sources
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object)
Purpose
Data reduction
Reduce the number of attributes or objects
Change of scale
Cities aggregated into regions, states, countries, etc
More “stable” data
Aggregated data tends to have less variability
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Aggregation
Standard Deviation of Average Monthly Precipitation
Standard Deviation of Average Yearly Precipitation
Variation of Precipitation in Australia
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Sampling
Sampling is the main technique employed for data selection.
It is often used for both the preliminary investigation of the data and the final data analysis.
Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Sampling …
The key principle for effective sampling is the following:
using a sample will work almost as well as using the entire data set, if the sample is representative
A sample is representative if it has approximately the same property (of interest) as the original set of data
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Types of Sampling
Simple Random Sampling
There is an equal probability of selecting any particular item
Sampling without replacement
As each item is selected, it is removed from the population
Sampling with replacement
Objects are not removed from the population as they are selected for the sample.
In sampling with replacement, the same object can be picked up more than once
Stratified sampling
Split the data into several partitions; then draw random samples from each partition
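A short sketch of these sampling schemes using pandas on toy data (the group labels and sizes are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=100),
                   "group": rng.choice(["A", "B", "C"], size=100)})

without_repl = df.sample(n=10, replace=False, random_state=0)  # simple random sampling
with_repl = df.sample(n=10, replace=True, random_state=0)      # the same object may be picked twice

# Stratified sampling: draw the same fraction from every partition (group).
stratified = df.groupby("group", group_keys=False).sample(frac=0.1, random_state=0)
print(stratified["group"].value_counts())
```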
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Sample Size
[Figure: the same data set shown with 8000, 2000, and 500 sampled points.]
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Sample Size
What sample size is necessary to get at least one object from each of 10 groups?
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse in the space that it occupies
Definitions of density and distance between points, which is critical for clustering and outlier detection, become less meaningful
Randomly generate 500 points
Compute difference between max and min distance between any pair of points
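The experiment described above is easy to reproduce; a sketch using scipy, with the dimensionalities chosen here only for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
for d in [2, 10, 40, 80, 160]:          # illustrative choices of dimensionality
    points = rng.random((500, d))       # 500 random points in the unit hypercube
    dists = pdist(points)               # all pairwise Euclidean distances
    spread = np.log10((dists.max() - dists.min()) / dists.min())
    print(d, round(spread, 3))          # the relative spread of distances shrinks as d grows
```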
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Dimensionality Reduction
Purpose:
Avoid curse of dimensionality
Reduce amount of time and memory required by data mining algorithms
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce noise
Techniques
Principal Component Analysis (PCA)
Singular Value Decomposition
Others: supervised and non-linear techniques
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Dimensionality Reduction: PCA
Goal is to find a projection that captures the largest amount of variation in data
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Dimensionality Reduction: PCA
Find the eigenvectors of the covariance matrix
The eigenvectors define the new space
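A minimal numpy sketch of exactly these two steps (eigenvectors of the covariance matrix, then projection); the data are random and only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))            # toy data: 200 objects, 5 attributes

Xc = X - X.mean(axis=0)                  # center the data
cov = np.cov(Xc, rowvar=False)           # covariance matrix of the attributes
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvectors define the new space

order = np.argsort(eigvals)[::-1]        # sort components by captured variance
components = eigvecs[:, order[:2]]       # keep the top two principal components
X_reduced = Xc @ components              # project onto the new 2-D space
print(X_reduced.shape)                   # (200, 2)
```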
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Dimensionality Reduction: ISOMAP
Construct a neighbourhood graph
For each pair of points in the graph, compute the shortest path distances – geodesic distances
By: Tenenbaum, de Silva, Langford (2000)
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Dimensionality Reduction: PCA
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
duplicate much or all of the information contained in one or more other attributes
Example: purchase price of a product and the amount of sales tax paid
Irrelevant features
contain no information that is useful for the data mining task at hand
Example: students’ ID is often irrelevant to the task of predicting students’ GPA
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Feature Subset Selection
Techniques:
Brute-force approach:
Try all possible feature subsets as input to data mining algorithm
Embedded approaches:
Feature selection occurs naturally as part of the data mining algorithm
Filter approaches:
Features are selected before data mining algorithm is run
Wrapper approaches:
Use the data mining algorithm as a black box to find best subset of attributes
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Feature Creation
Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
Three general methodologies:
Feature Extraction
domain-specific
Mapping Data to New Space
Feature Construction
combining features
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Mapping Data to a New Space
[Figure panels: two sine waves; the two sine waves plus noise; the frequency-domain view obtained with a Fourier transform; a wavelet transform.]
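A small numpy sketch of the Fourier-transform idea: two sine waves are buried in noise in the time domain, but their frequencies stand out in the new (frequency) space. The signal is synthetic and only illustrative:

```python
import numpy as np

t = np.linspace(0, 1, 1024, endpoint=False)
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 7 * t) + np.sin(2 * np.pi * 17 * t)   # two sine waves
noisy = signal + 0.5 * rng.normal(size=t.size)                    # two sine waves + noise

spectrum = np.fft.rfft(noisy)                     # map the data to frequency space
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
# The two largest peaks sit at the underlying frequencies (7 Hz and 17 Hz).
print(freqs[np.argsort(np.abs(spectrum))[-2:]])
```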
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Discretization Using Class Labels
Entropy based approach
3 categories for both x and y
5 categories for both x and y
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Discretization Without Using Class Labels
Data
Equal interval width
Equal frequency
K-means
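A sketch of the three unsupervised approaches named above, using pandas and scikit-learn on invented data:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x = pd.Series(np.concatenate([rng.normal(0, 1, 100), rng.normal(6, 1, 100)]))  # toy values

equal_width = pd.cut(x, bins=4)    # equal interval width
equal_freq = pd.qcut(x, q=4)       # equal frequency
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(x.to_numpy().reshape(-1, 1))

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
print(np.bincount(kmeans_labels))  # number of points per k-means bin
```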
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Attribute Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
Simple functions: x^k, log(x), e^x, |x|
Standardization and Normalization
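A few illustrative numpy lines for the transformations just listed (the attribute values are made up):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])            # toy attribute with one large value

log_x = np.log(x)                                     # simple function: log(x)
standardized = (x - x.mean()) / x.std()               # standardization: zero mean, unit variance
normalized = (x - x.min()) / (x.max() - x.min())      # min-max normalization to [0, 1]

print(standardized.round(2))
print(normalized.round(2))
```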
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects are.
Is higher when objects are more alike.
Often falls in the range [0,1]
Dissimilarity
Numerical measure of how different two data objects are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Proximity refers to a similarity or dissimilarity
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Euclidean Distance
dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}
where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Standardization is necessary, if scales differ.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Euclidean Distance
Distance Matrix
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Euclidean distance matrix:
     p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0
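The distance matrix above can be verified in a few lines; a sketch, not part of the slides:

```python
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])   # p1..p4 from the table above

# All pairwise Euclidean distances via broadcasting.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
print(dist.round(3))                                   # e.g. d(p1, p2) = 2.828
```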
Minkowski Distance
Minkowski Distance is a generalization of Euclidean Distance:
dist(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}
where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Minkowski Distance: Examples
r = 1. City block (Manhattan, taxicab, L1 norm) distance.
A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors
r = 2. Euclidean distance
r → ∞. “supremum” (Lmax norm, L∞ norm) distance.
This is the maximum difference between any component of the vectors
Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Minkowski Distance
Distance Matrix
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1 (city block) distance matrix:
     p1  p2  p3  p4
p1   0   4   4   6
p2   4   0   2   4
p3   4   2   0   2
p4   6   4   2   0

L2 (Euclidean) distance matrix:
     p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0

L∞ (supremum) distance matrix:
     p1  p2  p3  p4
p1   0   2   3   5
p2   2   0   1   3
p3   3   1   0   2
p4   5   3   2   0
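A sketch that reproduces all three matrices from the point coordinates above (again, not part of the original slides):

```python
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])   # p1..p4
diff = np.abs(points[:, None, :] - points[None, :, :])

L1 = diff.sum(axis=-1)                   # r = 1: city block distance
L2 = np.sqrt((diff ** 2).sum(axis=-1))   # r = 2: Euclidean distance
Linf = diff.max(axis=-1)                 # r -> infinity: supremum distance

print(L1)
print(L2.round(3))
print(Linf)
```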
Mahalanobis Distance
mahalanobis(p, q) = (p - q) \, \Sigma^{-1} \, (p - q)^T
where \Sigma is the covariance matrix of the input data X:
\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)
For the red points in the figure, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Mahalanobis Distance
Covariance Matrix: \Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}
B
A
C
A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)
Mahal(A,B) = 5
Mahal(A,C) = 4
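The two distances quoted above can be checked with a short numpy sketch:

```python
import numpy as np

cov = np.array([[0.3, 0.2],
                [0.2, 0.3]])            # covariance matrix from the slide
cov_inv = np.linalg.inv(cov)

def mahalanobis(p, q):
    d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return d @ cov_inv @ d              # (p - q) Sigma^-1 (p - q)^T, as defined above

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(mahalanobis(A, B))                # ~5
print(mahalanobis(A, C))                # ~4
```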
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Common Properties of a Distance
Distances, such as the Euclidean distance, have some well known properties.
d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
d(p, q) = d(q, p) for all p and q. (Symmetry)
d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.
A distance that satisfies these properties is a metric
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Common Properties of a Similarity
Similarities, also have some well known properties.
s(p, q) = 1 (or maximum similarity) only if p = q.
s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data objects), p and q.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Similarity Between Binary Vectors
Common situation is that objects, p and q, have only binary attributes
Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attributes values
= (M11) / (M01 + M10 + M11)
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
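A few numpy lines that reproduce this example (a sketch, not from the slides):

```python
import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = np.sum((p == 1) & (q == 1))
m00 = np.sum((p == 0) & (q == 0))
m10 = np.sum((p == 1) & (q == 0))
m01 = np.sum((p == 0) & (q == 1))

smc = (m11 + m00) / (m01 + m10 + m11 + m00)   # 0.7
jaccard = m11 / (m01 + m10 + m11)             # 0.0
print(smc, jaccard)
```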
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Cosine Similarity
If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d.
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos( d1, d2 ) = .3150
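The same calculation in numpy, as a quick check:

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315
```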
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Extended Jaccard Coefficient (Tanimoto)
Variation of Jaccard for continuous or count attributes
Reduces to Jaccard for binary attributes
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Correlation
Correlation measures the linear relationship between objects
To compute correlation, we standardize the data objects, p and q, and then take their dot product:
p'_k = (p_k - mean(p)) / std(p)
q'_k = (q_k - mean(q)) / std(q)
correlation(p, q) = p' • q'
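One way to realize this recipe in numpy (dividing the dot product by n so that the result matches the usual Pearson correlation); the two vectors are invented for illustration:

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # toy data objects
q = np.array([2.0, 3.9, 6.1, 8.0, 10.2])

p_std = (p - p.mean()) / p.std()             # standardize p
q_std = (q - q.mean()) / q.std()             # standardize q
corr = (p_std @ q_std) / len(p)              # dot product of the standardized objects

print(round(corr, 4))
print(round(np.corrcoef(p, q)[0, 1], 4))     # same value from numpy's built-in
```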
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Visually Evaluating Correlation
Scatter plots showing the similarity from –1 to 1.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
General Approach for Combining Similarities
Sometimes attributes are of many different types, but an overall similarity is needed.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Using Weights to Combine Similarities
May not want to treat all attributes the same.
Use weights wk which are between 0 and 1 and sum to 1.
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Density
Density-based clustering requires a notion of density
Examples:
Euclidean density
Euclidean density = number of points per unit volume
Probability density
Graph-based density
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Euclidean Density – Cell-based
Simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points the cell contains
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Euclidean Density – Center-based
Euclidean density is the number of points within a specified radius of the point
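A short numpy sketch of center-based Euclidean density on random 2-D points (the radius and the data are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((300, 2))                  # toy 2-D point set
radius = 0.1

# Center-based Euclidean density: number of points within `radius` of each point.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
density = (dist <= radius).sum(axis=1) - 1     # exclude the point itself

print(density[:10])
```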
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
Example data matrix (each row is an object, each column a numeric attribute):
Thickness  Load  Distance  Projection of y load  Projection of x load
1.1        2.2   16.22     6.25                  12.65
1.2        2.7   15.22     5.27                  10.23
Data Mining: Exploring Data
Lecture Notes for Chapter 3
Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler
Look for accompanying R code on the course web site.
© Tan,Steinbach, Kumar Introduction to Data Mining 8/05/2005 ‹#›
Techniques Used In Data Exploration
In EDA, as originally defined by Tukey
– The focus was on visualization
– Clustering and anomaly detection were viewed as
exploratory techniques
– In data mining, clustering and anomaly detection are
major areas of interest, and not thought of as just
exploratory
In our discussion of data exploration, we focus on
– Summary statistics
– Visualization
– Online Analytical Processing (OLAP)
What is data exploration?
• Key motivations of data exploration include
– Helping to select the right tool for preprocessing or analysis
– Making use of humans’ abilities to recognize patterns
• People can recognize patterns not captured by data analysis
tools
• Related to the area of Exploratory Data Analysis (EDA)
– Created by statistician John Tukey
– Seminal book is “Exploratory Data Analysis” by Tukey
– A nice online introduction can be found in Chapter 1 of the NIST
Engineering Statistics Handbook
http://www.itl.nist.gov/div898/handbook/index.htm
A preliminary exploration of the data to
better understand its characteristics.
Iris Sample Data Set
• Many of the exploratory data techniques are illustrated
with the Iris Plant data set.
– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician R.A. Fisher
– Three flower types (classes):
• Setosa
• Virginica
• Versicolour
– Four (non-class) attributes
• Sepal width and length
• Petal width and length
[Image credit: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]
Summary Statistics
• Summary statistics are numbers that summarize
properties of the data
– Summarized properties include location and
spread for continuous data
• Examples: location – mean
spread – standard deviation
-Most summary statistics can be calculated in a
single pass through the data
Frequency and Mode
• The frequency of an attribute value is the
percentage of time the value occurs in the
data set
– For example, given the attribute ‘gender’ and a
representative population of people, the gender
‘female’ occurs about 50% of the time.
• The mode of an attribute is the most frequent attribute
value
• The notions of frequency and mode are typically used
with categorical data
Measures of Location: Mean and
Median
• The mean is the most common measure of the
location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also
commonly used.
Measures of Spread: Range and Variance
• Range is the difference between the max and min
• The variance or standard deviation is the most
common measure of the spread of a set of points.
• However, this is also sensitive to outliers, so that
other measures are often used.
Percentiles
• The pth percentile x_p is a value of the attribute x such that p% of the observed values are smaller than x_p.
• Median – 50% of the cases have a smaller value and 50% are larger.
Multivariate Summary Statistics

Object  x1  x2
1       12  15
2        2   4
…        …   …
m       18   4

• Covariance between features i and j:
  s_{ij} = \frac{1}{m-1} \sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)
• Correlation:
  r_{ij} = \frac{s_{ij}}{s_i s_j}, where s_i is the standard deviation of feature i
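A sketch that computes these quantities with numpy, using only the rows visible in the table (purely to show the mechanics):

```python
import numpy as np

X = np.array([[12.0, 15.0],
              [ 2.0,  4.0],
              [18.0,  4.0]])            # objects as rows, features x1 and x2 as columns

cov = np.cov(X, rowvar=False)           # s_ij, computed with the 1/(m-1) factor
corr = np.corrcoef(X, rowvar=False)     # r_ij = s_ij / (s_i * s_j)

print(cov.round(2))
print(corr.round(2))
```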
Topics
• Exploratory Data Analysis
• Summary Statistics
• Visualization
Visualization
Visualization is the conversion of data into a visual
or tabular format so that the characteristics of the
data and the relationships among data items or
attributes can be analyzed or reported.
• Visualization of data is one of the most powerful and
appealing techniques for data exploration.
– Humans have a well developed ability to analyze
large amounts of information that is presented
visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns
Example: Sea Surface Temperature
• The following shows the Sea Surface Temperature
(SST) for July 1982
– Tens of thousands of data points are
summarized in a single figure
Representation
• Is the mapping of information to a visual format
• Data objects, their attributes, and the relationships
among data objects are translated into graphical
elements such as points, lines, shapes, and colors.
• Example:
– Objects are often represented as points
– Their attribute values can be represented as the
position of the points or the characteristics of the
points, e.g., color, size, and shape
– If position is used, then the relationships of points,
i.e., whether they form groups or a point is an
outlier, is easily perceived.
Arrangement
• Is the placement of visual elements within a
display
• Can make a large difference in how easy it is to
understand the data
• Example:
Selection
• Is the elimination or the deemphasis of certain objects and
attributes
• Selection may involve choosing a subset of attributes
– Dimensionality reduction is often used to reduce the
number of dimensions to two or three
– Alternatively, pairs of attributes can be considered
• Selection may also involve choosing a subset of objects
– A region of the screen can only show so many points
– Can sample, but want to preserve points in sparse areas
Histograms
• Usually shows the distribution of values of a single variable
• Divide the values into bins and show a bar plot of the number of objects in each bin.
• The height of each bar indicates the number of objects
• Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
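A minimal matplotlib sketch of this example, assuming scikit-learn's bundled copy of the Iris data:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame
petal_width = iris["petal width (cm)"]

# Same attribute, two bin counts: the shape of the histogram depends on the number of bins.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(petal_width, bins=10)
axes[1].hist(petal_width, bins=20)
plt.show()
```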
Empirical Cumulative Distribution
Function (ECDF)
• Probability Density Function (PDF): describes the relative likelihood for this random variable to take on a given value
• Cumulative Distribution Function (CDF): shows the distribution of data as the fraction of points that are less than this value
Example: ECDF
Two-Dimensional Histograms
• Show the joint distribution of the values of two attributes
• Example: petal width and petal length
– What does this tell us?
Box Plots
• Invented by J. Tukey
• Another way of displaying the distribution of data
• The following figure shows the basic parts of a box plot: the box spans the 25th percentile (Q1) to the 75th percentile (Q3), the line inside the box marks the median (50th percentile), the whiskers extend to Q1 - 1.5 × IQR and Q3 + 1.5 × IQR (IQR = Q3 - Q1), and points beyond the whiskers are drawn individually as outliers.
Example of Box Plots
• Box plots can be used to compare attributes
Scatter Plots
– Attribute values determine the position
– Two-dimensional scatter plots most common,
but can have three-dimensional scatter plots
-Often additional attributes can be displayed by
using the size, shape, and color of the markers
that represent the objects
– It is useful to have arrays of scatter plots that can compactly summarize the relationships of several pairs of attributes
• See example on the next slide
Scatter Plot Array of Iris Attributes
Contour Plots
– Useful when a continuous attribute is measured
on a spatial grid
– They partition the plane into regions of similar
values
– The contour lines that form the boundaries of
these regions connect points with equal values
– The most common example is contour maps of
elevation
– Can also display temperature, rainfall, air
pressure, etc.
An example for Sea Surface Temperature (SST) is
provided on the next slide
Contour Plot Example: SST Dec, 1998
Celsius
Matrix Plots
– Can plot a data matrix
– Can be useful when objects are sorted
according to class
– Typically, the attributes are normalized to
prevent one attribute from dominating the plot
– Plots of similarity or distance matrices can also
be useful for visualizing the relationships
between objects
Visualization of the Iris Data Matrix
[Color scale: deviation from the feature mean, in standard deviations.]
Visualization of the Iris Correlation Matrix
Parallel Coordinates
– Used to plot the attribute values of high-
dimensional data
– Instead of using perpendicular axes, use a set of
parallel axes
– The attribute values of each object are plotted as a
point on each corresponding coordinate axis and
the points are connected by a line
– Thus, each object is represented as a line
– Often, the lines representing a distinct class of
objects group together, at least for some attributes
– Ordering of attributes is important in seeing such
groupings
Parallel Coordinates Plots for Iris Data
Reordered features
Other Visualization Techniques
Star Plots
– Similar approach to parallel coordinates, but axes radiate
from a central point
– The line connecting the values of an object is a polygon
Chernoff Faces
– Approach created by Herman Chernoff
– This approach associates each attribute with a characteristic
of a face
– The values of each attribute determine the appearance of the
corresponding facial characteristic
– Each object becomes a separate face
– Relies on human’s ability to distinguish faces
Star Plots for Iris Data
Setosa
Versicolor
Virginica
Chernoff Faces for Iris Data
Setosa
Versicolor
Virginica
OLAP
• On-Line Analytical Processing (OLAP) was
proposed by E. F. Codd, the father of the
relational database.
• Relational databases put data into tables, while
OLAP uses a multidimensional array
representation.
– Such representations of data previously existed in
statistics and other fields
• There are a number of data analysis and data
exploration operations that are easier with such
a data representation.
Creating a Multidimensional Array
• Two key steps in converting tabular data into a
multidimensional array.
– First, identify which attributes are to be the dimensions
and which attribute is to be the target attribute whose
values appear as entries in the multidimensional array.
• The attributes used as dimensions must have discrete values
• The target value is typically a count or continuous value, e.g.,
the cost of an item
• Can have no target variable at all except the count of objects
that have the same set of attribute values
– Second, find the value of each entry in the
multidimensional array by summing the values (of the
target attribute) or count of all objects that have the
attribute values corresponding to that entry.
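A pandas sketch of these two steps on the Iris data (the same example the next slides walk through); the equal-width, three-level discretization here is an assumption made just for illustration:

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris(as_frame=True)
iris = data.frame
iris["species"] = iris["target"].map(dict(enumerate(data.target_names)))

# Step 1: choose the dimension attributes and make them discrete.
iris["petal_width_bin"] = pd.cut(iris["petal width (cm)"], bins=3, labels=["low", "medium", "high"])
iris["petal_length_bin"] = pd.cut(iris["petal length (cm)"], bins=3, labels=["low", "medium", "high"])

# Step 2: each entry of the multidimensional array is the count of objects with those attribute values.
cube = pd.crosstab([iris["petal_width_bin"], iris["petal_length_bin"]], iris["species"])
print(cube)
```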
Example: Iris data
• We show how the attributes, petal length, petal
width, and species type can be converted to a
multidimensional array
– First, we discretized the petal width and length to
have categorical values: low, medium, and high
– We get the following table – note the count
attribute
Example: Iris data (continued)
• Each unique tuple of petal width, petal length,
and species type identifies one element of the
array.
• This element is assigned the corresponding
count value.
• The figure illustrates
the result.
• All non-specified
tuples are 0.
Example: Iris data (continued)
• Slices of the multidimensional array are shown
by the following cross-tabulations
• What do these tables tell us?
OLAP Operations: Data Cube
• The key operation of OLAP is the formation of a
data cube
• A data cube is a multidimensional representation
of data, together with all possible aggregates.
• By all possible aggregates, we mean the aggregates
that result by selecting a proper subset of the
dimensions and summing over all remaining
dimensions.
• For example, if we choose the species type
dimension of the Iris data and sum over all other
dimensions, the result will be a one-dimensional
entry with three entries, each of which gives the
number of flowers of each type.
Data Cube Example
• Consider a data set that records the sales of
products at a number of company stores at
various dates.
• This data can be represented
as a 3 dimensional array
• There are 3 two-dimensional
aggregates (3 choose 2 ),
3 one-dimensional aggregates,
and 1 zero-dimensional
aggregate (the overall total)
Data Cube Example (continued)
• The following table shows one of the two-
dimensional aggregates, along with two of the
one-dimensional aggregates, and the overall
total
OLAP Operations: Slicing and Dicing
• Slicing is selecting a group of cells from the
entire multidimensional array by specifying a
specific value for one or more dimensions.
• Dicing involves selecting a subset of cells by
specifying a range of attribute values.
– This is equivalent to defining a subarray from the
complete array.
• In practice, both operations can also be
accompanied by aggregation over some
dimensions.
OLAP Operations: Roll-up and Drill-down
• Attribute values often have a hierarchical
structure.
– Each date is associated with a year, month, and week.
– A location is associated with a continent, country, state
(province, etc.), and city.
– Products can be divided into various categories, such as
clothing, electronics, and furniture.
• Note that these categories often nest and form a
tree or lattice
– A year contains months, which contain days
– A country contains a state which contains a city
OLAP Operations: Roll-up and Drill-down
• This hierarchical structure gives rise to the roll-
up and drill-down operations.
– For sales data, we can aggregate (roll up) the sales
across all the dates in a month.
– Conversely, given a view of the data where the time
dimension is broken into months, we could split the
monthly sales totals (drill down) into daily sales
totals.
– Likewise, we can drill down or roll up on the
location or product ID attributes.