Data mining question

Please answer the question in the attached document.

[Your Name Here]

1. Obtain one of the data sets available at the UCI Machine Learning Repository and apply as many of the different visualization techniques described in the chapter as possible. The bibliographic notes and book Web site provide pointers to visualization software.

2. Identify at least two advantages and two disadvantages of using color to visually represent information.

3. What are the arrangement issues that arise with respect to three-dimensional plots?

4. Discuss the advantages and disadvantages of using sampling to reduce the number of data objects that need to be displayed. Would simple random sampling (without replacement) be a good approach to sampling? Why or why not?

5. Describe how you would create visualizations to display information that describes the following types of systems.

(a) Computer networks. Be sure to include both the static aspects of the network, such as connectivity, and the dynamic aspects, such as traffic.

(b) The distribution of specific plant and animal species around the world for a specific moment in time.

(c) The use of computer resources, such as processor time, main memory, and disk, for a set of benchmark database programs.

(d) The change in occupation of workers in a particular country over the last thirty years. Assume that you have yearly information about each person that also includes gender and level of education.

Be sure to address the following issues:

· Representation. How will you map objects, attributes, and relationships to visual elements?

· Arrangement. Are there any special considerations that need to be taken into account with respect to how visual elements are displayed? Specific examples might be the choice of viewpoint, the use of transparency, or the separation of certain groups of objects.

· Selection. How will you handle a large number of attributes and data objects?

6. Describe one advantage and one disadvantage of a stem and leaf plot with respect to a standard histogram.

7. How might you address the problem that a histogram depends on the number and location of the bins?

8. Describe how a box plot can give information about whether the value of an attribute is symmetrically distributed. What can you say about the symmetry of the distributions of the attributes shown in Figure 3.11?

9. Compare sepal length, sepal width, petal length, and petal width, using Figure 3.12.

10. Comment on the use of a box plot to explore a data set with four attributes: age, weight, height, and income.

11. Give a possible explanation as to why most of the values of petal length and width fall in the buckets along the diagonal in Figure 3.9.

12. Use Figures 3.14 and 3.15 to identify a characteristic shared by the petal width and petal length attributes.

13. Simple line plots, such as that displayed in Figure 2.12 on page 56, which shows two time series, can be used to effectively display high-dimensional data. For example, in Figure 2.12 it is easy to tell that the frequencies of the two time series are different. What characteristic of time series allows the effective visualization of high-dimensional data?

14. Describe the types of situations that produce sparse or dense data cubes. Illustrate with examples other than those used in the book.

15. How might you extend the notion of multidimensional data analysis so that the target variable is a qualitative variable? In other words, what sorts of summary statistics or data visualizations would be of interest?

16. Construct a data cube from Table 3.14. Is this a dense or sparse data cube? If it is sparse, identify the cells that are empty.

17. Discuss the differences between dimensionality reduction based on aggregation and dimensionality reduction based on techniques such as PCA and SVD.

download instant at www.easysemester.com

Exploring Data

1. Obtain one of the data sets available at the UCI Machine Learning Repository
and apply as many of the different visualization techniques described in the
chapter as possible. The bibliographic notes and book Web site provide
pointers to visualization software.

MATLAB and R have excellent facilities for visualization. Most of the
figures in this chapter were created using MATLAB. R is freely available from
http://www.r-project.org/.

2. Identify at least two advantages and two disadvantages of using color to
visually represent information.

Advantages: Color makes it much easier to distinguish visual elements
from one another. For example, three clusters of two-dimensional
points are more readily distinguished if the markers representing the points
have different colors, rather than only different shapes. Also, figures with
color are more interesting to look at.

Disadvantages: Some people are color blind and may not be able to properly
interpret a color figure. Grayscale figures can show more detail in some cases.
Color can be hard to use properly. For example, a poor color scheme can be
garish or can focus attention on unimportant elements.

3. What are the arrangement issues that arise with respect to three-dimensional
plots?

It would have been better to state this more generally as “What are the issues
. . . ,” since selection, as well as arrangement, plays a key role in displaying a
three-dimensional plot.

The key issue for three-dimensional plots is how to display information so
that as little information is obscured as possible. If the plot is of a
two-dimensional surface, then the choice of a viewpoint is critical. However, if the
plot is in electronic form, then it is sometimes possible to interactively change
the viewpoint to get a complete view of the surface. For three-dimensional
solids, the situation is even more challenging. Typically, portions of the
information must be omitted so that the portion of interest remains visible.
For example, a slice or cross-section of a three-dimensional object is often
shown. In some cases, transparency can be used. Again, the ability to change
the arrangement of the visual elements interactively can be helpful.

4. Discuss the advantages and disadvantages of using sampling to reduce the
number of data objects that need to be displayed. Would simple random
sampling (without replacement) be a good approach to sampling? Why or
why not?

Simple random sampling is not the best approach since it will eliminate most
of the points in sparse regions. It is better to undersample the regions where
data objects are too dense while keeping most or all of the data objects from
sparse regions.
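
The idea of thinning dense regions while keeping sparse ones can be sketched as follows. This is only an illustration, not a method from the text: it assumes two-dimensional points and uses a simple square grid as a stand-in for a density estimate. The function name density_aware_sample and the parameters cell_size and max_per_cell are hypothetical.

```python
import random
from collections import defaultdict

def density_aware_sample(points, cell_size, max_per_cell, seed=0):
    """Keep at most max_per_cell points from each square grid cell."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for x, y in points:
        # Assign each point to the grid cell that contains it.
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append((x, y))
    sample = []
    for members in cells.values():
        if len(members) <= max_per_cell:
            sample.extend(members)  # sparse cell: keep every point
        else:
            # Dense cell: simple random sample without replacement.
            sample.extend(rng.sample(members, max_per_cell))
    return sample

# Hypothetical data: one crowded unit square plus three isolated points.
rng = random.Random(1)
dense = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(1000)]
sparse = [(5.5, 5.5), (8.2, 1.3), (9.7, 9.1)]
kept = density_aware_sample(dense + sparse, cell_size=1.0, max_per_cell=20)
print(len(kept))  # 23: 20 from the crowded cell plus the 3 isolated points
```

A plain simple random sample of 23 points from the same 1003 would, on average, contain almost none of the three isolated points, which is the weakness described above.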

5. Describe how you would create visualizations to display information that
describes the following types of systems.

Be sure to address the following issues:

• Representation. How will you map objects, attributes, and
relationships to visual elements?

• Arrangement. Are there any special considerations that need to be
taken into account with respect to how visual elements are displayed?
Specific examples might be the choice of viewpoint, the use of
transparency, or the separation of certain groups of objects.

• Selection. How will you handle a large number of attributes and data
objects?

The following solutions are intended for illustration.

(a) Computer networks. Be sure to include both the static aspects of the
network, such as connectivity, and the dynamic aspects, such as traffic.

The connectivity of the network would best be represented as a graph,
with the nodes being routers, gateways, or other communications devices
and the links representing the connections. The bandwidth of the
connection could be represented by the width of the links. Color could
be used to show the percent usage of the links and nodes.
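
As a rough sketch of this mapping, the code below turns link attributes into drawing attributes. Everything specific here is an assumption made for illustration: the link records, the function name link_style, the pixel widths, and the light-to-dark blue palette (chosen because a single-hue ramp remains readable for color-blind viewers).

```python
# Hypothetical link records: (node_a, node_b, bandwidth in Mbit/s, utilization 0..1).
links = [
    ("router-1", "gateway-1", 1000, 0.15),
    ("router-1", "router-2", 100, 0.90),
]

LOW_RGB = (222, 235, 247)   # light blue for idle links (assumed palette)
HIGH_RGB = (8, 48, 107)     # dark blue for saturated links (assumed palette)

def link_style(bandwidth, utilization, max_bandwidth, max_width_px=8.0):
    """Map bandwidth to line width and utilization to a color on the ramp."""
    # Width is proportional to bandwidth, with a 1-pixel floor so thin links stay visible.
    width = max(1.0, max_width_px * bandwidth / max_bandwidth)
    u = min(max(utilization, 0.0), 1.0)  # clamp utilization to [0, 1]
    rgb = tuple(round(lo + (hi - lo) * u) for lo, hi in zip(LOW_RGB, HIGH_RGB))
    return {"width_px": width, "rgb": rgb}

max_bw = max(bw for _, _, bw, _ in links)
styles = {(a, b): link_style(bw, util, max_bw) for a, b, bw, util in links}
for link, style in styles.items():
    print(link, style)
```

The resulting widths and colors could then be passed to whatever graph-drawing tool is used to lay out the nodes and links.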

(b) The distribution of specific plant and animal species around the world
for a specific moment in time.

The simplest approach is to display each species on a separate map
of the world and to shade the regions of the world where the species
occurs. If several species are to be shown at once, then icons for each
species can be placed on a map of the world.

(c) The use of computer resources, such as processor time, main memory,
and disk, for a set of benchmark database programs.

The resource usage of each program could be displayed as a bar plot
of the three quantities. Since the three quantities would have different
scales, a proper scaling of the resources would be necessary for this
to work well. For example, resource usage could be displayed as a
percentage of the total. Alternatively, we could use three bar plots, one
for each type of resource usage. On each of these plots there would be one
bar per program, whose height represents that program's usage of the
resource. This approach would not require any scaling. Yet another option
would be to display a line plot of each program’s resource usage. For each
program, a line would be constructed by (1) considering processor time, main
memory, and disk as different x locations, (2) letting the percentage
resource usage of a particular program for the three quantities be the
y values associated with the x values, and then (3) drawing a line to
connect these three points. Note that an ordering of the three quantities
needs to be specified, but is arbitrary. For this approach, the resource
usage of all programs could be displayed on the same plot.
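
The scaling step can be sketched as below. This is a minimal illustration that reads "percentage of the total" as each program's share of the total usage of a resource across all programs; that reading, the function name percent_of_total, and the benchmark numbers are assumptions.

```python
def percent_of_total(usage):
    """usage: {program: {resource: amount}} -> same shape, as % of each resource's total."""
    resources = sorted({r for per_prog in usage.values() for r in per_prog})
    totals = {r: sum(per_prog.get(r, 0) for per_prog in usage.values())
              for r in resources}
    return {
        prog: {r: (100.0 * per_prog.get(r, 0) / totals[r]) if totals[r] else 0.0
               for r in resources}
        for prog, per_prog in usage.items()
    }

# Hypothetical measurements in different units (seconds, megabytes, megabytes).
usage = {
    "benchmark_A": {"cpu_s": 30, "mem_mb": 512, "disk_mb": 100},
    "benchmark_B": {"cpu_s": 10, "mem_mb": 1536, "disk_mb": 300},
}
scaled = percent_of_total(usage)
print(scaled["benchmark_A"])  # {'cpu_s': 75.0, 'disk_mb': 25.0, 'mem_mb': 25.0}
```

After scaling, all three quantities share a 0 to 100 axis, so they can be drawn as grouped bars or as the three x locations of the line plot described above.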

(d) The change in occupation of workers in a particular country over the
last thirty years. Assume that you have yearly information about each
person that also includes gender and level of education.

For each gender, the occupation breakdown could be displayed as an
array of pie charts, where each row of pie charts indicates a particular
level of education and each column indicates a particular year. For
convenience, the time gap between each column could be five or ten years.

Alternatively, we could order the occupations and then, for each gender,
compute the cumulative percent employment for each occupation.
If this quantity is plotted over time for each gender, then the area
between two successive lines shows the percentage of employment for the
corresponding occupation. If a color is associated with each occupation,
then the area between each pair of successive lines can also be filled
with that occupation's color. A similar way to show the same information
would be to use a sequence of stacked bar graphs.
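
The cumulative percentages behind such a plot could be computed as in the sketch below. The function name cumulative_percent, the occupation categories, and the counts are hypothetical, and the sketch assumes every year has at least one worker.

```python
from itertools import accumulate

def cumulative_percent(counts_by_year, occupation_order):
    """counts_by_year: {year: {occupation: workers}} -> {year: [cumulative % per occupation]}"""
    result = {}
    for year, counts in counts_by_year.items():
        total = sum(counts.get(o, 0) for o in occupation_order)
        # Running sum over the occupations in their chosen (arbitrary) order.
        running = accumulate(counts.get(o, 0) for o in occupation_order)
        result[year] = [100.0 * c / total for c in running]
    return result

order = ["agriculture", "industry", "services"]
counts = {
    1990: {"agriculture": 50, "industry": 30, "services": 20},
    2020: {"agriculture": 10, "industry": 30, "services": 60},
}
lines = cumulative_percent(counts, order)
print(lines[1990])  # [50.0, 80.0, 100.0]
print(lines[2020])  # [10.0, 40.0, 100.0]
```

Each position in the lists is one line on the plot; the gap between consecutive values in a year is that occupation's share of employment, which is the band that would be colored.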

6. Describe one advantage and one disadvantage of a stem and leaf plot with
respect to a standard histogram.

A stem and leaf plot shows you the actual distribution of values. On the
other hand, a stem and leaf plot becomes rather unwieldy for a large number
of values.
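
A minimal stem and leaf sketch makes both points concrete: every original value can be read back from the display, and the display grows with the number of values. The function name stem_and_leaf and the sample values are hypothetical, and the sketch assumes non-negative integers with the tens as stem and the units digit as leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return text lines for non-negative integers: stem = tens, leaf = units digit."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Include stems with no leaves so gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

values = [12, 15, 21, 23, 23, 28, 30, 47]
plot = stem_and_leaf(values)
print("\n".join(plot))
#  1 | 25
#  2 | 1338
#  3 | 0
#  4 | 7
```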

7. How might you address the problem that a histogram depends on the number
and location of the bins?

The best approach is to estimate what the actual distribution function of the
data looks like using kernel density estimation. This branch of data analysis
is relatively well-developed and is more appropriate if the widely available,
but simplistic, approach of a histogram is not sufficient.
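
For illustration, a one-dimensional kernel density estimate with a Gaussian kernel can be sketched as below. The choice of a Gaussian kernel, the function name gaussian_kde, the sample data, and the bandwidth value are all assumptions; note that the estimate has no bins, but it still depends on the bandwidth.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density of `data` with Gaussian kernels."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        # Sum one Gaussian bump centered on each observation.
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
    return density

data = [1.0, 1.2, 1.9, 3.4, 3.6]       # hypothetical observations
density = gaussian_kde(data, bandwidth=0.5)
for x in (1.1, 2.6, 3.5):
    print(x, round(density(x), 3))
```

Evaluating density on a fine grid of x values gives a smooth curve that can be drawn in place of, or on top of, the histogram.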

8. Describe how a box plot can give information about whether the value of an
attribute is symmetrically distributed. What can you say about the
symmetry of the distributions of the attributes shown in Figure 3.11?

(a) If the line representing the median of the data is in the middle of the
box, then the data is symmetrically distributed, at least in terms of the
50% of the data between the first and third quartiles. For the remaining
data, the length of the whiskers and outliers is also an indication,
although, since these features do not involve as many points, they may
be misleading.

(b) Sepal width and length seem to be relatively symmetrically distributed,
petal length seems to be rather skewed, and petal width is somewhat
skewed.
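
The check in (a) can be turned into a number by measuring where the median sits inside the box. The sketch below uses the quartile (Bowley) skewness coefficient, which is not mentioned in the text and is swapped in here only as one convenient measure; the function name and both data sets are hypothetical, and the quartiles come from statistics.quantiles with its default method.

```python
import statistics

def quartile_skewness(values):
    """Bowley coefficient: 0 when the median is centered in the box, > 0 when right-skewed."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    # Compare the upper half of the box with the lower half, relative to the box height.
    return ((q3 - median) - (median - q1)) / (q3 - q1)

symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 1, 2, 2, 3, 5, 8, 13, 21]
print(quartile_skewness(symmetric))      # 0.0
print(quartile_skewness(right_skewed))   # about 0.667
```

Like the box itself, this measure only reflects the middle half of the data, so the whiskers and outliers still need to be inspected separately.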

9. Compare sepal length, sepal width, petal length, and petal width, using
Figure 3.12.

For Setosa, sepal length > sepal width > petal length > petal width. For
Versicolour and Virginica, sepal length > sepal width and petal length >
petal width, but although sepal length > petal length, petal length > sepal
width.

10. Comment on the use of a box plot to explore a data set with four attributes:
age, weight, height, and income.

A great deal of information can be obtained by looking at (1) the box plots
for each attribute, and (2) the box plots for a particular attribute across
various categories of a second attribute. For example, if we compare the box
plots of weight for different categories of age, we would see that weight
increases with age.

11. Give a possible explanation as to why most of the values of petal length and
width fall in the buckets along the diagonal in Figure 3.9.

We would expect such a distribution if the three species of Iris can be ordered
according to their size, and if petal length and width are both correlated with
the size of the plant and with each other.

12. Use Figures 3.14 and 3.15 to identify a characteristic shared by the petal
width and petal length attributes.

There is a relatively flat area in the curves of the empirical CDFs and the
percentile plots for both petal length and petal width. This indicates a set
of flowers for which these attributes have a relatively uniform value.

13. Simple line plots, such as that displayed in Figure 2.12 on page 56, which
shows two time series, can be used to effectively display high-dimensional
data. For example, in Figure 2.12 it is easy to tell that the frequencies of the
two time series are different. What characteristic of time series allows the
effective visualization of high-dimensional data?

The fact that the attribute values are ordered.

14. Describe the types of situations that produce sparse or dense data cubes.
Illustrate with examples other than those used in the book.

Any set of data for which all combinations of values are unlikely to occur
would produce sparse data cubes. This would include sets of continuous
attributes where the set of objects described by the attributes doesn’t occupy
the entire data space, but only a fraction of it, as well as discrete attributes,
where many combinations of values don’t occur.

A dense data cube would tend to arise when either almost all combinations of
the categories of the underlying attributes occur, or the level of aggregation is
high enough so that all combinations are likely to have values. For example,
consider a data set that contains the type of traffic accident, as well as its
location and date. The original data cube would be very sparse, but if it is
aggregated to have categories consisting of single- or multiple-car accidents,
the state of the accident, and the month in which it occurred, then we would
obtain a dense data cube.
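
The effect of aggregation on sparsity can be illustrated by counting the fraction of occupied cells before and after rolling up. The records, field names, and the function name fill_fraction below are hypothetical, and the sketch takes each dimension's values from the records themselves rather than from a full calendar or map.

```python
def fill_fraction(records, dims):
    """Fraction of cells in the cube over `dims` that contain at least one record."""
    domains = [sorted({r[d] for r in records}) for d in dims]
    occupied = {tuple(r[d] for d in dims) for r in records}
    n_cells = 1
    for domain in domains:
        n_cells *= len(domain)
    return len(occupied) / n_cells

# Hypothetical accident records with a fine (city, day) and a coarse (state, month) level.
records = [
    {"city": "Austin", "state": "TX", "day": "2024-01-03", "month": "Jan"},
    {"city": "Dallas", "state": "TX", "day": "2024-01-17", "month": "Jan"},
    {"city": "Austin", "state": "TX", "day": "2024-02-09", "month": "Feb"},
    {"city": "Fresno", "state": "CA", "day": "2024-01-21", "month": "Jan"},
    {"city": "Fresno", "state": "CA", "day": "2024-02-02", "month": "Feb"},
]
fine = fill_fraction(records, ["city", "day"])        # 5 of 15 cells occupied
coarse = fill_fraction(records, ["state", "month"])   # 4 of 4 cells occupied
print(round(fine, 3), coarse)  # 0.333 1.0
```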

15. How might you extend the notion of multidimensional data analysis so that
the target variable is a qualitative variable? In other words, what sorts of
summary statistics or data visualizations would be of interest?

Summary statistics of interest would be the frequencies with
which values or combinations of values, target and otherwise, occur. From
these we could derive conditional relationships among various values. In turn,
these relationships could be displayed using a graph similar to that used to
display Bayesian networks.
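
As a small sketch of this idea, the code below counts joint and marginal frequencies and derives conditional relative frequencies for a qualitative target. The records, the attribute names, and the function name conditional_frequencies are hypothetical.

```python
from collections import Counter

def conditional_frequencies(records, given, target):
    """Estimate the relative frequency of each target value for each value of `given`."""
    joint = Counter((r[given], r[target]) for r in records)
    marginal = Counter(r[given] for r in records)
    return {(g, t): count / marginal[g] for (g, t), count in joint.items()}

records = [
    {"education": "college", "occupation": "engineer"},
    {"education": "college", "occupation": "teacher"},
    {"education": "college", "occupation": "engineer"},
    {"education": "high school", "occupation": "clerk"},
]
freq = conditional_frequencies(records, given="education", target="occupation")
for key, value in sorted(freq.items()):
    print(key, round(value, 3))
```

Each (given, target) pair with a high conditional frequency is a candidate edge in the kind of graph display mentioned above.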

16. Construct a data cube from Table 3.1. Is this a dense or sparse data cube?
If it is sparse, identify the cells that are empty.

The data cube is shown in Table 3.2. It is a dense cube; only two cells are
empty.

Table 3.1. Fact table for Exercise 16.

Product ID   Location ID   Number Sold
1            1             10
1            3             6
2            1             5
2            2             22

Table 3.2. Data cube for Exercise 16.

                   Location
Product        1      2      3   Total
1             10      0      6      16
2              5     22      0      27
Total         15     22      6      43
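
The construction of the cube from the fact table can be sketched as follows. The variable names are illustrative; the rows are the four rows of Table 3.1.

```python
# Rows of the fact table: (product_id, location_id, number_sold).
facts = [(1, 1, 10), (1, 3, 6), (2, 1, 5), (2, 2, 22)]

products = sorted({p for p, _, _ in facts})
locations = sorted({loc for _, loc, _ in facts})

# Start with every cell at 0, then add the sales from each fact row.
cube = {(p, loc): 0 for p in products for loc in locations}
for p, loc, sold in facts:
    cube[(p, loc)] += sold

# Cells with no fact row are the empty cells of the cube.
observed = {(p, loc) for p, loc, _ in facts}
empty_cells = sorted(set(cube) - observed)

row_totals = {p: sum(cube[(p, loc)] for loc in locations) for p in products}
col_totals = {loc: sum(cube[(p, loc)] for p in products) for loc in locations}
grand_total = sum(row_totals.values())

print(empty_cells)   # [(1, 2), (2, 3)]
print(row_totals)    # {1: 16, 2: 27}
print(col_totals)    # {1: 15, 2: 22, 3: 6}
print(grand_total)   # 43
```

The two empty cells are (product 1, location 2) and (product 2, location 3), which appear as 0 in Table 3.2.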

17. Discuss the differences between dimensionality reduction based on
aggregation and dimensionality reduction based on techniques such as PCA and
SVD.

The dimensionality reduction of PCA or SVD can be viewed as a projection of the
data onto a reduced set of dimensions. In aggregation, groups of dimensions
are combined. In some cases, as when days are aggregated into months or
the sales of a product are aggregated by store location, the aggregation can
be viewed as a change of scale. In contrast, the dimensionality reduction
provided by PCA and SVD does not have such an interpretation.
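
The contrast can be illustrated with a small sketch. The first part aggregates days into months; the second finds the first principal direction of two attributes using the closed-form angle for a 2x2 covariance matrix, which is a shortcut that only works in two dimensions. The data, names, and the two-attribute restriction are all assumptions made for illustration.

```python
import math

# Aggregation: combine the day dimension into months (a change of scale).
# Hypothetical daily sales keyed by (month, day).
daily_sales = {("Jan", 1): 4, ("Jan", 2): 6, ("Feb", 1): 3, ("Feb", 2): 7}
monthly_sales = {}
for (month, _day), sold in daily_sales.items():
    monthly_sales[month] = monthly_sales.get(month, 0) + sold

# PCA on two attributes: project onto the direction of greatest variance.
def first_principal_direction(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Angle of the major axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
(mean_x, mean_y), (dx, dy) = first_principal_direction(points)
# One coordinate per point replaces the original two.
scores = [(x - mean_x) * dx + (y - mean_y) * dy for x, y in points]

print(monthly_sales)                    # {'Jan': 10, 'Feb': 10}
print(round(dx, 3), round(dy, 3))       # 0.447 0.894
print([round(s, 3) for s in scores])
```

The monthly totals keep the meaning of the original dimension at a coarser scale, whereas the PCA coordinate is a weighted mix of both attributes with no such reading.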
