Module 3-223

Help Psychology statistic


When you were a child, you might have seen a dark brown bottle of a mysterious liquid high on a shelf in your garage, out of your reach. “That’s poison,” a parent might have warned, pointing to the container. “Don’t touch.” It nonetheless had some use. It was there for a reason. This example illustrates a principle: Items that can be misused often have a valid use, and vice versa. An average (e.g. mean, median, and mode) is not an exception; it can be useful, but if it is misused or misinterpreted it can be destructive. In what way might an average be misused? Alternatively, how might an average be misinterpreted? For instance, how can a misinterpreted average pertain to stereotyping? How can we avoid misuse or misinterpretation of averages? Provide a specific example to illustrate your explanation.

3-2 Exercise: Measures of Average and Measures of Deviation

 


Instructions

Using Microsoft Excel, complete the following exercises out of your textbook:

· Exercise 14: Perception of Time (Section 4.1: p. 125)

14. Perception of Time. The data below are times (in seconds) recorded when statistics students participated in an experiment to test their ability to determine when 1 minute (60 seconds) had passed. What do these data suggest about students’ perception of time? 53   52   75   62   68   58   49   49

· Exercise 13: Perception of Time (Section 4.3: p. 145)

13. Perception of Time. The following times (in seconds) were recorded when statistics students participated in an experiment to test their ability to determine when 1 minute (60 seconds) had passed: 53   52   75   62   68   58   49   49

In these exercises, you will practice computing measures of average and measures of deviation.

Submit both a Word document and two Excel documents as needed to show calculations and interpret the results.

4 Describing Data

4.1 What Is Average?

The term average is used so often that you may be surprised that it does not always have the same meaning. In this section, we’ll explore common ways of characterizing the center of a data distribution, all of which are sometimes called the “average.”

Mean, Median, and Mode

Table 4.1 shows the number of movies (original and sequels or prequels) in each of five popular science fiction film franchises. What is the average number of movies in these franchises? One way to answer this question is to compute the mean. (The formal term arithmetic mean is commonly referred to simply as the mean.) We find the mean by dividing the total number of movies by five (because there are five franchises listed in the data set):

\[ \text{mean} = \frac{5 + 8 + 13 + 7 + 5}{5} = \frac{38}{5} = 7.6 \]

We say that these five franchises have a mean of 7.6 movies. More generally, we find the mean of any data set by dividing the sum of all of the data values by the number of data values. The mean is what most people think of as the average. In essence, it represents the balance point for a quantitative data distribution, as shown in Figure 4.1.

TABLE 4.1 Five Science Fiction Film Franchises

| Franchise | Number of movies (as of 2016) |
|---|---|
| Alien | 5 |
| Planet of the Apes | 8 |
| Star Trek | 13 |
| Star Wars | 7 |
| Terminator | 5 |

Figure 4.1 A histogram made from blocks would balance at the position of its mean.

We could also describe the average number of movies by using another measure of center: the median, or middle value, of the data set. To find a median, we first arrange the data values in ascending (or descending) order, repeating data values that appear more than once. The next step depends on whether the number of data values is odd or even.

• Odd: If the number of values is odd, there is exactly one value in the middle of the list, and this value is the median.

• Even: If the number of values is even, there are two values in the middle of the list, and the median is the number that lies halfway between them.

For example, arranging the data in Table 4.1 in ascending order, we get 5, 5, 7, 8, 13. There are five data values, which is an odd number. Therefore, the median number of movies among these franchises is 7, because 7 is the middle number in the list.

The mode is the most common value or group of values in a data set. In the case of the film franchises, the mode is 5 movies because this value occurs twice in the data set, while the other values occur only once. A data set may have one mode, more than one mode, or no mode. Sometimes the mode refers to a group of closely spaced values rather than a single value. The mode is used more commonly for qualitative data than for quantitative data, because neither the mean nor the median can be used with qualitative data.
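As a quick check of the three measures described above, the following sketch recomputes them for the Table 4.1 data with Python's standard `statistics` module. The variable names are illustrative only and are not part of the text.

```python
from statistics import mean, median, multimode

# Number of movies per franchise, copied from Table 4.1
movies = [5, 8, 13, 7, 5]

franchise_mean = mean(movies)        # sum of all values / number of values
franchise_median = median(movies)    # middle value of the sorted list
franchise_modes = multimode(movies)  # list of the most common value(s)

print(franchise_mean, franchise_median, franchise_modes)  # 7.6 7 [5]
```

`multimode` returns a list, so it also handles data sets with more than one mode.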


Definitions: Measures of Center in a Distribution

The mean is what we most commonly call the average value. It is found as follows:

\[ \text{mean} = \frac{\text{sum of all values}}{\text{total number of values}} \]

The median is the middle value in the sorted data set, or halfway between the two middle values if the number of values is even.

The mode is the most common value (or group of values) in a data set.

THINK ABOUT IT Answer the following questions about the block representation in Figure 4.1. (a) Where is the median? (b) Is the median less than, greater than, or equal to the mean? (c) Which block represents the mode? Explain your answers.

When rounding, we will use the following rule for all the calculations discussed in this chapter.

Rounding Rule for Statistical Calculations

In general, you should express answers with one more decimal place of precision than is found in the original list of data. For example, if the data are given as whole numbers, you should round their mean to the nearest tenth; if the data are given to the nearest tenth (with one decimal place), you should round their mean to the nearest hundredth (with two decimal places); and so on. As always, round only the final answer and not any intermediate values used in your calculations.

Notice how we applied this rule in the film franchise example. The data in Table 4.1 consist of whole numbers, so the mean is given as 7.6.
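The rounding rule can be sketched in code as follows. This is a minimal illustration, not a definitive implementation: the helper name `round_statistic` is hypothetical, the value is passed as a string to avoid binary floating-point surprises, and rounding halves upward is an assumption, since the rule itself does not say how ties are handled.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_statistic(value: str, data_decimals: int) -> Decimal:
    """Round to one more decimal place than the original data have.

    Assumption: ties are rounded half-up; the text does not specify this.
    """
    places = data_decimals + 1
    quantum = Decimal(10) ** -places  # e.g. places=1 gives Decimal('0.1')
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)

# Whole-number data (0 decimal places) -> report the mean to the nearest tenth
print(round_statistic("7.6", 0))
# Data given to the nearest tenth -> report the mean to the nearest hundredth
print(round_statistic("12.3456", 1))
```

Only the final answer is passed to the helper; intermediate values are left unrounded, as the rule requires.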

Example 1 Price Data

Eight grocery stores sell the PR energy bar for the following prices:

$1.79   $1.29   $1.29   $1.35   $1.39   $1.49   $1.59   $1.09

Find the mean, median, and mode for these prices.

SOLUTION

The mean price is $1.41:

\[ \text{mean} = \frac{\$1.79 + \$1.29 + \$1.29 + \$1.35 + \$1.39 + \$1.49 + \$1.59 + \$1.09}{8} = \$1.41 \]


To find the median, we first sort the data in ascending order:

\[ \underbrace{\$1.09,\ \$1.29,\ \$1.29}_{\text{3 values below}},\ \underbrace{\$1.35,\ \$1.39}_{\text{2 middle values}},\ \underbrace{\$1.49,\ \$1.59,\ \$1.79}_{\text{3 values above}} \]

Because there are eight prices (an even number), there are two values in the middle of the list: $1.35 and $1.39. Therefore the median lies halfway between these two values, which we calculate by adding them and dividing by 2:

\[ \text{median} = \frac{\$1.35 + \$1.39}{2} = \$1.37 \]

Using the rounding rule, we could express the mean and median as $1.410 and $1.370, respectively, though the extra zeros are optional in this case.

The mode is $1.29 because this price occurs more times than any other price.
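The Example 1 results can be reproduced with a short sketch. Storing the prices as whole cents is an assumption made here only to keep the arithmetic exact; the variable names are illustrative.

```python
from statistics import mean, median, mode

# Example 1 prices, stored as whole cents to avoid floating-point noise
prices_cents = [179, 129, 129, 135, 139, 149, 159, 109]

mean_price = mean(prices_cents) / 100      # 1128 cents / 8 values
median_price = median(prices_cents) / 100  # halfway between 135 and 139 cents
mode_price = mode(prices_cents) / 100      # most frequent price

print(f"mean ${mean_price:.2f}, median ${median_price:.2f}, mode ${mode_price:.2f}")
# mean $1.41, median $1.37, mode $1.29
```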

 Using Technology: Mean, Median, and Mode

EXCEL You can use Excel’s built-in function AVERAGE for calculating a mean and its separate functions MEDIAN and MODE for finding those statistics. The screen shot below shows the use of these functions for the data from Example 1. Column E shows the functions and Column F shows the results. Note: If a data set has no mode, the MODE function returns “#N/A”; if it has more than one mode, the MODE function will return only the first mode. (The Excel function MODE.MULT can find multiple modes, but its use is more complex; find details by entering the search term “Excel MODE.MULT” in your Web browser.)

Source: MS Excel 2013.

You can get even more information by installing the Data Analysis Toolpak (available with many but not all versions of Excel); to find it, use the Help feature and search for “Data analysis.” Once the Toolpak is installed, click on Data, select Data Analysis, then select Descriptive Statistics in the pop-up window, and click OK. In the dialog box, enter the input range (such as B1:B8 for the 8 values in Column B), click on Summary Statistics, then click OK. Results will include the mean and median, as well as other statistics to be discussed in the following sections.

Alternatively, the XLSTAT add-in that is a supplement to this text can be used with both Windows and Mac computers. Enter the data as described above, click on XLSTAT, click on Describing Data, and then select Descriptive Statistics. Select Quantitative Data and enter the range of the data, such as B1:B8. (If the first row is the name of the data, be sure to click on the box next to “Sample labels.”) Click OK.

STATDISK Enter the data in the Sample Editor or open an existing data set. Click on Data and select Explore Data – Descriptive Statistics. Select the data column to explore and then click Evaluate to get the various descriptive statistics, including the mean and median, as well as other statistics to be discussed in the following sections.

TI-83/84 PLUS First enter the data in list L1 by pressing STAT, then selecting Edit and pressing the ENTER key. After the data values have been entered, press STAT and select CALC, then select 1-Var Stats. Enter L1 for List, leave FreqList blank, select Calculate, and press the ENTER key twice. The results will include the mean and median, as well as other statistics to be discussed in the following sections. Use the down-arrow key (˅) to view the results that don’t fit on the initial display.

Effects of Outliers

To explore the differences among the mean, median, and mode, imagine that the five graduating seniors on a men’s college basketball team receive the following first-year contract offers to play in the National Basketball Association (zero indicates that the player did not receive a contract offer):

0 0 0 0 $10,000,000


The mean contract offer is

\[ \text{mean} = \frac{0 + 0 + 0 + 0 + \$10{,}000{,}000}{5} = \$2{,}000{,}000 \]

Is it therefore fair to say that the average senior on this basketball team received a $2 million contract offer?

By the Way

A survey once found that geography majors from the University of North Carolina had a far higher mean starting salary than geography majors from other schools. The reason for the high mean turned out to be a single outlier—the basketball superstar and geography major named Michael Jordan.

Not really. The problem is that the single player receiving the large offer makes the mean much larger than it would be otherwise. If we ignore this one player and look only at the other four, the mean contract offer is zero. Because the single value of $10,000,000 is so extreme compared with the others, we say that it is an outlier (or outlying value). As our example shows, an outlier can pull the mean significantly upward (or downward), thereby making the mean unrepresentative of the data set as a whole.

Definition

An outlier in a data set is a value that is much higher or much lower than almost all other values.

While the outlier pulls the mean contract offer upward, it has no effect on the median contract offer, which remains zero for the five players. In general, the value of an outlier has little or no effect on either the median or mode, because outliers don’t lie in the middle of a data set and are not common values. Table 4.2 summarizes the characteristics of the mean, median, and mode, including the effects of outliers on each measure.

TABLE 4.2 Comparison of Mean, Median, and Mode

| Measure | Definition | How common? | Existence | Takes every value into account? | Affected by outliers? | Advantages |
|---|---|---|---|---|---|---|
| Mean | (sum of all values) / (total number of values) | most familiar “average” | always exists | yes | yes | commonly understood; works well with many statistical methods |
| Median | middle value (of an ordered data set) | common | always exists | no (aside from counting the total number of values) | no | when there are outliers, may be more representative of an “average” than the mean |
| Mode | most frequent value | sometimes used | may be no mode, one mode, or more than one mode | no | no | most appropriate for qualitative data (see Section 2.1) |
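The effect of the single outlier in the basketball example can be seen by computing the mean and median with and without it. This is a small sketch using Python's `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

# First-year contract offers for the five seniors (0 = no offer)
offers = [0, 0, 0, 0, 10_000_000]
without_outlier = offers[:-1]  # the same data with the $10,000,000 offer dropped

# The outlier pulls the mean up to 2,000,000; the median stays at 0
print(mean(offers), median(offers))

# Without the outlier, both the mean and the median are 0
print(mean(without_outlier), median(without_outlier))
```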

THINK ABOUT IT Is it fair to use the median as the average contract offer for the five players? Why or why not?

Deciding how to deal with outliers is one of the more important issues in statistics. Sometimes, as in our basketball example, an outlier is a legitimate value that must be understood in order to interpret the mean and median properly. Other times, outliers may indicate mistakes in a data set. Deciding when outliers are important and when they may simply be mistakes can be very difficult.

Example 2 Mistake?

A track coach wants to determine an appropriate heart rate for her athletes during their workouts. She chooses five of her best runners and asks them to wear heart rate monitors during a workout. In the middle of the workout, she records these values (in beats per minute): 130, 135, 140, 145, 325. Which is a better measure of the average in this case: the mean or the median? Why?


SOLUTION

Four of the five values are fairly close together and seem reasonable for mid-workout heart rates. The high value of 325 beats per minute is an outlier. This outlier seems likely to be a mistake (perhaps caused by a faulty heart rate monitor), because anyone with such a high heart rate would be in cardiac arrest. If the coach uses the mean as the average, she will be including this outlier, which means she will be including any mistake made when it was recorded. If she uses the median as the average, she’ll have a more reasonable value, because the median won’t be affected by the outlier.

“Average” Confusion

The different meanings of average can lead to confusion. Sometimes this confusion arises because we are not told whether the average is the mean or the median, and other times it happens because we are not given enough information about how the average was computed. The following examples illustrate two such situations.

Example 3 Wage Dispute

A newspaper surveys wages for workers in high-tech companies in the region and reports an average of $42 per hour. The workers at one large firm immediately request a pay raise, claiming that they work as hard as employees at other companies but their average wage is only $36. The management rejects their request, telling them that they are overpaid because their average wage, in fact, is $48. Can both sides be right? Explain.

SOLUTION

Both sides can be right if they are using different definitions of average. In this case, the workers may be using the median while the management is using the mean. For example, imagine that there are only five workers at the company and their wages are $36, $36, $36, $36, and $96. The median of these five wages is $36 (as the workers claimed), but the mean is $48 (as management claimed).

Figures won’t lie, but liars will figure.

—Charles H. Grosvenor

Example 4 Which Mean?

All 100 first-year students at a small college take three courses in the Core Studies program. Two courses are taught in large lectures, with all 100 students in a single class. The third course is taught in 10 classes of 10 students each. Students and administrators get into an argument about whether classes are too large. The students claim that the mean size of their Core Studies classes is 70. The administrators claim that the mean class size is only 25. Can both sides be right? Explain.

SOLUTION

The students calculated the mean size of the classes in which each student is personally enrolled. Each student is taking two classes with enrollments of 100 and one class with an enrollment of 10, so the mean size of each student’s classes is

\[ \text{mean} = \frac{\text{total enrollment in student's classes}}{\text{number of classes student is taking}} = \frac{100 + 100 + 10}{3} = 70 \text{ students} \]

The administrators calculated the mean enrollment in all classes. There are two classes with 100 students and 10 classes with 10 students, making a total enrollment of 300 students in 12 classes. The mean enrollment per class is

\[ \frac{\text{total enrollment}}{\text{number of classes}} = \frac{300}{12} = 25 \text{ students} \]

The two claims about the mean are both correct, but the two sides are talking about different means. The students calculated the mean class size per student, while the administrators calculated the mean number of students per class.
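The two means in Example 4 can be computed side by side. In this sketch the students' figure is obtained by weighting each class by its own enrollment, which is one way (an assumption of this illustration, not the text's wording) to express "mean class size per student"; it agrees with the per-student calculation because every student takes the same three kinds of classes.

```python
# Core Studies classes: two lectures of 100 students and ten sections of 10
class_sizes = [100, 100] + [10] * 10

# Administrators' view: every class counts once
mean_per_class = sum(class_sizes) / len(class_sizes)

# Students' view: every enrollment counts once, so a class of size s
# contributes s enrollments that each experience a class of size s
mean_per_student = sum(s * s for s in class_sizes) / sum(class_sizes)

print(mean_per_class, mean_per_student)  # 25.0 70.0
```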

THINK ABOUT IT In Example 4, could the administrators redistribute faculty assignments so that all classes have 25 students each? How? Discuss the advantages and disadvantages of such a change.


Weighted Mean

Suppose your course grade is based on four quizzes and one final exam. Each quiz counts as 15% of your final grade, and the final counts as 40%. Your quiz scores are 75, 80, 84, and 88, and your final exam score is 96. What is your overall score?

By the Way

Sports statistics that rate players or teams according to their performance in many different categories are usually weighted means. Examples include the earned run average (ERA) and the slugging percentage in baseball, the quarterback rating in football, and computerized rankings of college teams.

Because the final exam counts more than the quizzes, a simple mean of the five scores does not give your final score. Instead, we must assign a weight (indicating the relative importance) to each score. In this case, we assign weights of 15 (for the 15%) to each of the quizzes and 40 (for the 40%) to the final. We then find the weighted mean by adding the products of each score and its weight and then dividing by the sum of the weights:

\[ \begin{aligned} \text{weighted mean} &= \frac{(75 \times 15) + (80 \times 15) + (84 \times 15) + (88 \times 15) + (96 \times 40)}{15 + 15 + 15 + 15 + 40} \\ &= \frac{8745}{100} = 87.45 \end{aligned} \]

The weighted mean of 87.45 properly accounts for the different weights of the quizzes and the exam. Following the rounding rule, we round this score to 87.5.

Weighted means are appropriate whenever the data values vary in their degree of importance. You can always find a weighted mean using the following formula.

Definition

A weighted mean accounts for variations in the relative importance of data values. Each data value is assigned a weight and the weighted mean is

\[ \text{weighted mean} = \frac{\text{sum of (each data value} \times \text{its weight)}}{\text{sum of all weights}} \]
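The weighted mean formula translates directly into a small function. The sketch below is illustrative: the function name `weighted_mean` and the input checks are assumptions of this example, and it is applied to the course grade data from the beginning of this subsection.

```python
def weighted_mean(values, weights):
    """Sum of (each data value x its weight) divided by the sum of all weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Four quizzes weighted 15 each and a final exam weighted 40
scores = [75, 80, 84, 88, 96]
weights = [15, 15, 15, 15, 40]

course_score = weighted_mean(scores, weights)
print(course_score)  # 87.45
```

Because the function divides by the sum of the weights, only the relative sizes of the weights matter.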

THINK ABOUT IT Because the weights in the course grade example are percentages, we could think of them as 0.15 and 0.40 rather than 15% and 40%. Calculate the weighted mean by using the weights 0.15 and 0.40. Do you still get the same answer? Why or why not?

Example 5 GPA

Randall has 38 credits with a grade of A, 22 credits with a grade of B, and 7 credits with a grade of C. What is his grade point average (GPA)? Base the GPA on values of 4 points for an A, 3 points for a B, and 2 points for a C.

SOLUTION

The grades of A, B, and C represent data values of 4, 3, and 2, respectively. The numbers of credits are the weights. The As represent a data value of 4 with a weight of 38, the Bs represent a data value of 3 with a weight of 22, and the Cs represent a data value of 2 with a weight of 7. The weighted mean is

\[ \text{weighted mean} = \frac{(4 \times 38) + (3 \times 22) + (2 \times 7)}{38 + 22 + 7} = \frac{232}{67} \approx 3.46 \]

Following our rounding rule, we round Randall’s GPA from 3.46 to 3.5.

Example 6 Stock Voting

Voting in corporate elections is usually weighted by the amount of stock owned by each voter. Suppose a company has five stockholders who vote on whether the company should embark on a new advertising campaign. The votes (Y = yes, N = no) are as follows:

| Stockholder | Shares owned | Vote |
|---|---|---|
| A | 225 | Y |
| B | 170 | Y |
| C | 275 | Y |
| D | 500 | N |
| E | 90 | N |


According to the company’s bylaws, a measure passes only if at least 60% of the votes are in favor of it. Does this measure pass?

SOLUTION

We can regard a yes vote as having a value of 1 and a no vote a value of 0. The number of shares is the weight for the vote of each stockholder, so Stockholder A’s vote represents a value of 1 with a weight of 225, stockholder B’s vote represents a value of 1 with a weight of 170, and so on. The weighted mean of the votes is

\[ \begin{aligned} \text{weighted mean} &= \frac{(1 \times 225) + (1 \times 170) + (1 \times 275) + (0 \times 500) + (0 \times 90)}{225 + 170 + 275 + 500 + 90} \\ &= \frac{670}{1260} \approx 0.53 \end{aligned} \]

The weighted vote is 53% (or 0.53) in favor, which is short of the required 60%, so the measure does not pass.
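The share-weighted vote in Example 6 can be sketched as follows. The dictionary layout and the names `ballots` and `REQUIRED_SHARE` are illustrative choices, not part of the example; the share counts and votes are those used in the solution.

```python
# Shares owned and vote cast by each stockholder (Y = yes, N = no)
ballots = {
    "A": (225, "Y"),
    "B": (170, "Y"),
    "C": (275, "Y"),
    "D": (500, "N"),
    "E": (90, "N"),
}
REQUIRED_SHARE = 0.60  # bylaws: at least 60% of the votes must be in favor

yes_shares = sum(shares for shares, vote in ballots.values() if vote == "Y")
total_shares = sum(shares for shares, _ in ballots.values())

# Equivalent to a weighted mean with yes = 1, no = 0, and shares as weights
yes_fraction = yes_shares / total_shares
passes = yes_fraction >= REQUIRED_SHARE

print(f"{yes_fraction:.2f}", passes)  # 0.53 False
```

Three of the five stockholders vote yes, but the measure still fails because the weights, not the head count, decide the outcome.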

Means with Summation Notation (Optional Subsection)

Many statistical formulas, including the formula for the mean, can be written compactly with a mathematical notation called summation notation. The symbol ∑ (the Greek capital letter sigma) is called the summation sign and indicates that a set of numbers should be added. We use the symbol x to represent each value in a data set, so we write the sum of all the data values as

\[ \text{sum of all values} = \sum x \]

For example, if a sample consists of 25 exam scores, ∑x represents the sum of all 25 scores. Similarly, if a sample consists of the incomes of 10,000 families, ∑x represents the total dollar value of all 10,000 incomes.

Technical Note

Summations are often written with the use of an index that specifies how to step through the sum. For example, the symbol \(x_i\) indicates the ith data value in the set; the letter i is the index. We then write the sum of all values as

\[ \sum_{i=1}^{n} x_i \]

We read this expression as “the sum of the \(x_i\) values, starting with i = 1 and continuing to i = n, where n is the total number of data values in the set.” With this notation, the formula for the mean is written

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

We use n to represent the total number of values in the sample. Thus, the general formula for the mean is

\[ \bar{x} = \text{sample mean} = \frac{\text{sum of all values}}{\text{total number of values}} = \frac{\sum x}{n} \]

The symbol x̄ is the standard symbol for the mean of a sample. When dealing with the mean of a population rather than a sample, statisticians instead use the Greek letter μ (mu).

Summation notation also makes it easy to express a general formula for the weighted mean. Again we use the symbol x to represent each data value, and we let w represent the weight of each data value. The sum of the products of each data value and its corresponding weight is ∑(x × w). The sum of the weights is ∑w. Therefore, the formula for the weighted mean is

\[ \text{weighted mean} = \frac{\sum (x \times w)}{\sum w} \]

Means and Medians with Binned Data (Optional Subsection)

The ideas of this section can be extended to a frequency table simply by assuming that the middle value in a bin represents all the data values in that bin. For example, consider the following frequency table of 50 binned ages:

| Age (years) | Frequency |
|---|---|
| 0–6 | 10 |
| 7–13 | 10 |
| 14–20 | 10 |
| 21–27 | 20 |

The middle value of the first bin is 3, so we assume that the value of 3 occurs 10 times. Continuing to use the middle value of each bin, the sum of the 50 values in the table is

\[ (3 \times 10) + (10 \times 10) + (17 \times 10) + (24 \times 20) = 780 \]


The mean is therefore 780/50 = 15.6 years. With 50 values, the median is between the 25th and 26th values. These values fall within the bin 14–20, so we call this bin the median class for the data. The mode is the bin with the highest frequency, which is the bin 21–27 in this case.
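The binned calculations can be sketched in code by treating every value in a bin as if it sat at the bin's middle value. The tuple layout and the helper name `bin_containing` are assumptions of this illustration.

```python
# Binned ages as (lower bound, upper bound, frequency)
bins = [(0, 6, 10), (7, 13, 10), (14, 20, 10), (21, 27, 20)]

n = sum(freq for _, _, freq in bins)  # total number of values (50)

# Mean: each bin contributes (middle value x frequency)
binned_mean = sum((lo + hi) / 2 * freq for lo, hi, freq in bins) / n

def bin_containing(position):
    """Return the (lower, upper) bin holding the value at a 1-based sorted position."""
    cumulative = 0
    for lo, hi, freq in bins:
        cumulative += freq
        if position <= cumulative:
            return (lo, hi)
    raise ValueError("position exceeds the number of values")

# With an even count, the median lies between positions n/2 and n/2 + 1
median_class = bin_containing(n // 2)
assert median_class == bin_containing(n // 2 + 1)  # both fall in the same bin here

# Mode: the bin with the highest frequency
modal_bin = max(bins, key=lambda b: b[2])[:2]

print(binned_mean, median_class, modal_bin)
```

If the two middle positions fell in different bins, a single median class would not be defined and the inline check would fail; this sketch does not handle that case.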

Section 4.1 Exercises

Statistical Literacy and Critical Thinking

· 1. Average. Define and distinguish among mean, median, and mode.

· 2. Outliers. What are outliers? Describe the effects of outliers on the mean, median, and mode.

· 3. Average Confusion. Briefly describe at least two possible sources of confusion about the “average.”

· 4. Weighting. What is a weighted mean, and when is it appropriate to use one?

Does It Make Sense? For Exercises 5–9, determine whether the statement makes sense (or is clearly true) or does not make sense (or is clearly false). Explain clearly; not all of these statements have definitive answers, so your explanation is more important than your chosen answer.

· 5. Soccer Salary. A survey found that the mean salary for professional soccer players is much higher than the median salary.

· 6. Births. A study involves a large number of families with exactly four children. After finding that the mean number of girls is 1.97, researchers rounded the result to 2 because it is impossible to get 1.97 girls in a family of four children.

· 7. Mode. In an analysis of salaries paid to sales personnel in a company, it is found that those salaries have modes of $75,000 and $95,000.

· 8. Employment Data. A survey asked people their working status, and coded the responses as follows: 1 = Working; 2 = With a job but not at work; 3 = Looking for work; 4 = Not working; 7 = Refused to answer; 9 = Don’t know. I analyzed the results and found that they have a mean of 2.6.

· 9. Mean Wage. To find the mean wage of restaurant workers in the United States, an economist found the mean wage for each of the 50 states, then calculated the mean of those 50 numbers.

Concepts and Applications

Mean, Median, and Mode. In Exercises 10–16, find the mean, median, and mode, and answer any additional questions.

· 10. Top 5 Celebrity Incomes. Listed below are the earnings (in millions of dollars) of the celebrities with the five highest incomes in a recent year. The celebrities in order are Steven Spielberg, Howard Stern, George Lucas, Oprah Winfrey, and Jerry Seinfeld. Can this “Top 5” list be used to learn anything about the mean annual earnings of all celebrities?

225   332   302   235   100

· 11. Car Crash Test Measurements. Listed below are measurements of the “head injury criterion (hic)” for seven small cars tested in crashes by the National Highway Traffic Safety Administration: Chevrolet Aveo, Honda Civic, Volvo S40, VW Jetta, Hyundai Elantra, Kia Rio, and Kia Spectra. Higher numbers are associated with a higher risk of injury. Can you draw any conclusion about the risk of head injury in small cars versus larger cars?

371   356   393   544   326   520   501

· 12. Lead in Medicine. Ayurveda is a traditional medical system commonly used in India. Listed below are the lead concentrations (in μg/g) measured in different Ayurveda medicines (manufactured in the United States). (Data from “Lead, Mercury, and Arsenic in U.S. and Indian Manufactured Ayurvedic Medicines Sold via the Internet,” by Saper et al., Journal of the American Medical Association, Vol. 300, No. 8.) What do the decimal values of the listed amounts suggest about the precision of the measurements?

20.5   3.0   6.5   6.0   5.5   20.5   7.5   12.0   11.5   17.5

· 13. Saints in Super Bowl. Listed below are the numbers on the jerseys of the starting lineup for the New Orleans Saints when they won their first Super Bowl football game. Are the mean, median, and mode for these data meaningful?

9   23   25   88   12   19   74   77   76   73   78

· 14. Perception of Time. The data below are times (in seconds) recorded when statistics students participated in an experiment to test their ability to determine when 1 minute (60 seconds) had passed. What do these data suggest about students’ perception of time?

53   52   75   62   68   58   49   49

· 15. Body Temperatures. Below are body temperatures (in °F) of randomly selected normal, healthy adults. What do the results suggest about “normal” body temperature?

98.6   98.0   98.4   98.4   98.4   98.6   98.6   98.0   99.0   98.4

· 16. Blood Alcohol and Driving. Below are measurements of blood alcohol concentration (BAC) for randomly selected drunk drivers involved in fatal crashes and then given jail sentences (based on data from the U.S. Department of Justice). Are these data consistent with laws stating that it is illegal to drive with a BAC level above 0.08?

0.17   0.24   0.16   0.16   0.27   0.17   0.16   0.13   0.24   0.29   0.14   0.12

· 17. Cell Phone Radiation. Listed below are measurements of specific absorption rate (SAR) of radiation (in W/kg) when these types of cell phones are held to the head: Samsung SGH-tss9, Blackberry Storm, Blackberry Curve, Motorola Moto, T-Mobile Sidekick, Sanyo Katana Eclipse, Palm Pre, Sony Ericsson, Nokia 6085, Apple iPhone 3G S, and Kyocero Neo E1100. (Data from the Environmental Working Group.) What additional data would you need to determine whether there is any danger from cell phone radiation?

0.38   0.55   1.54   1.55   0.50   0.60   0.92   0.96   1.00   0.86   1.46

· 18. Alphabetic States. The states of Alabama, Alaska, Arizona, Arkansas, California, Colorado, and Connecticut have the following areas in square miles (land and water), with the areas listed in the same order as the states:

52,200   615,200   114,000   53,200   158,900   104,100   5,500

· a. Find the mean area and median area for these states.

· b. Which state is an outlier on the high end? If you eliminate this state, what are the new mean and median areas for this data set?

· c. Which state is an outlier on the low end? If you eliminate this state, what are the new mean and median areas for this data set?

· 19. Outlier Coke. The contents of cans of regular Coca-Cola vary slightly in weight. Here are the measured weights of seven cans, in pounds:

0.8161   0.8194   0.8165   0.8176   0.7901   0.8143   0.8126

· a. Find the mean and median of these weights.

· b. Which, if any, of these weights would you consider to be an outlier? Explain.

· c. What are the mean and median weights if the outlier is excluded?

· 20. Raising Your Grade. Suppose you have scores of 80, 84, 87, and 89 on quizzes in a mathematics class.

· a. What is the mean of these scores?

· b. What score would you need on the fifth quiz to have an overall mean of 88?

· c. If the maximum score on a quiz is 100, is it possible to have a mean of 90 after the fifth quiz? Explain.

· 21. Raising Your Grade. Suppose you have scores of 60, 70, 65, 85, and 85 on exams in a sociology class.

· a. What is the mean of these scores?

· b. What score would you need on the next exam to have an overall mean of 75?

· c. If the maximum score on an exam is 100, what is the maximum mean score that you could possibly have after the next exam? Explain.

Comparing Data. In Exercises 22–25, find the mean and median for each of the two samples, then compare the two sets of results.

· 22. Speeding and Race. Listed below are speeds (in mi/h) of cars on the New Jersey Turnpike, where the speed limit is 65 mi/h. All cars are going in the same direction, and all of the cars are from New Jersey. The speeds were measured with a radar gun, and the researchers observed the races of the drivers. (The data are from Statlib and were published by Joseph Kadane and John Lamberth.) Does it appear that drivers of either race speed more than drivers of the other race?

White drivers: 74   77   69   71   77   69   72   75   74   72

African-American drivers: 79   70   71   76   76   74   71   75   74   74

· 23. Parking Meter Theft. Listed below are amounts (in millions of dollars) collected from parking meters by Brinks and others in New York City during similar time periods. A larger data set was used to convict five Brinks employees of grand larceny. (The data were provided by the attorney for New York City and are listed on the DASL website.) Do the limited data listed here show evidence of stealing by Brinks employees?

Collection contractor was Brinks: 1.3   1.5   1.3   1.5   1.4   1.7   1.8   1.7   1.7   1.6

Collection contractor was not Brinks: 2.2   1.9   1.5   1.6   1.5   1.7   1.9   1.6   1.6   1.8

· 24. Political Contributions. Listed below are randomly selected contributions (in dollars) made to the two presidential candidates in a recent election. All contributions are from the same ZIP code. (The data are from the Huffington Post.) Do the contributions appear to favor either candidate? What do you conclude after further learning that there were 66 contributions to the Democratic candidate and 20 contributions to the Republican?

Democrat: 275   452   300   1000   1000   500   100   1061   1200   235   875   2000   350   210   250

Republican: 50   75   240   302   250   700   350   500   1250   1500   500   500   40   221   400

· 25. Customer Waiting Times. Waiting times (in minutes) of customers at the Jefferson Valley Bank (where all customers enter a single waiting line) and the Bank of Providence (where customers wait in individual lines at three different teller windows) are listed below. Determine whether there is a difference between the two data sets that is not apparent from a comparison of the means and medians. If so, what is it?

Jefferson Valley (single line): 6.5   6.6   6.7   6.8   7.1   7.3   7.4   7.7   7.7   7.7

Providence (individual lines): 4.2   5.4   5.8   6.2   6.7   7.7   7.7   8.5   9.3   10.0

Weighted Mean. Compute the weighted means in Exercises 26–31.

· 26. Final Grade. Your course grade is based on one midterm that counts as 15% of your final grade, one class project that counts as 20% of your final grade, a set of homework assignments that counts as 40% of your final grade, and a final exam that counts as 25% of your final grade. Your midterm score is 75, your project score is 90, your homework score is 85, and your final exam score is 72. What is your overall final score?

· 27. Class Grade. Ryan is taking an advanced math class in which exams are worth 70% of the final grade, homework is worth 20%, and quizzes are worth 10%. On a 100-point scale, his exam scores average 89.5, his homework averages 94.1, and his quiz average is 85. What is his overall score for the class?

· 28. GPA. One common system for computing a grade point average (GPA) assigns 4 points to an A, 3 points to a B, 2 points to a C, 1 point to a D, and 0 points to an F. What is the GPA of a student who gets an A in a 4-credit course, a B in each of two 3-credit courses, and a C in a 1-credit course? Round to two decimal places.

· 29. GPA. In one semester, a student completed the 14 credits listed in the table below. Grades are weighted as follows: A = 4.0; A− = 3.7; B+ = 3.4; B = 3.0; B− = 2.7. Find the student’s GPA for the semester. Round to two decimal places.

3

3

B

A

Course

Credits

Grade

Math 108

3

A−

Comparative Religion 121

B+

Spanish 321

4

B−

Astronomy 111

Astronomy Lab 112

1

· 30. Stockholder Voting. A small company has four stockholders who vote on a new advertising campaign. The number of shares owned and the vote of each stockholder are given in the table below. Does the new campaign receive a majority of the vote by shares?

Stockholder    Shares owned    Vote
A    400    Y
B    300    N
C    200    N
D    100    N

· 31. Stockholder Voting. A small company has six stockholders who vote on opening a new store. The number of shares owned and the vote of each stockholder are given in the table below. Does the new store receive a majority of the vote by shares?

Stockholder

Shares owned

Vote

A

1000

Y

B

N

C

2000

Y

D

1000

N

E

500

Y

500

N

3000

F

Weighted Mean. In Exercises 32–35, find the mean of the data summarized in the frequency table by using the middle of each bin and the frequency for each bin. Also, compare the computed means to the actual means obtained using the original list of data values: (Exercise 32) 36.2 years; (Exercise 33) 44.1 years; (Exercise 34) 224.3 (1000 cells/µL); (Exercise 35) 255.1 (1000 cells/µL).
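The binned-mean procedure described above can be sketched in a few lines of Python. This is only an illustration: the frequency table is made up (it is not taken from any of the exercises), and the names `bins` and `binned_mean` are hypothetical. Each bin's midpoint stands in for all values in that bin, and the bin's frequency serves as its weight.

```python
# Hypothetical frequency table (illustrative values, not from any exercise):
# each entry is (bin lower limit, bin upper limit, frequency).
bins = [(0, 9, 2), (10, 19, 5), (20, 29, 3)]


def binned_mean(table):
    """Weighted mean: bin midpoints are the values, frequencies are the weights."""
    total = sum(freq for _, _, freq in table)
    weighted_sum = sum((low + high) / 2 * freq for low, high, freq in table)
    return weighted_sum / total


# Midpoints are 4.5, 14.5, and 24.5, so the mean is
# (4.5*2 + 14.5*5 + 24.5*3) / 10 = 15.5.
print(binned_mean(bins))  # 15.5
```

Because every value in a bin is replaced by the bin's midpoint, the result is generally close to, but not exactly equal to, the mean of the original data values.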

· 32.

Age (years) of female actor when Oscar was won    Frequency
20–29    31
30–39    34
40–49    14
50–59    2
60–69    6
70–79    1
80–89    1

· 33.

Age (years) of male actor when Oscar was won    Frequency
20–29    1
30–39    31
40–49    37
50–59    13
60–69    6
70–79    1

· 34.

Blood platelet count of males (1000 cells/μL)    Frequency
0–99    1
100–199    51
200–299    90
300–399    10
400–499    0
500–599    0
600–699    1

· 35.

Blood platelet count of females (1000 cells/μL)    Frequency
100–199    25
200–299    92
300–399    28
400–499    0
500–599    2

· 36. U.S. Population Center. Imagine taking a huge flat map of the United States and placing weights on it to represent where people live. The point at which the map would balance is called the mean center of population. Figure 4.2 shows how the location of the mean center of population has shifted from 1790 to 2010. Briefly explain the pattern shown on this map.

Figure 4.2

Source: Statistical Abstract of the United States.

 PROJECTS FOR THE INTERNET & BEYOND

· 37. Salary Data. Many websites offer data on salaries in different careers. Find salary data for a career you are considering. What are the mean and median salaries for this career? How do these salaries compare with those of other careers that interest you?

· 38. Is the Median the Message? Read the article “The Median Isn’t the Message,” by Stephen Jay Gould, which you can find by searching for its title or author on the Web. Write a few paragraphs in which you describe the message that Gould was trying to get across. How is this message important to other patients diagnosed with cancer?

· 39. Navel Data. Bin the data collected in Exercise 25 of Section 3.1. Then make a frequency table, and draw a histogram of the distribution. What is the mean of the distribution? What is the median of the distribution? An old theory says that, on average, the navel ratio of humans is the golden ratio: (1 + √5)/2. Does this theory seem accurate based on your observations?


IN THE NEWS

· 40. Daily Averages. Cite three examples of averages that you deal with in your own life (such as grade point average or batting average). In each case, explain whether the average is a mean, a median, or some other type of average. Briefly describe how the average is useful to you.

· 41. Averages in the News. Find three recent news articles that refer to some type of average. In each case, explain whether the average is a mean, a median, or some other type of average.

4.2 Shapes of Distributions

In the previous section, we discussed how to describe the center of a collection of quantitative data with measures such as the mean and median. We now turn our attention to the overall shape of a distribution, which we often describe with three characteristics that we’ll discuss in this section: its number of modes, its symmetry or skewness, and its variation. Because we are interested in the general shapes of distributions, it’s easier to focus on smooth curves that fit the data, rather than dealing with the actual data. Figure 4.3 shows three examples of this idea, two in which the distributions are shown as histograms and one in which the distribution is shown as a line chart. In each case, the smooth curves make good approximations to the original distributions.

Figure 4.3 The smooth curves approximate the shapes of the distributions.

Number of Modes

One simple way to describe the shape of a distribution is by its number of peaks, or modes. Figure 4.4 shows four common types of distributions characterized by different numbers of modes, and the following definitions box summarizes these four types. Note that, by convention, any peak in a distribution is considered a mode, even if not all peaks have the same height. For example, the distribution in Figure 4.4c is said to have two modes, even though the second peak is lower than the first.

Figure 4.4 Distributions with different numbers of modes.


Definitions

A uniform distribution has no mode because all data values have the same frequency.

A single-peaked (or unimodal) distribution has one mode.

A bimodal distribution has two modes.

A trimodal distribution has three modes.

Example 1 Number of Modes

How many modes would you expect for each of the following distributions? Why? Make a rough sketch for each distribution, with clearly labeled axes.

· a. Heights of 1000 randomly selected adult women

· b. Hours spent watching football on TV in January for 1000 randomly selected adult Americans

· c. Weekly sales throughout the year at a retail clothing store for children

· d. The numbers of people with particular last digits (0 through 9) in their Social Security numbers

SOLUTION

Figure 4.5 shows sketches of the distributions.

· a. The distribution of heights of women is single-peaked because many women are at or near the mean height, with fewer and fewer women at heights much greater or less than the mean.

· b. The distribution of times spent watching football on TV for 1000 randomly selected adult Americans is likely to be bimodal (two modes). One mode represents the mean watching time of big fans, and the other represents the mean watching time of more casual fans.

· c. The distribution of weekly sales throughout the year at a retail clothing store for children is likely to have several modes. For example, it will probably have a mode in spring for sales of summer clothing, a mode in late summer for back-to-school sales, and another mode in winter for holiday sales.

· d. The last digits of Social Security numbers are essentially random, so the number of people with each different last digit (0 through 9) should be about the same. That is, about 10% of all Social Security numbers end in 0, 10% end in 1, and so on. It is therefore a uniform distribution with no mode.

 


Figure 4.5 Sketches for Example 1.

Symmetry or Skewness

A second simple way to describe the shape of a distribution is in terms of its symmetry or skewness.

A distribution is symmetric if its left half is a mirror image of its right half.

The distributions in Figure 4.6 are all symmetric. The symmetric distribution in Figure 4.6a, with a single peak and a characteristic bell shape, is known as a normal distribution. This is the most important distribution in statistics, and we will devote Chapter 5 to it.


Figure 4.6 These distributions are all symmetric because their left halves are mirror images of their right halves. Note that (a) and (b) are single-peaked (unimodal), and (c) is triple-peaked (trimodal).

A distribution that is not symmetric must have values that tend to be more spread out on one side than on the other. In this case, we say that the distribution is skewed. Figure 4.7a shows a distribution in which the values are more spread out on the left, meaning that some values are outliers at low values. We say that such a distribution is left-skewed (or negatively skewed), because it looks as if it has a tail that has been pulled toward the left. Figure 4.7b shows a distribution in which the values are more spread out on a tail extending to the right, meaning that some values are outliers at high values. This distribution is right-skewed (or positively skewed).

 

Figure 4.7 (a) Skewed to the left: The mean and median are less than the mode. (b) Skewed to the right: The mean and median are greater than the mode. (c) Symmetric distribution (single-peaked): The mean, median, and mode are the same.

Figure 4.7 also shows how skewness affects the relative positions of the mean, median, and mode. By definition, the mode is at the peak in a single-peaked distribution. A left-skewed distribution pulls both the mean and median to the left of the mode, meaning to values less than the mode. In addition, outliers at the low end of the data set make the mean less than the median (see Table 4.2 on page 121). Similarly, a right-skewed distribution pulls the mean and median to the right of the mode, and the outliers at the high end of the data set make the mean greater than the median. When the distribution is symmetric and single-peaked, both the mean and the median are equal to the mode.
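The ordering of mean, median, and mode described above can be checked on a small data set. The sketch below uses made-up values (the lists `right_skewed` and `left_skewed` are hypothetical, chosen only so that a few outliers stretch one tail) and Python's standard `statistics` module.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data: most values are small, with a few high outliers.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 10, 20]
# Mirroring the values around 21 gives a left-skewed data set of the same shape.
left_skewed = [21 - x for x in right_skewed]

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(name, "mode:", mode(data), "median:", median(data),
          "mean:", round(mean(data), 2))
```

For the right-skewed list the mode (2) is less than the median (3), which is less than the mean (about 5.22). For the mirrored list the order reverses: the mean (about 15.78) is less than the median (18), which is less than the mode (19).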

Definitions
A distribution is symmetric if its left half is a mirror image of its right half.

A distribution is left-skewed if its values are more spread out on the left side.

A distribution is right-skewed if its values are more spread out on the right side.

Technical Note

A left-skewed distribution is also said to be negatively skewed, and a right-skewed distribution is also referred to as positively skewed. A symmetric distribution has zero skewness.


THINK ABOUT IT Which is a better measure of the “average” (or of the center of the distribution) for a skewed distribution: the median or the mean? Why?

By the Way

As of 2016, the median family income in the United States was about $54,000, while the mean family income was about $77,000.

Example 2 Skewness

For each of the following situations, state whether you expect the distribution to be symmetric, left-skewed, or right-skewed. Explain.

· a. Heights of a sample of 100 women

· b. Family incomes in the United States

· c. Speeds of cars on a road where a visible patrol car is using radar to detect speeders

SOLUTION

· a. The distribution of heights of women is symmetric, because roughly equal numbers of women are shorter and taller than the mean and extremes of height are rare on either side of the mean.

· b. The distribution of family incomes is right-skewed. Most families are middle-class, so the mode of this distribution is a middle-class income. But a few very high-income families pull the mean to a considerably higher value, stretching the distribution to the right (high-income) side.

· c. Drivers usually slow down when they are aware of a patrol car looking for speeders. Few if any drivers will be exceeding the speed limit, but some drivers tend to slow well below the speed limit. Therefore, the distribution of speeds is left-skewed, with a mode near the speed limit but a few cars going well below the speed limit.

By the Way

Speed kills. On average, in the United States, someone is killed in a car crash about every 16 minutes. About one-third of these fatalities involve driving at excessive speed.

THINK ABOUT IT In ordinary English, the term skewed is often used to mean something that is distorted or depicted in an unfair way. How is this use of skewed related to its meaning in statistics?

Variation

A third way to describe a distribution is by its variation, which is a measure of how much the data values are spread out. A distribution in which most data are clustered together has a low variation. As shown in Figure 4.8a, such a distribution has a fairly sharp peak. The variation is higher when the data are distributed more widely around the center, which makes the peak broader. Figure 4.8b shows a distribution with a moderate variation, and Figure 4.8c shows a distribution with a high variation. We’ll discuss methods for describing the variation quantitatively in the next section.

Figure 4.8 From left to right, these three distributions have increasing variation.

Definition

Variation describes how widely data are spread out from the center of a data set.


Example 3 Variation in Marathon Times

How would you expect the variation to differ between times in an Olympic marathon and times in the New York City marathon? Explain.

SOLUTION

An Olympic marathon involves only elite runners, whose times are likely to be clustered relatively near world-record times. The New York City marathon allows runners of all abilities, whose times are spread over a very wide range (from near the world record of just over two hours to many hours). Therefore, the variation among the times is expected to be greater in the New York City marathon than in an Olympic marathon.

Section 4.2 Exercises

Statistical Literacy and Critical Thinking

· 1. Modes. Distinguish between a uniform distribution and a distribution with one or more modes. What do we call distributions with one, two, and three modes?

· 2. Symmetry. What do we mean when we say that a distribution is symmetric? Can a symmetric distribution have more than one mode?

· 3. Skewness. What do we mean when we say that a distribution is skewed? Briefly describe the basic difference between a distribution that is skewed to the right and a distribution that is skewed to the left.

· 4. Variation. What do we mean by the variation of values in a set of data?

Does It Make Sense? For Exercises 5–8, determine whether the statement makes sense (or is clearly true) or does not make sense (or is clearly false). Explain clearly; not all of these statements have definitive answers, so your explanation is more important than your chosen answer.

· 5. Skewed. The distribution of grades in my class was left-skewed, but the mean, median, and mode were all the same.

· 6. Bimodal. The distribution of pedestrian speeds on a path used by both walkers and runners was found to be bimodal.

· 7. Income Variation. Our data show that incomes at age 35 for 1000 randomly selected adults have more variation than incomes at age 35 for 1000 randomly selected high school teachers.

· 8. Age Variation. There’s much more variation in the ages of the general population than in the ages of students in my college extension course.

Concepts and Applications

Distributions. In Exercises 9–12, describe the histogram in terms of symmetry and skewness.

· 9. Tornadoes. The histogram in Figure 4.9 shows the F-scale measurements of intensities of 490 tornadoes from recent years. (The F-scale runs from 0 to 5; a higher number indicates a stronger tornado.)

Figure 4.9

· 10. Arm Circumference. Figure 4.10 is a histogram of measured mid–upper arm circumferences of 300 randomly selected adults.

 
Figure 4.10

· 11. Lottery Numbers. Figure 4.11 is a histogram of digits drawn in California’s Daily Four lottery.

Figure 4.11

· 12. Old Faithful. The histogram in Figure 4.12 shows duration times of eruptions of the Old Faithful geyser in Yellowstone National Park.

Figure 4.12

Source: Hand et al., Handbook of Small Data Sets.

· 13. Baseball Salaries. In a recent year, the 868 professional baseball players had salaries with the following characteristics:

· • The mean was $4,214,614.

· • The median was $1,650,000.

· • The salaries ranged from a low of $507,500 to a high of $31,000,000.

· a. Describe the shape of the distribution of salaries. Is the distribution symmetric? Is it left-skewed? Is it right-skewed?

· b. About how many players had salaries of $1,650,000 or higher?

· 14. Boston Rainfall. The daily rainfall amounts (in inches) for Boston in a year consist of 365 values with these properties:

· • The mean daily rainfall amount is 0.083 inch.

· • The median of the daily rainfall amounts is 0 inches.

· • The minimum daily rainfall amount is 0 inches and the maximum is 1.48 inches.

· a. How is it possible that the minimum of the 365 values is 0 inches and the median is also 0 inches?

· b. Describe the distribution as symmetric, left-skewed, or right-skewed.

· c. Can you determine the exact number of days that it rained? Can you conclude anything about the number of days that it rained? Explain.

Describing Distributions. For each distribution described in Exercises 15–26, answer the following questions:

· a. How many modes would you expect for the distribution?

· b. Would you expect the distribution to be symmetric, left-skewed, or right-skewed?

· 15. Incomes. The annual incomes of all those in a statistics class, including the instructor

· 16. Braking Reaction Times. The braking reaction times of 500 randomly selected drivers, measured under standard conditions

· 17. Heights. The heights of 250 randomly selected female statistics students

· 18. Heights. The heights of 500 female students, half of whom are college students while the other half are second-grade students

· 19. Weights of AAA Batteries. The weights of 1000 AAA batteries of a single brand

· 20. Passenger Weights. The weights of 240 adult airline passengers, consisting of 80 females and 160 males

· 21. Patients. The ages of 1000 randomly selected patients being treated for dementia

· 22. Speeds. The speeds of drivers on a highway in Montana

· 23. Patron Ages. The ages of adults who visit the National Air and Space Museum

· 24. Ages at Death. The ages of adults at the time of their death

· 25. Net Worth. The net worth (in dollars) of female movie actors

· 26. Myocardial Infarctions. The ages of males when they were hospitalized because of myocardial infarctions (heart attacks)

 PROJECTS FOR THE INTERNET & BEYOND

· 27. New York Marathon. The website for the New York City marathon gives frequency data for finish times in the most recent marathon. Study the data, make a rough sketch of the distribution, and describe the shape of the distribution in words.

· 28. Tax Stats. The IRS website provides statistics collected from tax returns on income, refunds, and much more. Choose a set of statistics from this website and study the distribution. Describe the distribution in words, and discuss anything you learn that is relevant to national tax policies.

· 29. Social Security Data. Survey a sample of fellow students, asking each to indicate the last digit of her or his Social Security number. Also ask each participant to indicate the fifth digit. Draw one graph showing the distribution of the last digits and another graph showing the distribution of the fifth digits. Compare the two graphs. What notable difference becomes apparent?


IN THE NEWS

· 30. Distributions in the News. Find three recent examples in newspapers or news magazines of distributions shown as histograms or line charts. Over each distribution, draw a smooth curve that captures its general features. Then classify the distribution according to its number of modes, symmetry or skewness, and variation.

· 31. Trimodal Distribution. Give an example of a real distribution that you expect to have three modes. Make a rough sketch of the distribution; be sure to label the axes on your sketch.

· 32. Skewed Distribution. Give an example of a real distribution that you would expect to be either right- or left-skewed. Make a rough sketch of the distribution; be sure to label the axes on your sketch.

4.3 Measures of Variation

In Section 4.2, we saw how to describe variation qualitatively by looking at the shape of a distribution. In this section, we present quantitative measures of variation.

Why Variation Matters

Imagine customers waiting in line for tellers at two different banks. Customers at Big Bank can enter any one of three different lines leading to three different tellers. Best Bank also has three tellers, but all customers wait in a single line and are called to the next available teller. The following values are waiting times, in minutes, for 11 customers at each bank. The times are arranged in ascending order.

We mortals cross the ocean of this world,

Each in his average cabin of a life.

—Robert Browning

Big Bank (three lines): 4.1   5.2   5.6   6.2   6.7   7.2   7.7   7.7   8.5   9.3   11.0

Best Bank (one line): 6.6   6.7   6.7   6.9   7.1   7.2   7.3   7.4   7.7   7.8   7.8

You’ll probably find more unhappy customers at Big Bank than at Best Bank, but this is not because the average wait is any longer. In fact, you can verify for yourself that the mean and median waiting times are 7.2 minutes at both banks. The difference in customer satisfaction comes from the variation at the two banks. The waiting times at Big Bank vary over a fairly wide range, so a few customers have long waits and are likely to become annoyed. In contrast, the variation of the waiting times at Best Bank is small, so customers probably feel that they are being treated roughly equally. Figure 4.13 shows the difference in the two variations with histograms in which the data values are binned to the nearest minute.
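The claim that both banks share the same mean and median can be verified with a short calculation. The sketch below uses Python's standard `statistics` module; the variable names `big_bank` and `best_bank` are simply labels chosen here for the two lists of waiting times.

```python
from statistics import mean, median

# Waiting times in minutes for 11 customers at each bank.
big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

for name, times in [("Big Bank", big_bank), ("Best Bank", best_bank)]:
    # Round the mean to one decimal place to match the precision of the data.
    print(name, "mean:", round(mean(times), 1), "median:", median(times))
```

Both lines of output report a mean of 7.2 and a median of 7.2, so measures of center alone cannot distinguish the two banks.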

By the Way

The idea of waiting in line (or queuing) is important not only for people but also for data, particularly for data streaming through the Internet. Major corporations often employ statisticians to help them make sure that data move smoothly and without bottlenecks through their servers and to and from their websites.

THINK ABOUT IT Explain why Big Bank, with three separate lines, has a greater variation in waiting times than Best Bank. Then consider several places where you commonly wait in lines, such as a grocery store, a theme park ride, or a fast-food restaurant. Do these places use a single customer line that feeds multiple clerks or multiple lines? If multiple lines are used, do you think a single line would be better? Explain.


Figure 4.13 Histograms for the waiting times at Big Bank and Best Bank, shown with data binned to the nearest minute.


Range

The simplest way to describe the variation of a data set is to compute its range, defined as the difference between the highest (maximum) and lowest (minimum) values. For the example of the two banks, the waiting times for Big Bank vary from 4.1 to 11.0 minutes, so the range is 11.0 − 4.1 = 6.9 minutes. The waiting times for Best Bank vary from 6.6 to 7.8 minutes, so the range is 7.8 − 6.6 = 1.2 minutes. The range for Big Bank is much larger, reflecting its greater variation in waiting times.

Definition

The range of a set of data values is the difference between its highest and lowest data values:

range = highest value (max) − lowest value (min)

Although the range is easy to compute and can be useful, it occasionally can be misleading, as the next example shows.

Example 1 Misleading Range

Consider the following two sets of quiz scores for nine students. Which set has the greater range? Would you also say that this set has the greater variation?

Quiz 1: 1   10   10   10   10   10   10   10   10

Quiz 2: 2   3   4   5   6   7   8   9   10

SOLUTION

The range for Quiz 1 is 10 − 1 = 9 points, which is greater than the range for Quiz 2 of 10 − 2 = 8 points. However, aside from a single low score (an outlier), Quiz 1 has no variation at all because every other student got a 10. In contrast, no two students got the same score on Quiz 2, and the scores are spread throughout the list of possible scores. Quiz 2 therefore has greater variation in scores even though Quiz 1 has the greater range.
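The comparison in this example can be reproduced with a short Python sketch. The helper name `data_range` is hypothetical; it simply applies the definition of range (highest value minus lowest value) to the two sets of quiz scores.

```python
def data_range(values):
    """Range = highest value (max) - lowest value (min)."""
    return max(values) - min(values)


quiz_1 = [1, 10, 10, 10, 10, 10, 10, 10, 10]
quiz_2 = [2, 3, 4, 5, 6, 7, 8, 9, 10]

print(data_range(quiz_1))      # 9
print(data_range(quiz_2))      # 8
# Dropping the single low score shows that the rest of Quiz 1 does not vary at all.
print(data_range(quiz_1[1:]))  # 0
# Quiz 1 has only two distinct scores, while Quiz 2 has nine.
print(len(set(quiz_1)), len(set(quiz_2)))  # 2 9
```

The range depends only on the two extreme values, which is why a single outlier can make it misleading.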


Quartiles and the Five-Number Summary

A better way to describe variation is to consider a few intermediate data values in addition to the high and low values. A common way involves looking at the quartiles, or values that divide the data distribution into quarters. The following list repeats the waiting times (in minutes) at the two banks, with the quartiles shown in brackets. Note that the middle quartile, which divides the data set in half, is simply the median.

Big Bank: 4.1   5.2   [5.6]   6.2   6.7   [7.2]   7.7   7.7   [8.5]   9.3   11.0

Best Bank: 6.6   6.7   [6.7]   6.9   7.1   [7.2]   7.3   7.4   [7.7]   7.8   7.8

In each list, the bracketed values are, from left to right, the lower quartile (Q1), the median (Q2), and the upper quartile (Q3).

Definitions

The lower quartile (or first quartile, or $Q_1$) divides the lowest fourth of a data set from the upper three-fourths. It is the median of the data values in the lower half of a data set. (Exclude the middle value in the data set if the number of data points is odd.)

The middle quartile (or second quartile, or $Q_2$) is the median.

The upper quartile (or third quartile, or $Q_3$) divides the lowest three-fourths of a data set from the upper fourth. It is the median of the data values in the upper half of a data set. (Exclude the middle value in the data set if the number of data points is odd.)
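The definitions above can be sketched in Python as follows. The function name `textbook_quartiles` is hypothetical, and the sketch assumes the data set has at least two values; it follows the median-of-halves rule stated in the definitions rather than any particular software package.

```python
from statistics import median


def textbook_quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    The middle value is excluded from both halves when the
    number of data points is odd. Assumes at least two values.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    # Skip the middle value when n is odd.
    upper_half = data[half + 1:] if n % 2 else data[half:]
    return median(lower_half), median(data), median(upper_half)


# Waiting times (in minutes) from the two-bank example.
big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

print("Big Bank quartiles:", textbook_quartiles(big_bank))
print("Best Bank quartiles:", textbook_quartiles(best_bank))
```

With 11 values per bank, each half has 5 values, so the quartiles are the 3rd and 9th values in the sorted list.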

Technical Note

Statisticians do not universally agree on the procedure for calculating quartiles, and different procedures and different technologies can result in slightly different quartile values. For example, using the data from smokers in Example 2 (later in this section), Statdisk and the TI-83/84 Plus calculator provide a first quartile of 12.58, Minitab yields 12.2, and XLSTAT yields 12.868.
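To see how two procedures can disagree, the sketch below uses Python's standard-library function `statistics.quantiles`, which offers an "exclusive" and an "inclusive" interpolation method. Using the Big Bank waiting times (rather than the smokers data) is a choice made here for brevity, and no claim is made that either method reproduces a specific package named in the note.

```python
from statistics import quantiles

# Big Bank waiting times (in minutes).
big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]

# n=4 asks for the three cut points that split the data into quarters.
exclusive = quantiles(big_bank, n=4, method="exclusive")
inclusive = quantiles(big_bank, n=4, method="inclusive")

print("exclusive method:", [round(q, 2) for q in exclusive])
print("inclusive method:", [round(q, 2) for q in inclusive])
```

Both methods agree on the median, but they place the lower and upper quartiles at different positions between neighboring data values.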

Once we know the quartiles, we can describe a distribution with a five-number summary, consisting of the low value, the lower quartile, the median, the upper quartile, and the high value. For the waiting times (in minutes) at the two banks, the five-number summaries are as follows:

                   Big Bank   Best Bank
low                4.1        6.6
lower quartile     5.6        6.7
median             7.2        7.2
upper quartile     8.5        7.7
high               11.0       7.8

The Five-Number Summary

The five-number summary for a data distribution consists of the following five numbers:

lowest value (minimum)

lower quartile ($Q_1$)

median ($Q_2$)

upper quartile ($Q_3$)

highest value (maximum)
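A five-number summary can be assembled directly from these five pieces. The sketch below is one possible arrangement: the function name `five_number_summary` and the dictionary keys are illustrative, and the quartiles use the median-of-halves rule from this section (so results may differ slightly from software output).

```python
from statistics import median


def five_number_summary(values):
    """Return the five-number summary as a dict. Assumes at least two values."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    # Exclude the middle value from the upper half when n is odd.
    upper_half = data[half + 1:] if n % 2 else data[half:]
    return {
        "low": data[0],
        "lower quartile": median(lower_half),
        "median": median(data),
        "upper quartile": median(upper_half),
        "high": data[-1],
    }


big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

for name, times in (("Big Bank", big_bank), ("Best Bank", best_bank)):
    print(name, five_number_summary(times))
```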

We can display the five-number summary with a graph called a boxplot (or box-and-whisker plot). Using a number line for reference, we enclose the values from the lower to the upper quartiles in a box. We then draw a line through the box at the median and add two “whiskers,” extending from the box to the low and high values. Figure 4.14 shows boxplots for the bank waiting times. Both the box and the whiskers for Big Bank are broader than those for Best Bank, indicating that the waiting times have greater variation at Big Bank.

Figure 4.14 Boxplots show that the variation of the waiting times is greater at Big Bank than at Best Bank.

Drawing a Boxplot

· Step 1. Draw a number line that spans all the values in the data set.

· Step 2. Enclose the values from the lower quartile ($Q_1$) to the upper quartile ($Q_3$) in a box. (The thickness of the box has no meaning.)

· Step 3. Draw a line through the box at the median.

· Step 4. Add “whiskers” extending to the low and high values.

Technical Note

The boxplots shown in this text are called skeletal boxplots. Some boxplots are drawn with outliers marked by an asterisk (*) or a dot and the whiskers extending only to the smallest and largest data values that are not outliers; these types of boxplots are called modified boxplots.

 Using Technology: Quartiles

EXCEL The built-in function QUARTILE.INC (or QUARTILE) returns the quartile values for a range of sample data. This function uses the format =QUARTILE.INC(array,quart). For array, enter the range of cells containing the data; for quart, enter 1 for the first quartile, 2 for the second quartile (or median), and 3 for the third quartile. The screen shot below shows the process for the Big Bank data, with the functions shown in Column B and the results in Column C. Note that the quartile values provided by Excel for the Big Bank data differ slightly from the values obtained using the method described in this section.

Source: MS Excel 2016.

Alternatively, the XLSTAT add-in that is a supplement to this text can be used. Use the procedure detailed in Section 4.1 (click on XLSTAT, click on Describing Data, and then select Descriptive Statistics). The quartile values are provided in the results.

STATDISK Enter the data in the Sample Editor or open an existing data set. Next, click on Data and then select Explore Data – Descriptive Statistics. Select the data column to explore and then click Evaluate to get the various descriptive statistics, including the quartile values.

TI-83/84 PLUS Quartiles are included in the descriptive statistics obtained by using the procedure detailed in Section 4.1 (press STAT, select CALC, select 1-Var Stats, and enter the list name). Q1 indicates the first quartile, Med indicates the second quartile, and Q3 indicates the third quartile.

By the Way

Passive smoke is particularly harmful to young children, because the toxins in cigarette smoke have a greater effect on developing bodies than on full-grown adults. A similar effect is found for most other toxins, which is why it is especially important to limit children’s exposure to toxic chemicals.

Example 2 Passive and Active Smoke

One way to study exposure to cigarette smoke is by measuring blood levels of serum cotinine, a metabolic product of nicotine that the body absorbs from cigarette smoke. Table 4.3 lists serum cotinine levels from samples of 50 smokers (exposed to active smoke) and 50 nonsmokers who were exposed to cigarette smoke at home or at work (exposed to passive smoke). Compare the two data sets (smokers and nonsmokers) with five-number summaries and boxplots, and discuss your results.

TABLE 4.3 Serum Cotinine Levels (in nanograms per milliliter of blood, or ng/ml) in Samples of 50 Smokers and 50 Nonsmokers Exposed to Passive Smoke, with Data Values Listed in Ascending Order

Order number   Smokers   Nonsmokers
 1               0.08      0.03
 2               0.14      0.07
 3               0.27      0.08
 4               0.44      0.08
 5               0.51      0.09
 6               1.78      0.09
 7               2.55      0.10
 8               3.03      0.11
 9               3.44      0.12
10               4.98      0.12
11               6.87      0.14
12              11.12      0.17
13              12.58      0.20
14              13.73      0.23
15              14.42      0.27
16              18.22      0.28
17              19.28      0.30
18              20.16      0.33
19              23.67      0.37
20              25.00      0.38
21              25.39      0.44
22              29.41      0.49
23              30.71      0.51
24              32.54      0.51
25              32.56      0.68
26              34.21      0.82
27              36.73      0.97
28              37.73      1.12
29              39.48      1.23
30              48.58      1.37
31              51.21      1.40
32              56.74      1.67
33              58.69      1.98
34              72.37      2.33
35             104.54      2.42
36             114.49      2.66
37             145.43      2.87
38             187.34      3.13
39             226.82      3.54
40             267.83      3.76
41             328.46      4.58
42             388.74      5.31
43             405.28      6.20
44             415.38      7.14
45             417.82      7.25
46             539.62     10.23
47             592.79     10.83
48             688.36     17.11
49             692.51     37.44
50             983.41     61.33

Note: The column called “Order number” is included to make it easier to read the table.

Source: National Health and Nutrition Examination Survey, National Institutes of Health.

 Using Technology: Boxplots

EXCEL Although Excel itself is not designed to generate a boxplot, you can generate one using XLSTAT, the Excel add-in that is a supplement to this text. Load XLSTAT, then enter or copy the data into a column of the spreadsheet. Click on XLSTAT, click on Visualizing Data, and then select Univariate plots. Select Quantitative Data and enter the range of cells containing the data, such as A1:A11. (If the first cell includes the name of the data, click on the box next to “Sample labels.”) Click OK to continue. The result will be a boxplot with two features: (1) The boxplot will be vertical, and (2) the exact values of the quartiles are likely to be somewhat different from those found using the procedure described above.

STATDISK Enter the data in the Sample Editor, then click on Data, followed by Boxplot. Select the columns that you want to include, then click on Boxplot (or Modified Boxplot).

TI-83/84 PLUS Enter the sample data in list L1 or enter the data and assign them to a list name. Now select STAT PLOTS by pressing 2ND and then Y=. Press ENTER to select Plot1, then select the option ON. For a simple boxplot, as described earlier in this section, select the boxplot type on the right; for a modified boxplot, as described in the previous Technical Note, select the boxplot type on the left. For Xlist, enter the list name (L1); for Freq, enter the value 1. Now press ZOOM and 9 for ZoomStat and the boxplot should be displayed. Press TRACE and use < and > to move left or right to view the values of the five-number summary.

SOLUTION

The two data sets are already in ascending order, making it easy to construct the five-number summary. Each has 50 data points, so the median lies halfway between the 25th and 26th values. For the smokers, the 25th and 26th values are 32.56 and 34.21, respectively, so the median is

$$\frac{32.56 + 34.21}{2} = 33.385$$

For the nonsmokers, the 25th and 26th values are 0.68 and 0.82, respectively, so the median is

$$\frac{0.68 + 0.82}{2} = 0.75$$

The lower quartile is the median of the lower half of the values, which is the 13th value in each set. The upper quartile is the median of the upper half of the values, which is the 38th value in each set. The five-number summaries for the two data sets are as follows:

                   Smokers          Nonsmokers
low                0.08 ng/ml       0.03 ng/ml
lower quartile     12.58 ng/ml      0.20 ng/ml
median             33.385 ng/ml     0.75 ng/ml
upper quartile     187.34 ng/ml     3.13 ng/ml
high               983.41 ng/ml     61.33 ng/ml

Figure 4.15 shows boxplots for the two data sets. The boxplots make it easy to see some key features of the data sets. For example, it is immediately clear that the smokers have a higher median level of serum cotinine, as well as a greater variation in levels. We conclude that smokers absorb considerably more nicotine than do nonsmokers exposed to passive smoke. Nevertheless, the levels in the passive smokers are much higher than those found in people who had no exposure to cigarette smoke (as demonstrated by other data, not shown here). Indeed, the nonsmoker with the highest value for serum cotinine has absorbed more nicotine than the median smoker. We conclude that passive smoke can expose nonsmokers to significant amounts of nicotine. Given the known dangers of cigarette smoke, these results give us reason to be concerned about possible health effects from passive smoke.

Figure 4.15 Boxplots for the d

Percentiles

Quartiles divide a data set into 4 segments. It is possible to divide a data set even more. For example, quintiles divide a data set into 5 segments, and deciles divide a data set into 10 segments. It is particularly common to divide data sets into 100 segments using percentiles. Roughly speaking, the 35th percentile, for example, is a value that separates the bottom 35% of the data values from the top 65%. (More precisely, the 35th percentile is greater than or equal to at least 35% of the data values and less than or equal to at least 65% of the data values.)

Technical Note

As with quartiles, statisticians and various statistics software packages may use slightly different procedures to calculate percentiles, resulting in slightly different values.

If a data value lies between two percentiles, it is common to say that the data value lies in the lower percentile. For example, if you score higher than 84.7% of all people taking a college entrance examination, we say that your score is in the 84th percentile.

Definition

The nth percentile of a data set divides the bottom n% of data values from the top $(100 - n)\%$. A data value that lies between two percentiles is often said to lie in the lower percentile. You can approximate the percentile of any data value with the following formula:

$$\text{percentile of data value} = \frac{\text{number of values less than this data value}}{\text{total number of values in data set}} \times 100$$

There are different procedures for finding a data value corresponding to a given percentile, but one approximate approach is to find the Lth value, where L is the product of the percentile (in decimal form) and the sample size. For example, with 50 sample values, the 12th percentile is around the $0.12 \times 50 = 6$th value.

Example 3 Smoke Exposure Percentiles

Answer the following questions concerning the data in Table 4.3.

· a. What is the percentile for the data value of 104.54 ng/ml for smokers?

· b. What is the percentile for the data value of 61.33 ng/ml for nonsmokers?

· c. What data value marks the 36th percentile for the smokers? For the nonsmokers?

SOLUTION

The following results are approximate.

· a. The data value of 104.54 ng/ml for smokers is the 35th data value in the set, which means that 34 data values lie below it. Thus, its percentile is

$$\frac{\text{number of values less than 104.54 ng/ml}}{\text{total number of values in data set}} \times 100 = \frac{34}{50} \times 100 = 68$$

In other words, the 35th data value marks the 68th percentile.

· b. The data value of 61.33 ng/ml for nonsmokers is the 50th and highest data value in the set, which means that 49 data values lie below it. Thus, its percentile is

$$\frac{\text{number of values less than 61.33 ng/ml}}{\text{total number of values in data set}} \times 100 = \frac{49}{50} \times 100 = 98$$

In other words, the highest data value in this set lies in the 98th percentile.

· c. Because there are 50 data values in the set, the 36th percentile is around the $0.36 \times 50 = 18$th value. For smokers, this value is 20.16 ng/ml; for nonsmokers, it is 0.33 ng/ml.
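The approximate percentile calculations in this example can be reproduced with a brief Python sketch. The function names are hypothetical, the smokers list is transcribed from Table 4.3, and rounding L to the nearest whole position is an assumption of this sketch (the text only says the percentile is "around" the Lth value).

```python
# Smokers' serum cotinine levels (ng/ml), transcribed from Table 4.3.
smokers = [
    0.08, 0.14, 0.27, 0.44, 0.51, 1.78, 2.55, 3.03, 3.44, 4.98,
    6.87, 11.12, 12.58, 13.73, 14.42, 18.22, 19.28, 20.16, 23.67, 25.00,
    25.39, 29.41, 30.71, 32.54, 32.56, 34.21, 36.73, 37.73, 39.48, 48.58,
    51.21, 56.74, 58.69, 72.37, 104.54, 114.49, 145.43, 187.34, 226.82, 267.83,
    328.46, 388.74, 405.28, 415.38, 417.82, 539.62, 592.79, 688.36, 692.51, 983.41,
]


def percentile_of_value(values, x):
    """Approximate percentile: share of values strictly less than x, times 100."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


def value_at_percentile(values, p):
    """Approximate the data value at the p-th percentile (p given as 0-100).

    Uses L = (p / 100) * sample size and returns the L-th sorted value.
    Rounding L and clamping it to 1..n are assumptions of this sketch.
    """
    data = sorted(values)
    position = round(p * len(data) / 100)
    position = max(1, min(position, len(data)))
    return data[position - 1]


print("Percentile of 104.54 ng/ml:", percentile_of_value(smokers, 104.54))
print("Value at the 36th percentile:", value_at_percentile(smokers, 36))
```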

Standard Deviation

The five-number summary characterizes variation well, but there are advantages to describing variation with a single number. The single number most commonly used to describe variation is called the standard deviation.

The standard deviation is a measure of how widely data values are spread around the mean of a data set. To calculate a standard deviation, we first find the mean and then find how much each data value “deviates” from the mean. Consider the data sets of waiting times at banks, for which the mean waiting time was 7.2 minutes for both Big Bank and Best Bank. For a waiting time of 8.2 minutes, the deviation from the mean is equal to $8.2 - 7.2 = 1.0$ minute, meaning that it is 1.0 minute greater than the mean. For a waiting time of 5.2 minutes, the deviation from the mean is equal to $5.2 - 7.2 = -2.0$ minutes (negative 2 minutes), because it is 2.0 minutes less than the mean.

In essence, the standard deviation is a measure of the average of all the deviations from the mean. However, if we simply computed the mean of all the deviations, the result would always be zero, because the positive deviations exactly balance the negative deviations. Therefore, we instead use the procedure in the following box, in which we make all the values positive by first squaring all the deviations (because squares are always positive or zero).

Technical Note

The standard deviation formula given here, which we use throughout this text, is technically valid only for data from samples. (When dealing with populations, we do not subtract the 1 in Step 4.) We divide by $n - 1$ because that is the number of independent sample values; we say that the sample values have $n - 1$ degrees of freedom, because we can freely assign the first $n - 1$ values but are then left with only one choice for the final value.

Calculating the Standard Deviation

To calculate the standard deviation for any data set:

· Step 1. Compute the mean of the data set. Then find the deviation from the mean for every data value by subtracting the mean from the data value. That is, for every data value,

$$\text{deviation from mean} = \text{data value} - \text{mean}$$

· Step 2. Find the squares of all the deviations from the mean.

· Step 3. Add all the squares of the deviations from the mean.

· Step 4. Divide this sum by the total number of data values minus 1.

· Step 5. The standard deviation is the square root of this quotient.

Overall, these steps can be summarized in the following standard deviation formula:

$$\text{standard deviation} = \sqrt{\frac{\text{sum of (deviations from the mean)}^2}{\text{total number of data values} - 1}}$$

Note that, because we square the deviations in Step 2 and then take the square root in Step 5, the units of the standard deviation are the same as the units of the data values. For example, if the data values have units of minutes, the standard deviation also has units of minutes. (The result of Step 4 is called the variance of the distribution. It is the square of the standard deviation and therefore has units that are squares of the units used for the original data; for example, if the original data are in meters, the variance will be in units of $\text{meters}^2$. Although the variance is used in many advanced statistical computations, we will not use it in this text.)

The standard deviation formula is easy to use in principle, but the calculations become tedious for all but the smallest data sets. As a result, the standard deviation is usually calculated with the aid of a calculator or computer. Nevertheless, you’ll find the standard deviation formula easier to understand if you try a few examples in which you work through the calculations in detail.

Example 4 Calculating Standard Deviation

Calculate the standard deviations for the waiting times at Big Bank and Best Bank.

SOLUTION

We follow the five steps to calculate the standard deviations. Table 4.4 shows how to organize the work in the first three steps. The first column for each bank lists the waiting times (in minutes). The second column lists the deviations from the mean (Step 1), which we already know to be 7.2 minutes for both banks. The third column lists the squares of the deviations (Step 2). We add all the squared deviations to find the sum at the bottom of the third column (Step 3). For Step 4, we divide the sums from Step 3 by the total number of data values minus 1. Because there are 11 data values, we divide by 10:

$$\begin{aligned}
\text{Big Bank: } \frac{38.46}{10} &= 3.846 \text{ minutes}^2 \\
\text{Best Bank: } \frac{1.98}{10} &= 0.198 \text{ minutes}^2
\end{aligned}$$

Finally, Step 5 tells us that the standard deviations are the square roots of the numbers from Step 4:

$$\begin{aligned}
\text{Big Bank: standard deviation} &= \sqrt{3.846} \approx 1.96 \text{ minutes} \\
\text{Best Bank: standard deviation} &= \sqrt{0.198} \approx 0.44 \text{ minute}
\end{aligned}$$

We conclude that the standard deviation of the waiting times is about 1.96 minutes at Big Bank and 0.44 minute at Best Bank. As we expected, the waiting times showed greater variation at Big Bank, which is why the lines at Big Bank likely annoyed more customers than did those at Best Bank.
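The five steps can also be traced in a short Python sketch. The function name `sample_standard_deviation` is an illustrative choice; the code follows the sample formula used in this section (dividing by the number of values minus 1) and is meant as a check on the hand calculation, not a replacement for it.

```python
from math import sqrt


def sample_standard_deviation(values):
    """Follow Steps 1-5 for the sample standard deviation."""
    mean = sum(values) / len(values)
    deviations = [x - mean for x in values]        # Step 1: deviations from the mean
    squares = [d ** 2 for d in deviations]         # Step 2: square each deviation
    total = sum(squares)                           # Step 3: add the squares
    variance = total / (len(values) - 1)           # Step 4: divide by n - 1
    return sqrt(variance)                          # Step 5: take the square root


# Waiting times (in minutes) at the two banks.
big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

print("Big Bank:", round(sample_standard_deviation(big_bank), 2), "minutes")
print("Best Bank:", round(sample_standard_deviation(best_bank), 2), "minutes")
```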

TABLE 4.4 Calculating Standard Deviation

Big Bank
Time    Deviation (time − mean)    (Deviation)²
4.1     4.1 − 7.2 = −3.1           (−3.1)² = 9.61
5.2     5.2 − 7.2 = −2.0           (−2.0)² = 4.00
5.6     5.6 − 7.2 = −1.6           (−1.6)² = 2.56
6.2     6.2 − 7.2 = −1.0           (−1.0)² = 1.00
6.7     6.7 − 7.2 = −0.5           (−0.5)² = 0.25
7.2     7.2 − 7.2 = 0.0            (0.0)² = 0.0
7.7     7.7 − 7.2 = 0.5            (0.5)² = 0.25
7.7     7.7 − 7.2 = 0.5            (0.5)² = 0.25
8.5     8.5 − 7.2 = 1.3            (1.3)² = 1.69
9.3     9.3 − 7.2 = 2.1            (2.1)² = 4.41
11.0    11.0 − 7.2 = 3.8           (3.8)² = 14.44
                                   Sum = 38.46

Best Bank
Time    Deviation (time − mean)    (Deviation)²
6.6     6.6 − 7.2 = −0.6           (−0.6)² = 0.36
6.7     6.7 − 7.2 = −0.5           (−0.5)² = 0.25
6.7     6.7 − 7.2 = −0.5           (−0.5)² = 0.25
6.9     6.9 − 7.2 = −0.3           (−0.3)² = 0.09
7.1     7.1 − 7.2 = −0.1           (−0.1)² = 0.01
7.2     7.2 − 7.2 = 0.0            (0.0)² = 0.0
7.3     7.3 − 7.2 = 0.1            (0.1)² = 0.01
7.4     7.4 − 7.2 = 0.2            (0.2)² = 0.04
7.7     7.7 − 7.2 = 0.5            (0.5)² = 0.25
7.8     7.8 − 7.2 = 0.6            (0.6)² = 0.36
7.8     7.8 − 7.2 = 0.6            (0.6)² = 0.36
                                   Sum = 1.98

THINK ABOUT IT Look closely at the individual deviations in Table 4.4 in Example 4. Do the standard deviations for the two data sets seem like reasonable “averages” for the deviations? Explain.

 Using Technology: Standard Deviation

EXCEL The built-in Excel function STDEV automates the calculation of standard deviation, so that all you have to do is enter the data and then use the function. The screen shot below shows the process for the Big Bank data, with the functions shown in Column E and the results in Column F.

Source: MS Excel 2013.

Note: Alternatively, the procedures described in Section 4.1 using the Data Analysis Toolpak or XLSTAT will give results that include the value of the standard deviation.

STATDISK or TI 83/84 Use the procedures described in Section 4.1, and the results will include the value of the standard deviation.

Interpreting the Standard Deviation

A good way to develop a deeper understanding of the standard deviation is to consider an approximation called the range rule of thumb, summarized in the following box.

Technical Note

Another way of interpreting the standard deviation uses a mathematical rule called Chebyshev’s theorem. It states that, for any data distribution, at least 75% of all data values lie within two standard deviations of the mean, and at least 89% of all data values lie within three standard deviations of the mean.
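For reference, both percentages follow from the general statement of Chebyshev’s theorem, sketched here in standard notation. The symbol $k$, for the number of standard deviations, is introduced only for this illustration and is not used elsewhere in this text.

```latex
\[
\begin{aligned}
\text{proportion of values within } k \text{ standard deviations of the mean}
  &\ge 1 - \frac{1}{k^{2}}, \qquad k > 1 \\
k = 2:\quad 1 - \frac{1}{2^{2}} &= \frac{3}{4} = 75\% \\
k = 3:\quad 1 - \frac{1}{3^{2}} &= \frac{8}{9} \approx 89\%
\end{aligned}
\]
```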

The Range Rule of Thumb

The standard deviation is approximately related to the range of a distribution by the range rule of thumb:

$$\text{standard deviation} \approx \frac{\text{range}}{4}$$

If we know the range of a distribution ($\text{range} = \text{high} - \text{low}$), we can use this rule to estimate the standard deviation. Alternatively, if we know the standard deviation, we can use this rule to estimate the low and high values as follows:

$$\begin{aligned}
\text{low value} &\approx \text{mean} - (2 \times \text{standard deviation}) \\
\text{high value} &\approx \text{mean} + (2 \times \text{standard deviation})
\end{aligned}$$

The range rule of thumb does not work well when the high or low values are outliers.

The range rule of thumb works reasonably well for data sets in which values are distributed fairly evenly. It does not work well when the high or low values are extreme outliers. You must therefore use judgment in deciding whether the range rule of thumb is applicable in a particular case, and in all cases remember that the range rule of thumb yields rough approximations, not exact results.

Example 5 Using the Range Rule of Thumb

Use the range rule of thumb to estimate the standard deviations for the waiting times at Big Bank and Best Bank. Compare the estimates to the actual values found in Example 4.

SOLUTION

The waiting times for Big Bank vary from 4.1 to 11.0 minutes, which means a range of $11.0 - 4.1 = 6.9$ minutes. The waiting times for Best Bank vary from 6.6 to 7.8 minutes, for a range of $7.8 - 6.6 = 1.2$ minutes. Thus, the range rule of thumb gives the following estimates for the standard deviations:

$$\begin{aligned}
\text{Big Bank: standard deviation} &\approx \frac{6.9}{4} \approx 1.7 \text{ minutes} \\
\text{Best Bank: standard deviation} &\approx \frac{1.2}{4} = 0.3 \text{ minute}
\end{aligned}$$

The actual standard deviations calculated in Example 4 are 1.96 and 0.44, respectively. For these two cases, the estimates from the range rule of thumb slightly underestimate the actual standard deviations. Nevertheless, the estimates put us in the right ballpark, showing that the rule is useful.
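The comparison in this example can be scripted as a quick check. In the sketch below, `range_rule_estimate` is a hypothetical helper name, and `statistics.stdev` from the Python standard library supplies the actual sample standard deviation.

```python
from statistics import stdev


def range_rule_estimate(values):
    """Range rule of thumb: standard deviation is roughly range / 4."""
    return (max(values) - min(values)) / 4


big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

for name, times in (("Big Bank", big_bank), ("Best Bank", best_bank)):
    estimate = range_rule_estimate(times)
    actual = stdev(times)
    print(f"{name}: estimate {estimate:.2f}, actual {actual:.2f} minutes")
```

For both banks the estimate falls a little below the computed value, in line with the discussion above.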

By the Way

Technologies such as catalytic converters have helped reduce the amounts of many pollutants emitted by cars (per mile driven), but burning less gasoline is the only way to reduce the carbon dioxide emissions that are responsible for global warming. This is a major reason why auto manufacturers are developing high-mileage hybrid vehicles and zero-emission vehicles that run on electricity or fuel cells.

Example 6 Estimating a Range

Studies of the gas mileage of a Prius under varying driving conditions show that it gets a mean of 45 miles per gallon with a standard deviation of 4 miles per gallon. Estimate the minimum and maximum typical gas mileage amounts that a Prius owner can expect under ordinary driving conditions.

SOLUTION

From the range rule of thumb, the low and high values for gas mileage are approximately

$$\begin{aligned}
\text{low value} &\approx \text{mean} - (2 \times \text{standard deviation}) = 45 - (2 \times 4) = 37 \\
\text{high value} &\approx \text{mean} + (2 \times \text{standard deviation}) = 45 + (2 \times 4) = 53
\end{aligned}$$

The range of gas mileage for the car extends roughly from a minimum of 37 miles per gallon to a maximum of 53 miles per gallon.
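The same estimate can be written as a two-line calculation. The function name `typical_range` is an assumption made for this sketch; it simply applies the range rule of thumb in the reverse direction.

```python
def typical_range(mean, standard_deviation):
    """Estimate typical low and high values as mean -/+ 2 standard deviations."""
    low = mean - 2 * standard_deviation
    high = mean + 2 * standard_deviation
    return low, high


# Gas mileage figures from Example 6 (miles per gallon).
low, high = typical_range(45, 4)
print(f"Typical mileage: about {low} to {high} miles per gallon")
```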

Standard Deviation with Summation Notation (Optional Subsection)

The summation notation introduced earlier in this chapter makes it easy to write the standard deviation formula in a compact form. Recall that \(x\) represents the individual values in a data set and \(\bar{x}\) represents the mean of the data set. We can therefore write the deviation from the mean for any data value as

\[\text{deviation} = \text{data value} - \text{mean} = x - \bar{x}\]

We can now write the sum of all squared deviations as

\[\text{sum of all squared deviations} = \sum (x - \bar{x})^2\]

The remaining steps in the calculation of the standard deviation are to divide this sum by \(n - 1\) and then take the square root. You should confirm that the following formula summarizes the five steps in the earlier box:

\[\text{standard deviation: } s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}\]

Technical Note

The formula for the variance is

\[s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}\]

The standard symbol for the variance, \(s^2\), reflects the fact that it is the square of the standard deviation.

The symbol \(s\) is the conventional symbol for the standard deviation of a sample. For the standard deviation of a population, statisticians use the Greek letter \(\sigma\) (lowercase sigma), and the term \(n - 1\) in the formula is replaced by \(N\) (the population size). Consequently, you will get slightly different results for the standard deviation depending on whether the data represent a sample or a population.
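
The sample and population formulas above can be compared side by side in a short Python sketch. The function names and the data set are hypothetical, chosen only so the arithmetic is easy to follow; the final two lines check the results against Python's standard `statistics` module.

```python
import math
import statistics


def sum_of_squared_deviations(values):
    """Return the sum of (x - mean)^2 over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def sample_standard_deviation(values):
    """Sample standard deviation s: divide by n - 1, then take the square root."""
    return math.sqrt(sum_of_squared_deviations(values) / (len(values) - 1))


def population_standard_deviation(values):
    """Population standard deviation sigma: divide by N instead of n - 1."""
    return math.sqrt(sum_of_squared_deviations(values) / len(values))


# Hypothetical data (not from the text); the mean is 5 and the
# squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s = sample_standard_deviation(data)
sigma = population_standard_deviation(data)
print(round(s, 3), round(sigma, 3))  # 2.138 2.0

# The standard library should agree with the hand-written versions.
assert math.isclose(s, statistics.stdev(data))
assert math.isclose(sigma, statistics.pstdev(data))
```

The sample value is slightly larger because dividing by \(n - 1\) rather than \(N\) makes the quotient under the square root larger.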

Section 4.3 Exercises

Statistical Literacy and Critical Thinking

· 1. Variation Matters. Consider two grocery stores at which the mean waiting time in line is the same but the variation of waiting times is different. At which store would you expect the customers to have more complaints about the waiting time? Explain.

· 2. Variation Measures. Briefly distinguish between the following measures of variation: range, five-number summary, and standard deviation. Which one(s) can be represented with a boxplot?

· 3. Quartiles and Percentiles. Briefly describe how you find quartiles and percentiles for a data set.

· 4. Standard Deviation. Describe the process of calculating a standard deviation. What is the standard deviation if all of the sample values are the same?

Does It Make Sense? For Exercises 5–8, determine whether the statement makes sense (or is clearly true) or does not make sense (or is clearly false). Explain clearly; not all of these statements have definitive answers, so your explanation is more important than your chosen answer.

· 5. Baseball Salaries. For a recent year, baseball salaries had a median of $1,650,000 and a first quartile \((Q_1)\) of $1,675,000.

· 6. Range versus Standard Deviation. I examined the data carefully, and the standard deviation was greater than the range.

· 7. Heights. The standard deviation for the heights of 5-year-old children is smaller than the standard deviation for the heights of children who range in age from 3 to 15.

· 8. Exam Results. The scores on the statistics exam had the following results: range = 40; high = 98; low = 58; median = 81; standard deviation = 8.

Concepts and Applications

Range and Standard Deviation. Exercises 9–16 each provide a set of numbers. In each case, find the range and standard deviation. (The same data sets were used in Exercises 10–17 in Section 4.1.)

· 9. Top 5 Celebrity Incomes. Listed below are the earnings (in millions of dollars) of the celebrities with the five highest incomes in a recent year. The celebrities in order are Steven Spielberg, Howard Stern, George Lucas, Oprah Winfrey, and Jerry Seinfeld.

332   302   235   225   100

· 10. Car Crash Test Measurements. Listed below are measurements of the “head injury criterion” (hic) for the following seven small cars tested in crashes by the National Highway Traffic Safety Administration: Chevrolet Aveo, Honda Civic, Volvo S40, VW Jetta, Hyundai Elantra, Kia Rio, and Kia Spectra. Higher numbers are associated with a higher risk of injury.

371   356   393   544   326   520   501

· 11. Lead in Medicine. Ayurveda is a traditional medical system commonly used in India. Listed below are the lead concentrations (in μg/g) measured in different Ayurveda medicines (manufactured in the United States). (Data from “Lead, Mercury, and Arsenic in U.S. and Indian Manufactured Ayurvedic Medicines Sold via the Internet,” by Saper et al., Journal of the American Medical Association, Vol. 300, No. 8.)

3.0   6.5   6.0   5.5   20.5   7.5   12.0   20.5   11.5   17.5

· 12. Saints in Super Bowl. Listed below are the numbers on the jerseys of the starting lineup for the New Orleans Saints when they won their first Super Bowl football game. Does it make sense to compute the range and standard deviation for these data?

9   23   25   88   12   19   74   77   76   73   78

· 13. Perception of Time. The following times (in seconds) were recorded when statistics students participated in an experiment to test their ability to determine when 1 minute (60 seconds) had passed:

53   52   75   62   68   58   49   49

· 14. Body Temperatures. The following are the body temperatures (in °F) of randomly selected normal, healthy adults:

98.6   98.6   98.0   98.0   99.0   98.4   98.4   98.4   98.4   98.6

· 15. Blood Alcohol and Driving. Listed below are measurements of blood alcohol concentration (BAC) of drivers involved in fatal crashes and then given jail sentences (based on data from the U.S. Department of Justice):

0.27   0.17   0.17   0.16   0.13   0.24   0.29   0.24   0.14   0.16   0.12   0.16

· 16. Cell Phone Radiation. Listed below are measurements of “specific absorption rate” (SAR) of radiation (in W/kg) when these types of cell phones are held to the head: Samsung SGH-tss9, Blackberry Storm, Blackberry Curve, Motorola Moto, T-Mobile Sidekick, Sanyo Katana Eclipse, Palm Pre, Sony Ericsson, Nokia 6085, Apple iPhone 3G S, and Kyocero Neo E1100. (Data from the Environmental Working Group.)

0.38   0.55   1.54   1.55   0.50   0.60   0.92   0.96   1.00   0.86   1.46

Comparing Variation. In Exercises 17–20, find the range and standard deviation for each of the two samples and then compare the two sets of results. (The same data sets were used in Exercises 22–25 in Section 4.1.)

· 17. Speeding and Race. Listed below are speeds (in mi/h) of cars on the New Jersey Turnpike. All cars are going in the same direction, and all of the cars are from New Jersey. The speeds were measured with a radar gun, and the researchers observed the races of the drivers. (The data are from Statlib and were published by Joseph Kadane and John Lamberth.)

White driver: 74   77   69   71   77   69   72   75   74   72

African-American driver: 79   70   71   76   76   74   71   75   74   74

· 18. Parking Meter Theft. Listed below are amounts (in millions of dollars) collected from parking meters by Brinks and others in New York City during similar time periods. A larger data set was used to convict five Brinks employees of grand larceny. (The data were provided by the attorney for New York City and are listed on the DASL website.)

Collection contractor was Brinks: 1.3   1.5   1.3   1.5   1.4   1.7   1.8   1.7   1.7   1.6

Collection contractor was not Brinks: 2.2   1.9   1.5   1.6   1.5   1.7   1.9   1.6   1.6   1.8

· 19. Political Contributions. Listed below are contributions (in dollars) made to the two presidential candidates in a recent election. All contributions are from the same ZIP code. (The data are from the Huffington Post.)

Democrat: 275   452   300   1000   1000   500   100   1061   1200   235   875   2000   350   210   250

Republican: 50   75   240   302   250   700   350   500   1250   1500   500   500   40   221   400

· 20. Customer Waiting Times. Waiting times (in minutes) of customers at the Jefferson Valley Bank (where all customers enter a single waiting line) and the Bank of Providence (where customers wait in individual lines at three different teller windows) are listed below.

Jefferson Valley (single line): 6.5   6.6   6.7   6.8   7.1   7.3   7.4   7.7   7.7   7.7

Providence (individual lines): 4.2   5.4   5.8   6.2   6.7   7.7   7.7   8.5   9.3   10.0

· 21. Calculating Percentiles. A statistics professor with too much time on his hands weighed each M&M candy in a bag of 465 plain M&M candies.

· a. One of the 465 M&Ms weighed 0.776 gram and it was heavier than 25 of the other M&Ms. What is the percentile of this particular value?

· b. One of the 465 M&Ms weighed 0.876 gram and it was heavier than 322 of the other M&Ms. What is the percentile of this particular value?

· c. One of the 465 M&Ms weighed 0.856 gram and it was heavier than 224 of the other M&Ms. What is the percentile of this particular value?

· 22. Calculating Percentiles. A data set consists of the ages of 87 female actors at the time they won an Academy Award.

· a. One of the women was 50 years old, and she was older than 77 of the other winners. What is the percentile of the age of 50?

· b. One of the women was 25 years old, and she was older than 5 of the other winners. What is the percentile of the age of 25?

· c. One of the women was 80 years old, and she was older than 86 of the other winners. What is the percentile of the age of 80?

· 23. Understanding Standard Deviation. The following four sets of seven numbers all have a mean of 9.

{9, 9, 9, 9, 9, 9, 9}   {8, 8, 9, 9, 9, 10, 10}

{8, 8, 8, 9, 10, 10, 10}   {6, 6, 6, 9, 12, 12, 12}

· a. Construct a histogram for each set.

· b. Give the five-number summary and draw a boxplot for each set.

· c. Compute the standard deviation for each set.

· d. Based on your results, briefly explain how the standard deviation provides a useful single-number summary of the variation in these data sets.

· 24. Understanding Standard Deviation. The following four sets of seven numbers all have a mean of 6.

{6, 6, 6, 6, 6, 6, 6}   {5, 5, 6, 6, 6, 7, 7}

{5, 5, 5, 6, 7, 7, 7}   {3, 3, 3, 6, 9, 9, 9}

· a. Make a histogram for each set.

· b. Give the five-number summary and draw a boxplot for each set.
· c. Compute the standard deviation for each set.
· d. Based on your results, briefly explain how the standard deviation provides a useful single-number summary of the variation in these data sets.

Comparing Data Sets. For each of Exercises 25–28, do the following:

· a. Find the mean, median, and range for each of the two data sets.

· b. Give the five-number summary and draw a boxplot for each of the data sets.

· c. Find the standard deviation for each of the data sets.

· d. Apply the range rule of thumb to estimate the standard deviation of each of the data sets. How well does the rule work in each case? Briefly discuss why it does or does not work well.

· e. Based on all your results, compare and discuss the two data sets in terms of their center and variation.

· 25. The following data sets give the ages in years of a sample of cars in a faculty parking lot and a student parking lot at the College of Portland.

Faculty: 2   3   1   0   1   2   4   3   3   2   1

Student: 5   6   8   2   7   10   1   4   6   10   9

· 26. The following data sets give the driving speeds (in mi/h) of the first nine cars to pass through a school zone and the first nine cars to pass through a downtown intersection.

School: 20   18   23   21   19   18   17   24   25

Downtown: 29   31   35   24   31   26   36   31   28

· 27. The following data sets show the ages of the first seven U.S. Presidents (Washington through Jackson) and seven recent U.S. Presidents (Ford through Obama) at the time of their inaugurations.

First seven: 57   61   57   57   58   57   61

Recent seven: 61   52   69   64   46   54   47

· 28. The following data sets give the approximate lengths (in minutes) of Beethoven’s nine symphonies and Mahler’s nine symphonies.

Beethoven: 28   36   50   33   30   40   38   26   68

Mahler: 52   50   72   72   90   80   85   94   80

· 29. Manufacturing. You are in charge of a manufacturing process that produces car batteries that are supposed to provide 12 volts of power. Manufacturing occurs at two different sites. The first site produces batteries with a mean output of 12.1 volts and a standard deviation of 0.5 volt, while the second site produces batteries with a mean output of 12.2 volts and a standard deviation of 0.1 volt. Which site has better-quality production? Why?

· 30. Managing Complaints. You manage a small ice cream shop in which employees scoop the ice cream by hand. Each night, you total the day’s sales and the total volume of ice cream sold. You find that on nights when an employee named Ben is working, the mean price of the ice cream sold is $1.75 per pint with a standard deviation of $0.05. On nights when an employee named Jerry is working, the mean price of the ice cream sold is $1.70 per pint with a standard deviation of $0.35. Which employee is more likely to be generating complaints that servings are “too small”? Explain.

· 31. Portfolio Standard Deviation. The book Investments, by Zvi Bodie, Alex Kane, and Alan Marcus, claims that the annual percentage returns for investment portfolios with a single stock have a standard deviation of 0.55, while the annual percentage returns for portfolios with 32 stocks have a standard deviation of 0.325. Explain how the standard deviation measures the risk in these two types of portfolios.

· 32. Batting Standard Deviation. For the past 100 years, the mean batting average of major league baseball players has remained fairly constant at about 0.260. However, the standard deviation of batting averages has decreased from about 0.049 in the 1870s to 0.031 today. What does this tell us about the batting averages of players? Based on these facts, would you expect batting averages above 0.350 to be more or less common today than in the past? Explain.

 PROJECTS FOR THE INTERNET & BEYOND

· 33. Secondhand Smoke. At the websites of the American Lung Association and the U.S. Environmental Protection Agency, find statistical data concerning the health effects of secondhand (passive) smoke. Write a short summary of your findings, and state your opinions about whether and how this health issue should be addressed by government.

· 34. Kids and the Media. A recent study by the Kaiser Family Foundation looked at the role of media (for example, television, books, and computers) in the lives of children. The report, which is on the Kaiser Family Foundation website, gives many data distributions concerning, for example, how much time children spend daily with each medium. Study at least three of the distributions in the report that you find particularly interesting. Summarize each distribution in words, and discuss your opinions of the social consequences of the findings.

· 35. Measuring Variation. The range and standard deviation represent different approaches for measuring variation in a data set. Construct two different data sets configured so that the range of the first set is greater than the range of the second set (suggesting that the first set has more variation) but the standard deviation of the first set is less than the standard deviation of the second set (suggesting that the first set has less variation).

IN THE NEWS

· 36. Ranges in the News. Find two examples of data distributions in recent news reports; they may be given either as tables or as graphs. In each case, state the range of the distribution and explain its meaning in the context of the news report. Estimate the standard deviation by applying the range rule of thumb.

· 37. Summarizing a News Data Set. Find an example of a data distribution given in the form of a table in a recent news report. Create a five-number summary and a boxplot for the distribution.

4.4 Statistical Paradoxes

The federal government administers polygraph tests (“lie detectors”) to new applicants for sensitive security jobs. The polygraph tests are reputed to be 90% accurate; that is, they catch 90% of the people who are lying and validate 90% of the people who are truthful. Most people therefore guess that only 10% of the people who fail a polygraph test have been falsely identified as lying. In fact, the actual percentage of false identifications can be much higher—more than 90% in some cases. How can this be?

We’ll discuss the answer soon, but the moral of this story should already be clear: Even when we describe data carefully according to the principles discussed in the first three sections of this chapter, we may still be led to very surprising conclusions. Before we get to the polygraph issue, let’s start with a couple of other statistical surprises.

Better in Each Case, but Worse Overall

Suppose a pharmaceutical company creates a new treatment for acne. To decide whether the new treatment is better than an old treatment, the company gives the old treatment to 90 patients and gives the new treatment to 110 patients. Some patients had mild acne and others had severe acne. Table 4.5 summarizes the results after four weeks of treatment, broken down according to which treatment was given and whether the patient’s acne was mild or severe. If you study the table carefully, you will notice these key facts:

· • Among patients with mild acne, 10 received the old treatment and 2 were cured, for a 20% cure rate; 90 received the new treatment and 30 were cured, for a 33% cure rate.

· • Among patients with severe acne, 80 received the old treatment and 40 were cured, for a 50% cure rate; 20 received the new treatment and 12 were cured, for a 60% cure rate.

 

TABLE 4.5 Results of Acne Treatments

                   Mild acne              Severe acne
                   Cured    Not cured     Cured    Not cured
Old treatment      2        8             40       40
New treatment      30       60            12       8

By the Way

When a set of data gives different results for each of several group comparisons than it does when the groups are taken together, this phenomenon is known as Simpson’s paradox, so named because it was described by Edward Simpson in 1951. However, the same idea was actually described around 1900 by Scottish statistician George Yule.

Notice that the new treatment had a higher cure rate both for patients with mild acne (33% for the new treatment versus 20% for the old) and for patients with severe acne (60% for the new treatment versus 50% for the old). Is it therefore fair for the company to claim that their new treatment is better than the old treatment?

At first, this might seem to make sense. But instead of looking at the data for the mild and severe acne patients separately, let’s look at the overall results:

· • A total of 90 patients received the old treatment and 42 were cured (2 out of 10 with mild acne and 40 out of 80 with severe acne), for an overall cure rate of 42/90 = 46.7%.

· • A total of 110 patients received the new treatment and 42 were cured (30 out of 90 with mild acne and 12 out of 20 with severe acne), for an overall cure rate of 42/110 = 38.2%.

Overall, the old treatment had the higher cure rate, despite the fact that the new treatment had a higher rate for both mild and severe acne cases.

This example illustrates that it is possible for something to appear better in each of two or more group comparisons but actually be worse overall. If you look carefully, you’ll see that this occurs because of the way in which the overall results are divided into unequally sized groups (in this case, mild acne patients and severe acne patients).
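
The reversal can be verified directly from the counts in Table 4.5. The following Python sketch is one way to organize the check; the dictionary layout and the function names `group_rate` and `overall_rate` are hypothetical choices made for this illustration.

```python
# (cured, treated) counts taken from Table 4.5.
results = {
    "old": {"mild": (2, 10), "severe": (40, 80)},
    "new": {"mild": (30, 90), "severe": (12, 20)},
}


def group_rate(treatment, severity):
    """Cure rate for one treatment within one severity group."""
    cured, treated = results[treatment][severity]
    return cured / treated


def overall_rate(treatment):
    """Cure rate for one treatment with both severity groups pooled."""
    groups = results[treatment].values()
    cured = sum(c for c, _ in groups)
    treated = sum(t for _, t in groups)
    return cured / treated


for severity in ("mild", "severe"):
    old = group_rate("old", severity)
    new = group_rate("new", severity)
    print(f"{severity}: old {old:.1%}, new {new:.1%}")

print(f"overall: old {overall_rate('old'):.1%}, new {overall_rate('new'):.1%}")
# overall: old 46.7%, new 38.2%
```

The pooled rates are weighted by group size. Most patients on the new treatment (90 of 110) had mild acne, the group with the lower cure rates, while most patients on the old treatment (80 of 90) had severe acne, the group with the higher cure rates. That imbalance is what reverses the comparison.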

Example 1 Who Played Better?

Table 4.6 gives the shooting performance of two players in each half of a basketball game. Sheryl had a higher shooting percentage in both the first half (40% to 25%) and the second half (75% to 70%). Can Sheryl claim that she had the better game?

TABLE 4.6 Basketball Shots

              First half                       Second half
Player        Baskets  Attempts  Percent      Baskets  Attempts  Percent
Sheryl        4        10        40%          3        4         75%
Candace       1        4         25%          7        10        70%

SOLUTION

No, and we can see why by looking at the overall game statistics. Sheryl made a total of 7 baskets (4 in the first half and 3 in the second half) on 14 shots (10 in the first half and 4 in the second half), for an overall shooting percentage of 7/14 = 50%. Candace made a total of 8 baskets on 14 shots, for an overall shooting percentage of 8/14 = 57.1%. Surprisingly, even though Sheryl had a higher shooting percentage in both halves, Candace had a better overall shooting percentage for the game.

Does a Positive Mammogram Mean Cancer?

We often associate tumors with cancers, but most tumors are not cancers. Medically, any kind of abnormal swelling or tissue growth is considered a tumor. A tumor caused by cancer is said to be malignant (or cancerous); all other tumors are said to be benign.

By the Way

This mammogram example and the polygraph example that follows it illustrate cases in which conditional probabilities (discussed in Section 6.5) lead to confusion. The proper way of handling conditional probabilities was discovered by the Reverend Thomas Bayes (1702–1761) and is often called Bayes' rule.

Imagine you are a doctor or nurse treating a patient who has a breast tumor. The patient will be understandably nervous, but you can give her some comfort by telling her that only about 1 in 100 breast tumors turns out to be malignant. But, just to be safe, you order a mammogram to determine whether her tumor is one of the 1% that are malignant.

Now, suppose the mammogram comes back positive, suggesting that the tumor is malignant. Mammograms are not perfect, so the positive result does not necessarily mean that your patient has breast cancer. More specifically, let’s assume that the mammogram screening is 85% accurate: It will correctly identify 85% of malignant tumors as malignant and 85% of benign tumors as benign. When you tell your patient that her mammogram was positive, what should you tell her about the chance that she actually has cancer?

Because the mammogram screening is 85% accurate, most people guess that the positive result means that the patient probably has cancer. Studies have shown that most doctors also believe this to be the case and would tell the patient to be prepared for cancer treatment. But a more careful analysis shows otherwise. In fact, the chance that the patient has cancer is still quite small—about 5%. We can see why by analyzing some numbers.

Consider a study in which mammograms are given to 10,000 women with breast tumors. Assuming that 1% of tumors are malignant, 1% × 10,000 = 100 of the women actually have cancer; the remaining 9,900 women have benign tumors. Table 4.7 summarizes the mammogram results. Notice the following:

· • The mammogram screening correctly identifies 85% of the 100 malignant tumors as malignant. Thus, it gives positive (malignant) results for 85 of the malignant tumors; these cases are called true positives. In the other 15 malignant cases, the result is negative, even though the women actually have cancer; these cases are false negatives.

· • The mammogram screening correctly identifies 85% of the 9,900 benign tumors as benign. Thus, it gives negative (benign) results for 85% × 9,900 = 8,415 of the benign tumors; these cases are true negatives. The remaining 9,900 − 8,415 = 1,485 women get positive results in which the mammogram incorrectly identifies their tumors as malignant; these cases are false positives.

 

TABLE 4.7 Summary of Results for 10,000 Mammograms (when in fact 100 tumors are malignant and 9,900 are benign)

                      Tumor is malignant    Tumor is benign          Total
Positive mammogram    85 true positives     1,485 false positives    1,570
Negative mammogram    15 false negatives    8,415 true negatives     8,430
Total                 100                   9,900                    10,000

By the Way

The accuracy of breast cancer screening is rapidly improving; newer technologies, including digital mammograms and ultrasounds, appear to achieve accuracies near 98%. The most definitive test for cancer is a biopsy, though even biopsies can miss cancers if they are not performed with sufficient care. If you have negative tests but are still concerned about an abnormality, ask for a second opinion.

Overall, the mammogram screening gives positive results to 85 women who actually have cancer and to 1,485 women who do not have cancer. The total number of positive results is 85 + 1,485 = 1,570. Because only 85 of these are true positives (the rest are false positives), the chance that a positive result really means cancer is only 85/1,570 = 0.054, or 5.4%. Therefore, when your patient’s mammogram comes back positive, you should reassure her that there’s still only a small chance that she has cancer.
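
The counts in Table 4.7 can be rebuilt from just three inputs: the number of cases, the fraction that are malignant, and the accuracy of the test. The Python sketch below is a minimal illustration; the function name `screening_table` and its parameter names are hypothetical, and it assumes, as the text does, that the same accuracy applies to malignant and benign tumors.

```python
def screening_table(total, prevalence, accuracy):
    """Split `total` cases into the four outcomes of an imperfect test.

    Assumes one accuracy figure applies both to cases with the
    condition and to cases without it.
    """
    with_condition = total * prevalence
    without_condition = total - with_condition
    true_positives = accuracy * with_condition
    false_negatives = with_condition - true_positives
    true_negatives = accuracy * without_condition
    false_positives = without_condition - true_negatives
    return {
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "true_negatives": true_negatives,
        "false_positives": false_positives,
    }


# Inputs from the text: 10,000 tumors, 1% malignant, 85% accuracy.
table = screening_table(total=10_000, prevalence=0.01, accuracy=0.85)

positives = table["true_positives"] + table["false_positives"]
chance_of_cancer = table["true_positives"] / positives
print(f"{chance_of_cancer:.1%}")  # 5.4%
```

Changing `prevalence` or `accuracy` shows how sensitive the result is to each assumption; the 5.4% figure applies only to the numbers used in this example.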

Example 2 False Negatives

Suppose you are a doctor seeing a patient with a breast tumor. Her mammogram comes back negative. Based on the numbers in Table 4.7, what is the chance that she has cancer?

SOLUTION

For the 10,000 cases summarized in Table 4.7, the mammograms are negative for 15 women with cancer and for 8,415 women with benign tumors. The total number of negative results is 15 + 8,415 = 8,430. Thus, the fraction of women with negative results who actually have cancer is 15/8,430 = 0.0018, or slightly less than 2 in 1,000. In other words, the chance that a woman with a negative mammogram has cancer is only about 2 in 1,000, or 0.2%.

THINK ABOUT IT While the chance of having cancer and receiving a negative mammogram is small, it is not zero. Therefore, it might seem like a good idea to biopsy all tumors, just to be sure. However, biopsies involve surgery, which means they can be painful and expensive, among other things. Given these facts, do you think that biopsies should be routine for all tumors? Should they be routine for cases of positive mammograms? Defend your opinion.

Polygraphs and Drug Tests

We’re now ready to return to the question asked at the beginning of this section: How can a 90% accurate polygraph test lead to a surprising number of false identifications? The explanation is very similar to that used in the case of the mammograms.

Suppose the government gives the polygraph test to 1000 applicants for sensitive security jobs. Further suppose that 990 of these 1000 people tell the truth on their polygraph test, while only 10 people lie. For a test that is 90% accurate, we find the following results:

· • Of the 10 people who lie, the polygraph correctly identifies 90%, meaning that 9 fail the test (they are identified as liars) and 1 passes.

· • Of the 990 people who tell the truth, the polygraph correctly identifies 90%, meaning that 90% × 990 = 891 truthful people pass the test and the other 10% × 990 = 99 truthful people fail the test.

Figure 4.16 on the next page summarizes these results. The total number of people who fail the test is 9 + 99 = 108. Of these, only 9 were actually liars; the other 99 were falsely accused of lying. That is, 99 out of 108, or 99/108 = 91.7%, of the people who fail the test were actually telling the truth.

The percentage of people who are falsely accused in any real situation depends on both the accuracy of the test and the proportion of people who are lying. Nevertheless, for the numbers given here, we have an astounding result: Assuming the government rejects applicants who fail the polygraph test, then almost 92% of the rejected applicants were actually being truthful and may have been highly qualified for the jobs.
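
To see how the result depends on the proportion of people who are lying, the calculation can be written as a small Python function. This is a sketch under the same assumption the text makes (one accuracy figure for both liars and truthful people); the function name `fraction_falsely_accused` is hypothetical, and only the first proportion in the loop comes from the text.

```python
def fraction_falsely_accused(accuracy, proportion_lying):
    """Among people who fail the test, return the fraction who were truthful."""
    failing_liars = accuracy * proportion_lying
    failing_truthful = (1 - accuracy) * (1 - proportion_lying)
    return failing_truthful / (failing_liars + failing_truthful)


# 0.01 matches the text (10 liars among 1000 applicants);
# 0.10 and 0.50 are hypothetical values added for comparison.
for proportion_lying in (0.01, 0.10, 0.50):
    share = fraction_falsely_accused(0.90, proportion_lying)
    print(f"{proportion_lying:.0%} lying: {share:.1%} of those who fail are truthful")
```

For the numbers in the text the function returns 99/108, or about 91.7%. As the proportion of liars grows, the share of falsely accused people among those who fail drops sharply, even though the accuracy of the test is unchanged.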

By the Way

A polygraph, often called a “lie detector,” measures a variety of bodily functions including heart rate, skin temperature, and blood pressure. Polygraph operators look for subtle changes in these functions that typically occur when people lie. However, polygraph results have never been allowed as evidence in criminal proceedings. First of all, 90% accuracy is far too low for justice. In addition, studies show that polygraphs are easily fooled by people who train to beat them.

 
Figure 4.16 A tree diagram summarizes results of a 90% accurate polygraph test for 1000 people, of whom only 10 are lying.

THINK ABOUT IT Imagine that you are falsely accused of a crime. The police suggest that, if you are truly innocent, you should agree to take a polygraph test. Would you do it? Why or why not?

Example 3 High School Drug Testing

All athletes participating in a regional high school track-and-field championship must provide a urine sample for a drug test. Those who fail are eliminated from the meet and suspended from competition for the following year. Studies show that, at the laboratory selected, the drug tests are 95% accurate. Assume that 4% of the athletes actually use drugs. What fraction of the athletes who fail the test are falsely accused and therefore suspended without cause?

SOLUTION

The easiest way to answer this question is by using some sample numbers. Suppose there are 1000 athletes who want to participate in the meet. Then 4%, or 40 athletes, actually use drugs; the remaining 960 athletes do not use drugs. In that case, the 95% accurate drug test should return the following results:

· • 95% of the 40 athletes who use drugs, or 0.95 × 40 = 38 athletes, fail the test. The other 2 athletes who use drugs pass the test.

· • 95% of the 960 athletes who do not use drugs pass the test, but 5% of these 960, or 0.05 × 960 = 48 athletes, fail.

The total number of athletes who fail the test is 38 + 48 = 86. But 48 of these athletes who fail the test, or 48/86 = 56%, are actually nonusers. Despite the 95% accuracy of the drug test, more than half of the suspended students are innocent of drug use.

Section 4.4 Exercises

Statistical Literacy and Critical Thinking

· 1. False Positive and False Negative. Professional athletes are routinely barred from participation for using banned substances. For such a test, what is a false positive? What is a false negative? What is a true positive? What is a true negative?

· 2. Positive Test Result. A professional soccer player is given a test for a banned substance. What does it mean when she is told that the result is positive? Do we know from such a positive result whether the player actually used the banned substance?

· 3. Test Result. If you are pulled over while driving and given a breathalyzer test for alcohol, what is the result called if the test incorrectly indicates that you have consumed alcohol?

· 4. Better in Each Half, Worse Overall. When the Giants and Patriots football teams play each other, is it possible for one of the quarterbacks to have a higher passing percentage in each half while having a lower passing percentage for the entire game?

Does It Make Sense? For Exercises 5–8, determine whether the statement makes sense (or is clearly true) or does not make sense (or is clearly false). Explain clearly; not all of these statements have definitive answers, so your explanation is more important than your chosen answer.

· 5. Tennis. Siena wins each of the first two sets of a tennis tournament by winning more games than her opponent in the first set and also winning more games than her opponent in the second set. It follows that Siena won more games than her opponent overall in the first two sets.

· 6. Batteries. A manufacturer uses two different production sites to make batteries for cell phones. There is a defect rate of 2% at one of the sites and a defect rate of 4% at the other site. Therefore, the overall rate of defects must be 3%.

· 7. Test Results. After being tested for the presence of a disease, a patient is told to think positively, so the patient is hoping for a positive test result.

· 8. Test Results. When taking a test for pregnancy, a true negative result is the same as a true positive result.

Concepts and Applications

· 9. Batting Percentages. The table below shows the batting records of two baseball players in the first half (first 81 games) and second half of a season.

First half
Player    Hits    At-bats    Batting average
Josh      50      150        0.333
Jude      10      50         0.200

Second half
Player    Hits    At-bats    Batting average
Josh      35      70         0.500
Jude      70      150        0.467

· Who had the higher batting average in the first half of the season? Who had the higher batting average in the second half? Who had the higher overall batting average?

· 10. Passing Percentages. The table below shows the passing records of two rival quarterbacks in the first half and second half of a football game.

First half
Player    Completions    Attempts    Percentage
Allan     8              20          40%
Abner     2              6           33%

Second half
Player    Completions    Attempts    Percentage
Allan     3              6           50%
Abner     12             25          48%

· Who had the higher completion percentage in the first half? Who had the higher completion percentage in the second half? Who had the higher overall completion percentage?

· 11. Test Scores. The table below shows eighth-grade mathematics test scores in Nebraska and New Jersey. The scores are separated according to the race of the student. Also shown are the state averages for all races.

 

State         White    Nonwhite    Average for all races
Nebraska      281      250         277
New Jersey    283      252         272

· Source: National Assessment of Educational Progress, from Chance magazine.

· a. Which state had the higher scores in both racial categories? Which state had the higher overall average across both racial categories?

· b. Explain how a state could score lower in both categories and still have a higher overall average.

· c. Now consider the table below, which gives the percentages of whites and nonwhites in each state. Use these percentages to verify that the overall average test score in Nebraska is 277, as claimed in the first table.

 

State         White    Nonwhite
Nebraska      87%      13%
New Jersey    66%      34%

· d. Use the racial percentages to verify that the overall average test score in New Jersey is 272, as claimed in the first table.

· 12. Test Scores. Consider the following table comparing the grade point averages (GPAs) and mathematics SAT scores of high school students in 1988 and 1998 (before the SAT test format was revised).

GPA                Percentage of students    SAT score         Change
                   1988        1998          1988     1998
A+                 4           7             632      629      −3
A                  11          15            586      582      −4
A−                 13          16            556      554      −2
B                  53          48            490      487      −3
C                  19          14            431      428      −3
Overall average                              504      514      +10

· Source: Cited in Chance, Vol. 12, No. 2, 1999, from data in the New York Times, September 2, 1999.

· a. In general terms, how did the SAT scores of the students in the five grade categories change between 1988 and 1998?

· b. How did the overall average SAT score change between 1988 and 1998?

· c. How is this an example of Simpson’s paradox?


· 13. Tuberculosis Deaths. The following table shows deaths due to tuberculosis (TB) in New York City and Richmond, Virginia, in 1910.

New York City
Race        Population    TB deaths
White       4,675,000     8400
Nonwhite    92,000        500
Total       4,767,000     8900

Richmond
Race        Population    TB deaths
White       81,000        130
Nonwhite    47,000        160
Total       128,000       290

· Source: Cohen and Nagel, An Introduction to Logic and Scientific Method, Harcourt, Brace and World, 1934.

· a. Compute the death rates for whites, nonwhites, and all residents in New York City.

· b. Compute the death rates for whites, nonwhites, and all residents in Richmond.

· c. Explain why the results might seem surprising or paradoxical, and how the paradox arises.

· 14. Weight Training. Two cross-country running teams, called the Gazelles and the Cheetahs, participated in a (hypothetical) study in which 50% of the Gazelles and 65% of the Cheetahs used weight training to supplement a running workout. The remaining runners did not use weight training. At the end of the season, the mean improvement in race times (in seconds) for each team was as shown in the table below.

 

 

Mean improvement in race times
Team        Weight training    No weight training    Team average
Gazelles    10 sec             2 sec                 6.0 sec
Cheetahs    9 sec              1 sec                 6.2 sec

· Describe how the results recorded in the table might seem surprising or paradoxical.

· 15. Basketball Records. Consider the following hypothetical basketball records for Spelman and Morehouse Colleges.

 

              Spelman College       Morehouse College
Home games    10 wins, 19 losses    9 wins, 19 losses
Away games    12 wins, 4 losses     56 wins, 20 losses

· a. Give numerical evidence to support the claim that Spelman College has a better team than Morehouse College.

· b. Give numerical evidence to support the claim that Morehouse College has a better team than Spelman College.

· c. Which claim do you think makes more sense? Why?

· 16. Better Drug. Two drugs, A and B, were tested on a total of 2000 patients, half of whom were women and half of whom were men. Drug A was given to 900 patients and Drug B to 1100 patients. The results appear in the following table.

 

          Women               Men
Drug A    5 of 100 cured      400 of 800 cured
Drug B    101 of 900 cured    196 of 200 cured

· a. Give numerical evidence to support the claim that Drug B is more effective than Drug A.

· b. Give numerical evidence to support the claim that Drug A is more effective than Drug B.

· c. Which claim do you think makes more sense? Why?

· 17. Polygraph Test. The results in the table below are from experiments conducted by researchers Charles R. Honts (Boise State University) and Gordon H. Barland (Department of Defense Polygraph Institute). In each case, it was known whether the subject lied, so the table indicates when the polygraph test was correct.

 

 

                                                          Did the subject actually lie?
                                                          No        Yes
Polygraph test indicated that the subject lied.           15        42
Polygraph test indicated that the subject did not lie.    32        9

· a. Based on the test results, how many subjects appeared to be lying? Of these, how many were actually lying and how many were telling the truth? What percentage of those who appeared to be lying were not actually lying?

· b. Based on the test results, how many subjects appeared to be telling the truth? Of those, how many were actually telling the truth? What percentage of those who appeared to be telling the truth were actually truthful?

· 18. Disease Test. Suppose a test for a disease is 80% accurate for those who have the disease (true positives) and 80% accurate for those who do not have the disease (true negatives). Within a sample of 4000 patients, the incidence rate of the disease matches the national average, which is 1.5%.

                 Disease    No disease    Total
Test positive    48         788           836
Test negative    12         3152          3164
Total            60         3940          4000

· a. Of those with the disease, what percentage test positive?

· b. Of those who test positive, what percentage have the disease? Compare this result to the one from part (a), and explain why they are different.

· c. Suppose a patient tests positive for the disease. As a doctor using this table, how would you describe the patient’s chance of actually having the disease? Compare this figure to the overall incidence rate of the disease.

Further Applications

· 19. Hiring Statistics. (This problem is based on an example in the column “Ask Marilyn” in Parade magazine.) A company decided to expand, so it opened a factory, generating 455 jobs. For the 70 white-collar positions, 200 males and 200 females applied. Of the females who applied, 20% were hired, while only 15% of the males were hired. Of the 400 males applying for the blue-collar positions, 75% were hired, while 85% of the 100 females who applied were hired. How does looking at the white-collar and blue-collar positions separately suggest a hiring preference for women? How do the combined white-collar and blue-collar hirings suggest a hiring preference for men?


· 20. Drug Trials. (This problem is based on an example from the column “Ask Marilyn” in Parade magazine.) A company runs two trials of two treatments for an illness. In the first trial, Treatment A cures 20% of the cases (40 out of 200) and Treatment B cures 15% of the cases (30 out of 200). In the second trial, Treatment A cures 85% of the cases (85 out of 100) and Treatment B cures 75% of the cases (300 out of 400). Which treatment had the better cure rate in the two trials individually? Which treatment had the better overall cure rate? Explain why the results might seem surprising or paradoxical, and how the paradox can be resolved.

· 21. HIV Risks. The New York State Department of Health estimates a 10% rate of HIV infection among the at-risk population and a 0.3% rate in the general population. Tests for HIV are 95% accurate in detecting both true negatives and true positives. Random selection and testing of 5,000 at-risk people and 20,000 people from the general population results in the following table.

 

At-risk population
                Test positive    Test negative
Infected        475              25
Not infected    225              4,275

General population
                Test positive    Test negative
Infected        57               3
Not infected    997              18,943

· a. Verify that incidence rates for the general and at-risk populations are 0.3% and 10%, respectively. Also verify that detection rates for the general and at-risk populations are 95%.

· b. Consider the at-risk population. Of those with HIV, what percentage test positive? Of those who test positive, what percentage have HIV? Explain why these two percentages are different.

· c. Suppose a person in the at-risk category tests positive for HIV. As a doctor using this table, how would you describe the patient’s chance of actually having HIV? Compare this figure with the overall incidence rate of HIV infection.

· d. Consider the general population. Of those with HIV, what percentage test positive? Of those who test positive, what percentage have HIV? Explain why these two percentages are different.

· e. Suppose a person in the general population tests positive for HIV. As a doctor using this table, how would you describe the patient’s chance of actually having HIV? Compare this figure to the overall incidence rate of HIV infection.

 PROJECTS FOR THE INTERNET & BEYOND

· 22. Polygraph Arguments. Visit websites either opposing or supporting the use of polygraph tests. Summarize the arguments on both sides, specifically noting the role that false negative rates play in the discussion.

· 23. Drug Testing. Explore the issue of drug testing either in the workplace or in athletic competitions. Discuss the legality of drug testing in these settings and the accuracy of the tests that are commonly conducted.

· 24. Cancer Screening. Investigate recommendations concerning routine screening for some type of cancer (for example, breast cancer, prostate cancer, or colon cancer). Explain how the accuracy of the screening test is measured. How is the test useful? How can its results be misleading?

IN THE NEWS

· 25. Polygraphs. Find a recent article in which someone or some group proposes a polygraph test to determine whether a person is being truthful. In light of what you know about polygraph tests, do you think the results will be meaningful? Why or why not?

· 26. Drug Testing and Athletes. Find a news report concerning drug testing of athletes. Summarize how the testing is being used, and discuss whether the testing is reliable.


Chapter 4 Review Exercises

· 1. Chocolate Chips. Listed below are counts of the numbers of chocolate chips in two different types of cookies.

Chips Ahoy (regular):        22  22  26  24  23  27  25  20  24  26
Chips Ahoy (reduced fat):    13  24  18  16  21  20  14  20  18  12

· a. Find the mean and median for each of the two data sets.

· b. Find the range and standard deviation for each of the two data sets.

· c. Use the same scale to construct a boxplot for each of the two data sets.

· d. Apply the range rule of thumb to estimate the standard deviation of each of the two data sets. How well does the rule work in each case? Briefly discuss why it does or does not work well.

· e. Based on all your results, compare and discuss the two data sets in terms of their center and variation. Does there appear to be a difference between the numbers of chocolate chips in Chips Ahoy regular cookies and Chips Ahoy reduced fat cookies?

· 2. Combine the two data sets from Review Exercise 1 and find the following:

· a. The mode

· b. The percentile for 20 chocolate chips

· 3.

· a. What is the standard deviation for a data set of 50 values, all of which are the same?

· b. Which of the following two car batteries would you prefer to buy, and why?

· • One taken from a population with a mean lifetime of 48 months and a standard deviation of 2 months

· • One taken from a population with a mean lifetime of 48 months and a standard deviation of 6 months

· c. If an outlier is included in a sample of 50 values, what is the effect of the outlier on the mean?

· d. If an outlier is included in a sample of 50 values, what is the effect of the outlier on the median?

· e. If an outlier is included in a sample of 50 values, what is the effect of the outlier on the range?

· f. If an outlier is included in a sample of 50 values, what is the effect of the outlier on the standard deviation?

Chapter 4 Quiz

· 1. When you add the earthquake magnitudes 2.45, 3.62, 3.06, 3.30, and 1.09, then divide by the number of values, the result is 2.704. Which term best describes this value: average, mean, median, mode, or standard deviation?

· 2. Find the median of the magnitudes given in Exercise 1.

· 3. What is the range of the magnitudes given in Exercise 1?

· 4. The standard deviation of the magnitudes given in Exercise 1 is 0.999. What characteristic does that value measure?

· 5. Use the range rule of thumb to estimate the standard deviation of the earthquake magnitudes given in Exercise 1. How close is the result to the actual standard deviation of 0.999?

· 6. Find the standard deviation of these earthquake magnitudes: 2.99, 2.58, 2.44, 2.91, 3.38.

· 7. A histogram is constructed for a large set of pulse rates of adult males, and it is found that the distribution is symmetric and unimodal. What does this imply about the values of the mean and median?

· 8. Which of the following statements could apply to a data set consisting of 1000 values that are all different?

· a. The 20th percentile is greater than the 30th percentile.

· b. The median is greater than the first quartile.

· c. The third quartile is greater than the first quartile.

· d. The mean is equal to the median.

· e. The range is zero.

· 9. Which of the following statistics would be best for monitoring the consistency of the weights of Dunkin’ Donuts regular doughnuts: mean, median, mode, range, standard deviation?

· 10. Identify the names of the components that constitute the five-number summary for a data set.


·  Focus On The Stock Market: What’s Average about the Dow?

·

· As “averages” go, this one is extraordinary. You can’t watch the news without hearing what happened to it, and many people spend hours tracking it each day. It is by far the most famous indicator of stock market performance. We are talking, of course, about the Dow Jones Industrial Average, or DJIA for short. But what exactly is it?

· The easiest way to understand the DJIA is by looking at its history. As the modern industrial era got under way in the late 19th century, most people considered stocks to be dangerous and highly speculative investments. One reason was a lack of regulation that made it easy for wealthy speculators, unscrupulous managers, and corporate raiders to manipulate stock prices. But another reason was that, given the complexities of daily stock trading, even Wall Street professionals had a hard time figuring out whether stocks in general were going up (a “bull market”) or down (a “bear market”). Charles H. Dow, the founder (along with Edward D. Jones) and first editor of the Wall Street Journal, believed he could rectify this problem by creating an “average” for the stock market as a whole. If the average was up, the market was up, and if the average was down, the market was down.

· To keep the average simple, Dow chose 12 large corporations to include in his average. On May 26, 1896, he added the stock prices of these 12 companies and divided by 12, finding a mean stock price of $40.94. This was the first value for the DJIA. As Dow had hoped, it suddenly became easy for the public to follow the market’s direction just by comparing his average from day to day, month to month, or year to year.

· The basic idea behind the DJIA is still the same, although the list now includes 30 stocks rather than 12; the list is selected by the editors of the Wall Street Journal, who occasionally change the stocks on the list. However, the DJIA is no longer the mean price of its 30 stocks. Instead, it is calculated by adding the prices of its 30 stocks and dividing by a special divisor. Because of this divisor, we now think of the DJIA as an index that helps us keep track of stock values, rather than as an actual average of stock prices.

· The divisor is designed to preserve continuity in the underlying value represented by the DJIA, and it therefore must change whenever the list of 30 stocks changes or when a company on the list has a stock split. A simple example shows why the divisor must change when the list changes. Suppose the DJIA consisted of only 2 stocks (rather than 30): Stock A with a price of $100 and Stock B with a price of $50. The mean price of these two stocks is ($100 + $50)/2 = $75. Now, suppose that we change the list by replacing Stock B with Stock C, whose price is $200. The new mean is ($100 + $200)/2 = $150, so merely replacing one stock on the list would raise the mean price from $75 to $150. Therefore, to keep the “value” of the DJIA constant when we change this list, we must divide the new mean of $150 by 2. In this way, the DJIA has a value of 75 both before and after the list change, but we can no longer think of this 75 as a mean price in dollars.

· To see why a stock split changes the divisor, again suppose the index consists of just two stocks: Stock X at $100 and Stock Y at $50, for a mean price of $75. Now, suppose Stock X undergoes a two-for-one stock split, so that its new price is $50. With both stocks now priced at $50, the mean price after the stock split would also be $50. In other words, even though a stock split does not affect a company’s total value (it only changes the number and prices of its shares), we’d find a drop in the mean price from $75 to $50. In this case, we can preserve continuity by dividing the new mean of 50 by 2/3 (which is equivalent to multiplying by 3/2) so that the DJIA holds at 75 both before and after the stock split.
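The two-stock examples above can be reproduced with a short Python sketch. It follows the description of the index as the sum of the listed prices divided by a divisor. The helper names `index_value` and `continuity_divisor` are illustrative assumptions, and the real DJIA calculation involves details not covered here.

```python
from fractions import Fraction


def index_value(prices, divisor):
    """Index = (sum of the listed prices) / divisor."""
    return Fraction(sum(prices)) / divisor


def continuity_divisor(new_prices, index_before):
    """Divisor that keeps the index at index_before for new_prices."""
    return Fraction(sum(new_prices)) / index_before


# Start: two stocks at $100 and $50, with divisor 2 (a plain mean).
start = index_value([100, 50], 2)

# List change: the $50 stock is replaced by a $200 stock.
list_change_divisor = continuity_divisor([100, 200], start)
after_list_change = index_value([100, 200], list_change_divisor)

# Stock split: the $100 stock splits two-for-one to $50; the other stays at $50.
split_divisor = continuity_divisor([50, 50], start)
after_split = index_value([50, 50], split_divisor)

print(start, list_change_divisor, after_list_change)
print(split_divisor, after_split)
```

In this sketch the divisor applies to the sum of the prices rather than to their mean. A divisor of 4 after the list change is the same as dividing the new mean of 150 by 2, and a divisor of 4/3 after the split is the same as dividing the new mean of 50 by 2/3. Exact fractions are used so that the index comes out at exactly 75 in every case.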

· Just as in these simple examples, the real divisor changes with every list change or stock split, so it has changed many times since Charles Dow first calculated the DJIA as an actual mean. The current value of the divisor is published daily in the Wall Street Journal.

· Given that there are now well over 10,000 actively traded stocks, it might seem remarkable that a sample of only 30 could reflect overall market activity. But today, when computers make it easy to calculate stock market “averages” in many other ways, we can look at historical data and see that the DJIA has indeed been a reliable indicator of overall market performance. Figure 4.17 shows annual highs for the DJIA from 1965 through 2015.

· If you study Figure 4.17 carefully, you may be tempted to think that you can see patterns that would allow you to forecast precise values of the market in the future. Unfortunately, no one has ever found a way to make reliable forecasts, and most economists now believe that such forecasts are impossible.


Figure 4.17 Annual high values of the Dow Jones Industrial Average, 1965–2015.

Source: Dow Jones & Company.

The futility of trying to forecast the market is illustrated by the story of the esteemed Professor Benjamin Graham, often called the father of “value investing.” In the spring of 1951, one of his students came to him for some investment advice. Professor Graham noted that the DJIA then stood at 250, but that it had fallen below 200 at least once during every year since its inception in 1896. Because it had not yet fallen below 200 in 1951, Professor Graham advised his student to hold off on buying until it did. Professor Graham presumably followed his own advice, but the student did not. Instead, the student invested his “about 10 thousand bucks” in the market right away. As it turned out, the market never did fall below 200 in 1951 or any time thereafter. And the student, named Warren Buffett, became a billionaire many times over.

Questions for Discussion

· 1. The stock market is still considered a riskier investment than, say, bank savings accounts or bonds. Nevertheless, financial advisors almost universally recommend holding at least some stocks, which is quite different from the situation that prevailed a century ago. What role do you think the DJIA played in building investors’ confidence in the stock market?

· 2. The DJIA is only one of many different stock market indices in wide use today. Briefly look up a few other indices, such as the S&P 500, the Russell 2000, and the NASDAQ. How do these indices differ from the DJIA? Do you think that any of them should be considered more reliable indicators of the overall market than the DJIA? Why or why not?

· 3. The 30 stocks in the DJIA represent a sample of the more than 10,000 actively traded stocks, but it is not a random sample because it is chosen by particular editors for particular reasons that may include personal biases. Suppose that you chose a random sample of 30 stocks and tracked their prices. Do you think that such a random sample would track the market as well as the stocks in the DJIA? Why or why not?

· 4. Create your own “portfolio” of 10 stocks that you’d like to own, and assume you own 100 shares of each. Calculate the total value of your portfolio today, and track price changes over the next month. At the end of the month, calculate the percent change in the value of your portfolio. How did the performance of your portfolio compare to the performance of the DJIA during the month? If you really owned these stocks, would you continue to hold them or would you sell? Explain.


·  Focus On Economics: Are the Rich Getting Richer?

·

· The idea that income inequality has been growing, meaning that the rich are getting richer while the rest of us are left behind, has been a major topic in recent political debates. But is it true?

· If we wish to draw general conclusions about how the average person is faring compared to the rich, we must look at the overall income distribution. Economists have developed a number, called the Gini Index, that is used to describe the level of equality or inequality in the income distribution. The Gini Index is defined so that it can range only between 0 and 1. A Gini Index of 0 indicates perfect income equality, in which every person has precisely the same income. A Gini Index of 1 indicates perfect inequality, in which a single person has all the income and no one else has anything. Figure 4.18 shows the Gini Index in the United States since 1947. Note that the Gini Index fell from 1947 to 1968, indicating that the income distribution became more uniform during this period. The Gini Index has generally risen ever since, indicating that the rich are, indeed, getting richer.
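The text does not say how the Gini Index is calculated. As a rough illustration only, the Python sketch below uses one common textbook definition: half the mean absolute difference between all pairs of incomes, divided by the mean income. The Census Bureau's actual procedure may differ, and the function name `gini` and the sample incomes are hypothetical.

```python
def gini(incomes):
    """Gini index from the mean absolute difference between all pairs.

    Equivalent to sum(|x_i - x_j|) / (2 * n^2 * mean), written here with
    the total income in place of n * mean.
    """
    n = len(incomes)
    total = sum(incomes)
    if n == 0 or total <= 0:
        raise ValueError("need at least one income and a positive total")
    abs_diff_sum = sum(abs(a - b) for a in incomes for b in incomes)
    return abs_diff_sum / (2 * n * total)


# Hypothetical incomes: everyone earns the same amount.
equal_incomes = [30000, 30000, 30000, 30000]

# Hypothetical incomes: one person receives all the income.
one_has_everything = [0, 0, 0, 100000]

print(gini(equal_incomes))
print(gini(one_has_everything))
```

With this formula, perfect equality gives exactly 0. For a finite group of n people, the "one person has everything" case gives (n − 1)/n, which is 0.75 for four people and approaches 1 as the group grows.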

·

· Figure 4.18 Gini Index for families and households, 1947–2015. Household data, which include single people and households in which the members are not part of the same family, have been collected only since 1967. The dashed segments for 1992–1993 indicate a change in the methodology for data collection, so the corresponding rise in the Gini Index may be partially or wholly due to this change rather than a real change in income inequality.

· Source: Adapted from data by the U.S. Census Bureau.

· Although the Gini Index provides a simple single-number summary of income inequality, the number itself is fairly difficult to interpret (and to calculate). An alternative way to look at the income distribution is to study income quintiles, which divide the population into fifths by income. Often, the highest quintile is further broken down to show how the top 5% of income earners compares with others.

· Figure 4.19 shows the share of total income received by each quintile and by the top 5% in the United States in different decades. The height of each bar (the number on top of it) represents the share of total income. For example, the 3.1 on the bar for the lowest quintile in 2015 means that the poorest 20% of the population received only 3.1% of the total income in the United States. Similarly, the 51.2 on the bar for the highest quintile in 2015 means that the wealthiest 20% of the population received 51.2% of the income. Note also that the wealthiest 5% received 21.9% of the income—nearly double the total income of the poorest 40% of the population. If you study this graph carefully, you’ll see that the share of income earned by the first four quintiles—which means all but the wealthiest 20% of the population—dropped since 1975. Meanwhile, the share earned by the wealthiest 20% rose substantially, as did the share of the top 5%. In other words, this graph also confirms that the rich have been getting richer compared to most of the population.
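To make the idea of quintile shares concrete, here is a minimal Python sketch that sorts a list of incomes, splits it into fifths, and reports each fifth's percentage of the total. The function name `quintile_shares` and the ten sample incomes are hypothetical and are not taken from the Census Bureau data behind the figure. For simplicity the sketch assumes the number of incomes is a multiple of 5.

```python
def quintile_shares(incomes):
    """Percentage of total income received by each fifth, lowest to highest."""
    ordered = sorted(incomes)
    n = len(ordered)
    if n == 0 or n % 5 != 0:
        raise ValueError("this sketch assumes a non-empty list whose length is a multiple of 5")
    size = n // 5
    total = sum(ordered)
    return [
        100 * sum(ordered[i * size:(i + 1) * size]) / total
        for i in range(5)
    ]


# Hypothetical household incomes, in thousands of dollars.
sample_incomes = [10, 20, 30, 40, 50, 60, 70, 80, 90, 550]

shares = quintile_shares(sample_incomes)
for label, share in zip(["Lowest", "Second", "Third", "Fourth", "Highest"], shares):
    print(f"{label} quintile: {share:.1f}% of total income")
```

The five shares always add up to 100%, so a rising share for the highest quintile necessarily means a falling combined share for the other four.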


Figure 4.19 Share of total household income by quintile (and for top 5%) at 10-year intervals.

Source: U.S. Census Bureau.

Now that we’ve established that the rich are getting richer, the next question is whether it matters. Most people, including most economists, have traditionally assumed that rising income inequality is bad for democracies. From this point of view, growing income inequality would be a significant problem. But not all economists agree. One argument they make points to a widely accepted ethical condition called the Pareto criterion, after the Italian economist Vilfredo Pareto (for whom Pareto charts are also named): Any change is good if it makes someone better off without making anyone else worse off. These economists argue that the Pareto criterion has been satisfied because overall growth in the U.S. economy has led to rising living standards for people of all income levels, implying that few people have been harmed by the rise in inequality. A related argument focuses on income mobility. For example, as recently as 1980, 60% of the “Forbes 400” (the richest 400 people in the United States) had inherited most of their wealth. Today, less than 20% of the Forbes 400 represents old money. The implication is that while you had to be born rich in the past, today you can become rich by getting educated and working hard. Surely, it is a good thing to encourage education and hard work.

It’s also worth noting that while overall income inequality in the United States has increased, the income inequality among different races and between men and women has decreased. In other words, it is now easier than it was in the past for African Americans, Hispanic Americans, and women to earn as much as white males, though there is still a gap.

Questions for Discussion

· 1. Compare several different ways of looking at the data shown in Figure 4.18 and Figure 4.19. For example, does one seem to indicate a larger change in income inequality than the other? Can you think of other possible ways to display income data that might give a different picture than those shown here?

· 2. Do you agree that the Pareto criterion is a good way to evaluate the ethics of economic change? Why or why not?

· 3. Overall, do you think the increase in income inequality has been a good or bad thing for the United States? Will it be good if the trend continues? Defend your opinion.

· 4. Although economic data suggest that the vast majority of Americans are better off today than they were a few decades ago, the poorest Americans still live in difficult economic conditions. What do you think can or should be done to help improve the lives of the poor? Can your suggestion be implemented without harming the overall economy? Explain.
