Suppose you believe that, in general, American men are happier than women. You can't measure the happiness of every single American, but you can take a large sample that includes both men and women, give each person a "happiness test", then compare the average happiness of the men with the average happiness of the women. According to your hypothesis, the men's average should be higher than the women's.
Suppose you measure happiness on a 7-point scale like this:
1 ------ 2 ------ 3 ------ 4 ------ 5 ------ 6 ------ 7
Extremely                Neither Happy             Absolutely
Unhappy                  nor Unhappy               Ecstatic
And suppose you do this with a sample of 200 people (50% men) and find that the men's average is 4.74 and the women's average is 5.74 -- a difference of 1 point on a 7-point scale. Does this mean the hypothesis was wrong? That in fact, in the US population as a whole, women are actually happier than men (and by about 1 point)? Well, it depends. First of all, the sample has to have been drawn using a probability sampling method like Simple Random Sampling. Second, because it is a sample, there is always a chance that the sample is just screwy -- really different from the population in general.
For example, suppose that in the population of 130 million women and 130 million men, on average the men are happier than women, as you suggest. Suppose the actual population frequencies are:
|Happiness|Men (millions)|Women (millions)|
|1 (Extremely Unhappy)|10|20|
|2|14|22|
|3|18|24|
|4|22|22|
|5|24|18|
|6|22|14|
|7 (Absolutely Ecstatic)|20|10|
|Total:|130|130|
The average happiness for the men is [(10x1)+(14x2)+(18x3)+(22x4)+(24x5)+(22x6)+(20x7)]/130 = 4.4. The average happiness of the women is [(20x1)+(22x2)+(24x3)+(22x4)+(18x5)+(14x6)+(10x7)]/130 = 3.6. But even though the men are happier on average, there are still 10 million men who are extremely unhappy, and millions more that are less than completely happy.
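The two weighted averages can be checked with a few lines of Python (the frequency counts, in millions, are taken from the table above):

```python
# Population frequencies (in millions) for each happiness score, 1-7.
men   = {1: 10, 2: 14, 3: 18, 4: 22, 5: 24, 6: 22, 7: 20}
women = {1: 20, 2: 22, 3: 24, 4: 22, 5: 18, 6: 14, 7: 10}

def weighted_average(freq):
    """Average score, weighting each score by how many people report it."""
    total = sum(freq.values())
    return sum(score * count for score, count in freq.items()) / total

print(weighted_average(men))    # 4.4
print(weighted_average(women))  # 3.6
```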
Now I come along and take a sample of 100 men and 100 women from the population of 130 million. By chance, it is possible that for my sample of women, the 100 names I draw out of a hat just happen to be people who are relatively happy: a lot of 5s, 6s and 7s. Similarly, the 100 men's names I draw from a hat could just happen to be relatively unhappy: a lot of 1s, 2s and 3s. If that happened, I would conclude, quite incorrectly, that American men are much less happy than women.
It is even possible, purely by accident, to choose all 100 men from category 1 and all 100 women from category 7, giving an answer that is wildly different from the population truth. With random sampling this is extremely unlikely, but it is possible.
The big question is: how far off is my sample likely to be? How much confidence can I have in the results?
To explain this, it is easiest to forget about the difference between men and women, and just pick one sex. Suppose we want to know what the average height is of American women, but we have to estimate this from a sample. We get a random sample of 1000 women, and the average height is 65 inches. We think of that sample average as an estimate of the population average. How close is that estimate likely to be to the true population average?
Here is how to think about this. A sample of 1000 people is itself an element from a population: the population of all possible samples of 1000 people. In a way, your sample is one drawing from the hat that contains all possible samples. Most of these samples have a nice mixture of heights in them. But some samples are really skewed. For example, there is one possible sample which consists of the 1000 tallest women in America. This sample gives the highest average height. There is another possible sample which consists of the 1000 shortest women in America. This sample gives the lowest average height. All the other samples yield an average height somewhere in between.
Suppose we could actually draw every possible sample, and for each one calculate the average height within the sample. If we then plotted the average heights of all samples in a histogram, it turns out that we would get a bell curve whose center would be (a) the average of all the sample averages, and (b) the population average.
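You can see this bell curve of sample averages in a small simulation. The population here is hypothetical -- 100,000 heights drawn from a bell curve with mean 65 inches and SD 2.5 inches -- and we take many samples of 1000 rather than literally all possible samples:

```python
import random
import statistics

random.seed(0)
# Hypothetical population of 100,000 women's heights (inches).
population = [random.gauss(65, 2.5) for _ in range(100_000)]

# Draw 1,000 different samples of size 1000; record each sample's average.
sample_means = [statistics.mean(random.sample(population, 1000))
                for _ in range(1_000)]

# The sample averages cluster tightly around the population average (65)...
print(statistics.mean(sample_means))
# ...and their spread (SD) is far smaller than the spread of individual heights.
print(statistics.stdev(sample_means))
```

Plotting `sample_means` as a histogram would show the bell curve described above, centered on the population average.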
Now, in a real research situation, you don't take all possible samples, you just take one. The question is, which one did you get? Is it one of the wierdos that gives an average height that is totally different from all the others? Or is it pretty close to the average?
Well, if you knew the standard deviation of the histogram of all possible samples, you could answer that easily. It's a bell-shaped curve, so we know that 95% of all the values will be within 2 standard deviations (plus or minus) of the mean. (Actually, it's 1.96 standard deviations, not 2, but that's a detail.) So if the standard deviation is 1.5 inches, that means that 95% of the samples will come within 3 inches of the actual (unknown) population average. [Here's the important part:] At the same time, this means that you have a 95% chance that your sample falls within that interval of ±3 inches from the population average. See, you picked your sample at random from among all the possible samples. You have no idea which sample you actually got. But 95% of all samples fall within 3 inches of the population average, so there is a 95% chance that your sample falls within 3" of the population average.
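The "95% of values fall within 1.96 standard deviations" fact about bell curves can be verified directly with Python's standard library:

```python
from statistics import NormalDist

# Standard normal curve: mean 0, SD 1.
nd = NormalDist()

# Area under the curve between -1.96 and +1.96 standard deviations.
coverage = nd.cdf(1.96) - nd.cdf(-1.96)
print(round(coverage, 4))  # 0.95
```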
Ok, but how do we know what the standard deviation of the histogram of all possible samples is? Well, it turns out that this standard deviation (which is called the Standard Error, or SE) is related to the standard deviation of the sample variable. Strange but true! The formula relating the two is this:
SE = SD/sqrt(N)
In other words, take the SD of the sample, divide by the square root of the sample size, and you have an estimate of the standard error (which is the standard deviation of the histogram of all possible samples).
When the size of the sample is small (a few hundred or less), we use a slightly different formula to calculate the SD. We call this SD+. You can compute SD+ from the regular SD as follows:
SD+ = sqrt[N/(N-1)] × SD
In other words, multiply the regular SD times the square root of N/(N-1), where N is the sample size.
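The two formulas can be wrapped in a pair of small helpers (the function names `sd_plus` and `standard_error` are mine, chosen for clarity):

```python
import math

def sd_plus(sd, n):
    """SD+ = sqrt(N/(N-1)) x SD -- the small-sample adjustment."""
    return math.sqrt(n / (n - 1)) * sd

def standard_error(sd, n):
    """SE = SD+ / sqrt(N).  Algebraically this equals SD / sqrt(N - 1)."""
    return sd_plus(sd, n) / math.sqrt(n)

# With SD = 2.0 and N = 5: SD+ = sqrt(5) ~ 2.24, and SE works out to exactly 1.0.
print(round(sd_plus(2.0, 5), 2))   # 2.24
print(standard_error(2.0, 5))      # 1.0
```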
So here are the steps to create a 95% confidence interval for a sample statistic, such as the average:
1. Compute the average and the SD of the sample.
2. If the sample is small, convert the SD to SD+.
3. Compute the standard error: SE = SD+/sqrt(N).
4. The 95% confidence interval is the sample average ± 1.96×SE.
Suppose we want to know the average GPA of Boston College students. We take a (very small!) sample of 5 students and get these results:
|GPA|X - AVG|(X - AVG)²|
The average GPA of the sample is 2.88. The SD is sqrt(0.8663) = 0.931. The SD+ = sqrt(5/4)×0.931 = 1.04. The SE is therefore 1.04/sqrt(5) = 0.465. Then 1.96×SE = 0.911, so we can be 95% confident that the average GPA for all Boston College students is between 1.97 and 3.79. In other words, 2.88 ± 0.91.
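The arithmetic can be re-traced in a few lines, starting from the sample summaries given above (average 2.88, SD 0.931, N = 5):

```python
import math

avg, sd, n = 2.88, 0.931, 5

sd_plus = math.sqrt(n / (n - 1)) * sd   # small-sample adjustment, ~1.04
se = sd_plus / math.sqrt(n)             # standard error, ~0.465
margin = 1.96 * se                      # half-width of the 95% interval, ~0.91

print(round(avg - margin, 2), round(avg + margin, 2))  # 1.97 3.79
```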
Notice that the 95% confidence interval in this case is very wide -- we hardly needed a sample and statistical analysis to guess that the average GPA is between 1.97 and 3.79! This is because the sample was so small. The larger the sample, the more accurate the estimate and the less wide the confidence interval.
You should also note that it is still possible that the school GPA is NOT between 1.97 and 3.79 -- that is, that it is larger or smaller than that. What are the chances? 5%. That's what a 95% confidence interval means.
Suppose a weight scale at the post office needs to be calibrated from time to time. You do this by weighing a standard weight that you know is exactly 70 grams. Of course, every time you measure something there is measurement error, so you weigh the thing 5 times and take an average. Here's the data:
|Measured Values|X - AVG|(X - AVG)²|
The average is 77.8. The SD of the sample is sqrt(52.16) = 7.22. The SD+ = sqrt(5/4)×7.22 = 8.07. The SE is 8.07/sqrt(5) = 3.61. Then 1.96×SE = 7.08, so we can be 95% confident that the weight of the object according to this machine is 77.8 ± 7.1 grams. In other words, it is somewhere between 70.7 and 84.9.
Notice that the real weight of the object is 70 grams. Since the 95% confidence interval does not include this value, we can conclude that the machine needs recalibration -- the 5 measurements we got are too far from 70 to be the result of chance.
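The same pipeline applied to the calibration summaries above (average 77.8, SD 7.22, N = 5) confirms the conclusion in code:

```python
import math

avg, sd, n = 77.8, 7.22, 5

# SE = SD+ / sqrt(N), which works out to about 3.61 here.
se = math.sqrt(n / (n - 1)) * sd / math.sqrt(n)
low, high = avg - 1.96 * se, avg + 1.96 * se

# The true weight, 70 grams, falls outside the 95% confidence interval,
# so chance alone is an unlikely explanation -- the scale looks miscalibrated.
print(low <= 70 <= high)  # False
```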
Based loosely on Statistics, by Freedman et al.
Copyright ©1996 Stephen P. Borgatti. Revised: October 19, 1998.