The Chi-Square Test

Suppose you want to test the hypothesis (derived from some theory) that, in general, college men are more satisfied with their social life than college women. It would be a big pain to measure the satisfaction of every single college person, but you could do it with a sample of people. Basically, you get a list of all college students, you randomly select a certain number (say, 1000), and you call them up and ask two questions: (1) 'What sex are you?' and (2) 'Are you satisfied with your social life, yes or no?'

Now, according to your hypothesis, the number of men saying 'yes' should be higher than the number of women. Right? Well, not quite. What if the sample has more men in it than women? Then, even if men are unhappier than women, there will still be more 'yeses' among the men simply because there are more of them. The solution, obviously, is to use percentages. The hypothesis is that the percent of men saying 'yes' should be higher than the percent of women who say 'yes'.

Important note: it doesn't matter whether, in absolute terms, it's a large or small percentage of men that say 'yes', as long as it's bigger than the women's. For example, suppose 35% of the men say they are satisfied, while only 15% of the women are. The hypothesis has been supported, even though most men actually said they were dissatisfied.

Now, suppose we did the above, interviewing 100 men and 100 women, and got this result:

            Men   Women
Satisfied:   30      20

Is the hypothesis supported? Well, maybe. It certainly was for the sample of 200. But it was just a sample. It wasn't everybody. What if we did the study all over again, choosing a different sample of 200 people. Would we get the same result? It's unlikely to come out exactly the same. Think about flipping a coin. The chance of it coming up heads is 50% right? So if you flip it ten times, it should come up heads 5 times, right? But it doesn't have to. It is possible to get 6 heads. It is possible to get 7 heads. It is even possible to get 10 heads in a row. It has happened. It may be unlikely, but it can happen.

So suppose that out there in America, there really is no difference in the satisfaction levels between male and female college students. Actually, 25% of all college students, male or female, are satisfied with their social lives. Now we come along and interview 100 students of each sex at random. "At random" means that we essentially flipped a many-sided coin to see which of all the college males in America we would interview. Each male student had an equal chance of being chosen. If there are 1,000,000 college males in the population, then each one had the same chance of being chosen (1 in 10,000). And since 25% of them are satisfied, we should have gotten 25 that were satisfied and 75 that were not. But if we can flip a coin 10 times and not get exactly 5 heads, what are the chances we can pick 100 males out of a million and get exactly 25 that are satisfied? By chance alone, we could have gotten 30. Or 40. Or even 100. After all, there are 250,000 satisfied males out there. You could easily have picked 100 of them.

And the same is true of the women. Our sample of 100 could easily have included 30 satisfied women, or just 10. So when we find that 30% of the men sampled are satisfied while only 20% of the women are, we cannot be sure that the difference of 10 percentage points is not due to chance. This is what is called sampling error, although it is not really an error. After all, when the coin gives you 6 heads instead of 5, do you consider it an error?

So what we would like to have is a sense of the probability of getting a difference like 10 percentage points by chance alone. In other words, we would like to know the chances of observing a 10-point difference in the sample when there is actually no difference in the population (i.e., when the difference we observe is due to sampling variation rather than to a real difference in the population).

That's what significance tests are for. What a significance test does is give you that probability, which is often called a "p-value". A p-value is the probability of observing a difference as large as you actually observed, given that the null hypothesis of no difference is actually true in the population. You can think of it loosely as the probability that you are wrong when you claim that men and women college students have different rates of satisfaction.

So what's a decent p-value? Typically, people say that if the p-value is less than 0.05 (i.e., 5 percent), then that's good enough: it is significant. If the p-value is larger than 0.05, we say it is non-significant. If the p-value is lower than 0.001, we say it is highly significant. What it means is that if the p-value is less than 5%, then it is low enough that we are willing to take the chance that we might be wrong.


Calculating the p-value

Although there is a simpler way to calculate the p-value than the one I'm going to show you, the one we will use is better because we can generalize it to other situations. That way, you only need to learn one statistic instead of several different ones. The method I'm going to teach is called the chi-square test of independence.

The basic idea is to compare the frequencies you got with the frequencies that you would expect to get if there was no difference in satisfaction between men and women in the population. That means that the variable "sex" and the variable "satisfaction with social life" are unrelated or independent of each other.

So let's review basic probability theory for a moment. If two events are independent, then the probability of both occurring at the same time can be calculated by multiplying the probability of one by the probability of the other. So if we flip two coins at the same time, the probability that both will come out heads is 0.5 x 0.5 = 0.25. It should happen about a quarter of the time. Similarly, suppose we flipped a coin and rolled a die (the singular of "dice") at the same time. What's the probability of getting a head and a 4 at the same time? Well it's just 1/2 times 1/6 = 1/12.
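As a quick check on the multiplication rule for independent events, here is a small Python sketch (using the standard-library fractions module, so the arithmetic is exact):

```python
from fractions import Fraction

# probabilities of two independent events
p_heads = Fraction(1, 2)  # coin comes up heads
p_four = Fraction(1, 6)   # die comes up 4

# multiplication rule: P(both) = P(one) * P(other)
p_both = p_heads * p_four
print(p_both)  # 1/12
```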

Therefore, if a person's sex has nothing to do with whether they are satisfied with their social life, then the proportion of the sample who are satisfied males should equal the probability of being male (i.e., the percentage of your sample that was male) times the probability of being satisfied (i.e., the percentage of your sample that was satisfied, regardless of sex).

So, just to make things interesting, let's assume that in our sample of 200, we obtained, by the luck of the draw, 120 females and 80 males, and 25% of the men said they were satisfied while 20% of the women said the same. We can arrange the information in a pair of tables:


          Males   Females   Sampled
Yes          20        24        44
No           60        96       156
Sampled      80       120       200

Column Percentages

          Males   Females   Sampled
Yes         25%       20%       22%
No          75%       80%       78%
Sampled*    40%       60%      100%

* The last row contains not column percentages,
  but rather percentages of the whole sample
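If you want to reproduce the column percentages from the raw frequencies, a minimal Python sketch (the variable names are my own, not from the text) looks like this:

```python
observed = [[20, 24],   # Yes: males, females
            [60, 96]]   # No:  males, females

# column totals: how many males and females were sampled
col_totals = [sum(col) for col in zip(*observed)]  # [80, 120]

# each cell divided by its column total, as a percentage
for label, row in zip(["Yes", "No"], observed):
    pcts = [100 * f / t for f, t in zip(row, col_totals)]
    print(label, pcts)  # Yes [25.0, 20.0], then No [75.0, 80.0]
```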

So if sex is independent of satisfaction, the probability of sampling a satisfied male is 0.4 x 0.22 = 0.088. So we would have expected 8.8 percent of the total sample, or between 17 and 18 people, to be satisfied males (see table below). In other words, I multiply 0.088 by the total sample size (200) to get the expected frequency of satisfied males. Similarly, the probability of getting a satisfied female is 0.6 x 0.22 = 0.132. So we would have expected 13.2 percent of the sample, or between 26 and 27 people, to be satisfied females. We carry out the same calculations for the unsatisfied males and the unsatisfied females.

Calculating Expected Frequencies
(given assumption of independence)

          Males                Females               Sampled
Yes       .4*.22*200 = 17.6    .6*.22*200 = 26.4          44
No        .4*.78*200 = 62.4    .6*.78*200 = 93.6         156
Sampled   80                   120                        200

Expected Frequencies
(given assumption of independence)

          Males   Females   Sampled
Yes        17.6      26.4        44
No         62.4      93.6       156
Sampled      80       120       200

Computationally, instead of bothering to compute the proportion of males (.4) and females (.6) and YESes (.22) and NOs (.78), multiplying the appropriate ones together, and then multiplying by the sample size (200), it is faster to work directly from the raw frequencies: multiply each row sum by each column sum and divide by the sample size, as follows:

Expected Frequencies
(given assumption of independence)

          Males               Females               Sampled
Yes       44*80/200 = 17.6    44*120/200 = 26.4          44
No        156*80/200 = 62.4   156*120/200 = 93.6        156
Sampled   80                  120                        200
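The row-sum-times-column-sum shortcut is easy to code. Here is a small Python sketch (my own variable names) that reproduces the expected-frequency table:

```python
observed = [[20, 24],   # Yes: males, females
            [60, 96]]   # No:  males, females

row_totals = [sum(row) for row in observed]        # [44, 156]
col_totals = [sum(col) for col in zip(*observed)]  # [80, 120]
n = sum(row_totals)                                # 200

# expected frequency for each cell: row total * column total / sample size
expected = [[r * c / n for c in col_totals] for r in row_totals]
print(expected)  # [[17.6, 26.4], [62.4, 93.6]]
```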

Now we compare the expected frequencies with the observed frequencies. For each cell in the table, we take the observed frequency, subtract the expected value, square the difference, and divide it by the expected value. Then we sum all those quantities to get a single value called chi-square. In symbols, we calculate

    (Observed - Expected)^2 / Expected

for each of the four cells in the table and add them up.

Observed Minus Expected
(given assumption of independence)

          Males   Females   Sampled
Yes         2.4      -2.4         0
No         -2.4       2.4         0
Sampled       0         0         0

Squared Differences
(given assumption of independence)

          Males   Females   Sampled
Yes        5.76      5.76     11.52
No         5.76      5.76     11.52
Sampled   11.52     11.52     23.04

Squared Differences Divided By Expected
(given assumption of independence)

          Males   Females   Sampled
Yes       0.327     0.218     0.545
No        0.092     0.062     0.154
Sampled    0.42      0.28       0.7

For our data, the value of chi-square is about 0.7. This is an unusually small value (I'll explain what it means in a moment). Often, chi-square values are numbers like "32.87" or "116.2". The smallest chi-square value possible is 0, but there is no upper bound: it depends on the size of the numbers.
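The whole chi-square sum can be reproduced in a few lines of Python (a sketch, using the observed and expected counts from the tables above):

```python
observed = [20, 24, 60, 96]          # from the observed-frequency table
expected = [17.6, 26.4, 62.4, 93.6]  # from the expected-frequency table

# sum of (Observed - Expected)^2 / Expected over all four cells
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 3))  # 0.699, i.e. about 0.7
```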

Notice that the smaller the difference between observed and expected, the smaller the value of chi-square will be. Chi-square is zero only when there is absolutely no difference between the observed and the expected. So when will chi-square be small? Whenever the sample data are consistent with the null hypothesis that there is no difference in satisfaction between males and females in the population. So if we were hoping that there really was a difference between males and females, we want chi-square to be large.

But how large is large? The maximum value of chi-square is ... big. What we want to know is what is the probability of getting a chi-square as large as actually observed, given that in the population the variables are independent of each other. The probabilities are given in a chi-square table such as this one:

Table of Chi-Square Values

df \ P 0.1 0.050 0.025 0.010 0.005 0.001
1 2.7055 3.8414 5.0238 6.6349 7.8794 10.828
2 4.6051 5.9914 7.3777 9.2103 10.5966 13.816
3 6.2513 7.8147 9.3484 11.3449 12.8381 16.266
4 7.7794 9.4877 11.1433 13.2767 14.8602 18.467
5 9.2363 11.0705 12.8325 15.0863 16.7496 20.515
6 10.6446 12.5916 14.4494 16.8119 18.5476 22.458
7 12.0170 14.0671 16.0128 18.4753 20.2777 24.322
8 13.3616 15.5073 17.5346 20.0902 21.9550 26.125
9 14.6837 16.9190 19.0228 21.6660 23.5893 27.877
10 15.9871 18.3070 20.4831 23.2093 25.1882 29.588
11 17.2750 19.6751 21.9200 24.7250 26.7569 31.264
12 18.5494 21.0261 23.3367 26.2170 28.2995 32.909
13 19.8119 22.3621 24.7356 27.6883 29.8194 34.528
14 21.0642 23.6848 26.1190 29.1413 31.3193 36.123
15 22.3072 24.9958 27.4884 30.5779 32.8013 37.697
16 23.5418 26.2962 28.8454 31.9999 34.2672 39.252
17 24.7690 27.5871 30.1910 33.4087 35.7185 40.790
18 25.9894 28.8693 31.5264 34.8058 37.1564 42.312
19 27.2036 30.1435 32.8523 36.1908 38.5822 43.820
20 28.4120 31.4104 34.1696 37.5662 39.9968 45.315
21 29.6151 32.6705 35.4789 38.9321 41.4010 46.797
22 30.8133 33.9244 36.7807 40.2894 42.7956 48.268
23 32.0069 35.1725 38.0757 41.6384 44.1813 49.726
24 33.1963 36.4151 39.3641 42.9798 45.5585 51.179
25 34.3816 37.6525 40.6465 44.3141 46.9278 52.620
26 35.5631 38.8852 41.9232 45.6417 48.2899 54.052
27 36.7412 40.1133 43.1944 46.9680 49.6449 55.476
28 37.9159 41.3372 44.4607 48.2782 50.9933 56.892
29 39.0875 42.5569 45.7222 49.5879 52.3356 58.302
30 40.2560 43.7729 46.9792 50.8922 53.6720 59.703
40 51.8050 55.7585 59.3417 63.6907 66.7659 73.402
50 63.1671 67.5048 71.4202 76.1539 79.4900 86.661
60 74.3970 79.0819 83.2976 88.3794 91.9517 99.607
70 85.5271 90.5312 95.0231 100.425 104.215 112.317
80 96.5782 101.879 106.629 112.329 116.321 124.839
90 107.565 113.145 118.136 124.116 128.299 137.208
100 118.498 124.342 129.561 135.807 140.169 149.449

The columns of the table correspond to p-values. In general, we look down the column marked "0.05", because we use 0.05 as the conventional cut-off level of statistical significance. In choosing the .05 level, we are saying that if the probability of a certain result occurring just because of sampling variation is greater than 5%, then we are not willing to assume that the results are real (i.e., that sex and satisfaction are associated in the population from which the sample was drawn).

The cells of the table correspond to chi-square values (such as the 0.7 we computed above). The rows correspond to degrees of freedom. To calculate degrees of freedom for a simple table such as we have, we use the following formula:

df = (R - 1) x (C - 1)

where R is the number of categories in the Satisfaction variable, and C is the number of categories in the Sex variable. In our case, both variables have 2 categories, so the degrees of freedom is 1 x 1 = 1.

Now we look at the first row of the table (corresponding to 1 degree of freedom), and look down the 0.05 column. The value in the table is 3.8414. Comparing that to the 0.7 that we calculated, we see that our value is smaller than the value in the table. This means that the differences between observed and expected were relatively small -- so small, that it could have happened by chance (due to sampling variation) more than 5% of the time.
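That table lookup can be expressed directly in code. A sketch, with a few of the 0.05 critical values copied from the table above:

```python
# critical chi-square values at p = 0.05, copied from the table above
critical_05 = {1: 3.8414, 2: 5.9914, 3: 7.8147}

R, C = 2, 2             # categories in Satisfaction and in Sex
df = (R - 1) * (C - 1)  # degrees of freedom

chi_square = 0.7        # the value computed from our data
significant = chi_square > critical_05[df]
print(df, significant)  # 1 False -> not significant at the 0.05 level
```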

When something could occur by sampling variation more than 5% of the time, we call it "non-significant" and don't trust it, since there is too great a chance (more than 5%) that it occurred solely because of the luck of the draw: a weird sample. Hence, it makes no sense to try to interpret it as a real difference in satisfaction between men and women. In fact, in this case, the 0.7 is smaller than the chi-square value in the "0.100" column as well, indicating that the difference we observed is likely to occur in more than 10% of samples drawn from a population in which there is actually no difference between males and females in satisfaction.

So we conclude that we cannot reject the null hypothesis of independence (no difference between the sexes). In other words, the difference in percentages between males and females is so small that there might not be any difference in the population: it might easily have been a fluke of our sample.

Now suppose the chi-square value that we had computed from our data had been somewhat larger -- say, 7.9. Looking along the first row of the table, we see that it is just larger than the value under the "0.005" column. That means that such a large result would occur by chance in less than one half of one percent of samples. In other words, it is really unlikely to have happened by chance, so in that case we would be willing to believe that there really is a difference between men and women in the population.
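For a 2 x 2 table (1 degree of freedom), you can also compute the exact p-value instead of bracketing it with the table. A sketch using the standard-library math module; note that the erfc identity used here, P(X > x) = erfc(sqrt(x/2)), holds only for 1 degree of freedom:

```python
import math

def chi_square_p_value_df1(x):
    """Right-tail p-value for a chi-square statistic x with 1 degree
    of freedom; the erfc identity below is valid for df = 1 only."""
    return math.erfc(math.sqrt(x / 2))

print(round(chi_square_p_value_df1(0.7), 2))  # 0.4 -> non-significant
print(chi_square_p_value_df1(7.9) < 0.005)    # True -> "highly significant"
```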