Intro to Statistics: Part 11: Statistical Significance and Null Hypothesis Testing

Now that we've covered topics like sampling distributions and the Central Limit Theorem, we can take on the subject of statistical significance. An experimental result is said to be statistically significant if the likelihood of observing it by mere random chance (e.g. due to sampling error) is below some threshold (say, below p=0.05). By significant we mean the result is unlikely to be observed randomly; therefore it probably reflects a real (not random) effect in the data under study.

As a simple example, suppose you want to test whether a six-sided die is "fair". One way to do that is to roll it a bunch of times, record the results, then take the sample mean. If the die is fair, then the sample mean should be in the vicinity of the expected value of a six-sided die: 3.5. Suppose the sample mean turns out to be 5.1. There are two possible explanations for this skewed result:

1. The die is fair, and the skewed result is due merely to random chance -- the sample just happened to have lots of high rolls, for no reason other than sampling error
2. The die is not fair; it is biased toward high numbers

Statistical significance testing attempts to determine which explanation is more likely. It does this by assessing the likelihood of observing such a skewed sample, given the assumption that the skew was random. In other words: assuming the die is fair, what are the chances that our sample would produce such a skewed result?
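To make that question concrete, here is a minimal simulation sketch. The number of rolls per experiment (20) and the number of simulated experiments (10,000) are assumptions picked for illustration; the example above doesn't specify them.

```r
# Simulate many experiments with a fair die and see how often the
# sample mean comes out as high as 5.1 purely by chance.
set.seed(42)              # arbitrary seed, for reproducibility

n_rolls <- 20             # assumed number of rolls per experiment
n_experiments <- 10000    # assumed number of simulated experiments

# Each experiment: roll a fair die n_rolls times and record the sample mean
sample_means <- replicate(
    n_experiments,
    mean(sample(1:6, size = n_rolls, replace = TRUE))
)

# Fraction of fair-die experiments with a sample mean of 5.1 or higher
p_sim <- mean(sample_means >= 5.1)
p_sim
```

If that fraction is close to zero, a fair die would almost never produce such a skewed sample, which points toward the second explanation (a biased die).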

Let's consider another example. Suppose somebody went around the world collecting data about people's heights and concluded that the average height across the world is 67 inches. You have a theory that Norwegian people are, on average, taller than the global average. How would you go about testing your theory?

Well, you could try to measure the height of every single Norwegian, but that would be impractical. So instead you collect a sample of Norwegians and measure their heights. You observe that the average height of the Norwegians in your sample is 68.5in. (I'm just making up these numbers btw).

This result may lead you to conclude that Norwegians are in fact taller than the global average. But like the fair die example, there are two possible explanations:

1. Norwegians are NOT taller than the global average, and the result you observed was skewed by sampling error
2. Norwegians are in fact taller than the global average

As #1 suggests, it's possible that your sample isn't representative of the overall Norwegian population. You might have randomly sampled a particularly tall group of Norwegians. So how do you decide which explanation is more likely? This is the question that statistical significance testing is designed to answer.

Testing against the null hypothesis

Statistical significance testing begins with a null hypothesis. A null hypothesis makes the claim that there ought to be no difference ("null") between the sample data and the population data. In other words, the sample result ought to be consistent with the characteristics of the population.

In the fair die example, the null hypothesis could be stated as: "This die is no different than a fair die (the 'null' assumption). Therefore the sample mean ought to be consistent with the expected value of a fair die, 3.5."

In the Norwegian heights example, the null hypothesis could be stated thusly: "Norwegians are no taller than the global average (the 'null' assumption). Therefore the Norwegian sample mean should be consistent with the global average height, 67in."

So we start off with a null hypothesis, which gives us a baseline for what the sample data should look like, assuming no effect (i.e. no difference from the population). We then compare our actual sample result against it, to assess how consistent the sample is with the assumption. If the sample result is not consistent with the assumption, then again we have two possibilities:

1. The difference is due to random chance; i.e. the effect is random, not real, and the null hypothesis is correct (or at the very least, cannot be rejected)
2. The difference is statistically significant; i.e. the effect is real, not random, and the null hypothesis is incorrect (or, more precisely, can be rejected)

Statistical significance testing helps us decide which explanation is more likely. It does this by estimating a probability for statement #1. The probability tells us how likely it is, assuming the null hypothesis is indeed correct, that we would observe such an effect in our sample data, due merely to sampling error. We estimate this probability by analyzing the sample in the context of its sampling distribution.

Recall that the sampling distribution of the mean is the distribution of sample means from multiple samples drawn from the underlying population. In the Norwegian heights example, we have a single sample mean, 68.5in, and an assumption (the null hypothesis) that the true mean of the underlying Norwegian population is 67in. Given this assumption, along with what we know about the Central Limit Theorem and sampling distributions, we can estimate the sampling distribution of the mean for the Norwegian population. Then we can locate where the sample mean falls within the sampling distribution. This gives us the probability of observing such a skewed sample mean, given the assumptions we've made about the true population (namely that the true population mean is 67in).

Estimating the sampling distribution of the mean

The sampling distribution of the mean is the distribution of sample means from multiple samples drawn from the underlying population, where each sample contributes a single outcome (its mean) to the sampling distribution. In this example, we don't have multiple samples from which to compute means and build a sampling distribution. But we can use the Central Limit Theorem to estimate the sampling distribution. According to the Central Limit Theorem, all we need to know to construct a sampling distribution is:

the sample size
the true mean of the population (the expected value of the sampling distribution)
the true variance of the population (which when divided by the sample size gives the variance of the sampling distribution)

We know the sample size, since that's a given. For this example let's say the sample size is N = 30. However, we don't know the true mean or true variance of the population, since we haven't measured every single Norwegian in the population. So we have to either assume or estimate those values. Under the null hypothesis, we assume the true population mean is 67in. However, we haven't made any assumptions about the population variance, so we must estimate that value.

The best estimate we have is from our sample data, so let's use that. Let's say the variance we measured in our Norwegian sample is Var(X) = 9in^2 (the units are inches squared). This gives us a sample standard deviation of 3in (square root of the variance). We'll use these as our estimates of the true variance and standard deviation of the population.

The variance of the sampling distribution is equal to the true population variance divided by the sample size: 9/30 = 0.3. The square root of the variance is the standard deviation: sqrt(0.3) = 0.55. This is the standard deviation of the sampling distribution, which is commonly referred to as the standard error of the mean.

At this point we can construct an estimate of the sampling distribution of the mean, based on the given sample size (N = 30), the assumed population mean (from the null hypothesis - 67in), and the estimated population variance (estimated from the sample - 9):

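library(ggplot2)

N <- 30
mean_height <- 67                  # population mean assumed under the null hypothesis
var_height <- 9                    # population variance, estimated from the sample
se_mean <- sqrt( var_height / N )  # standard error of the mean

# Plot the normal density over the range 63.5in to 70.5in
# (linewidth requires ggplot2 >= 3.4.0; older versions use size instead)
ggplot(data.frame(x=c(63.5,70.5)), aes(x=x)) + 
    stat_function(fun=dnorm, 
                  args=list(mean=mean_height, sd=se_mean),
                  linewidth=1, 
                  colour="blue") +
    ggtitle("Sampling distribution of the mean\n(sample size N=30)") + 
    geom_hline(yintercept=0, colour="darkgray") +
    geom_vline(xintercept=mean_height, linetype="dashed", colour="red", linewidth=1) +
    ylab("Probability density") + 
    xlab("Sample means")  +
    scale_x_continuous(breaks=64:70, labels=64:70)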

Recall that the sampling distribution of the mean is normally distributed and centered around the true mean of the underlying population. Under the null hypothesis, we're assuming that the true mean is 67in (depicted by the vertical red line).
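As a sanity check on this estimate, we can simulate the sampling distribution directly. This is only a sketch under an extra assumption that isn't part of the example: that individual heights are normally distributed with mean 67in and standard deviation 3in. The number of simulated samples (10,000) is also arbitrary.

```r
# Compare the Central Limit Theorem estimate of the standard error
# against a brute-force simulation of many samples.
set.seed(1)     # arbitrary seed, for reproducibility

N <- 30
pop_mean <- 67  # mean assumed under the null hypothesis
pop_sd <- 3     # sqrt(9), estimated from the sample

# Assumption: heights in the population are normally distributed
sample_means <- replicate(10000, mean(rnorm(N, mean = pop_mean, sd = pop_sd)))

se_theory <- pop_sd / sqrt(N)   # same as sqrt(9/30)
se_sim <- sd(sample_means)      # spread of the simulated sample means

c(theory = se_theory, simulated = se_sim)
```

The two numbers should land close to each other, and the simulated sample means should cluster around 67in.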

So where does our observed sample mean fall in this distribution? The chart below is the same as the one above, with the addition of the vertical orange line, which shows where the sample mean falls in the sampling distribution.

As you can see, our sample mean falls pretty far out along the upper tail of the sampling distribution. So what does this tell us? Remember what the sampling distribution represents: it is the distribution of sample means for all samples of Norwegians' heights of sample size N=30. If we could take multiple samples of Norwegians, measure their heights, and compute sample means, the sample means ought to be distributed as above, given the assumptions we're making under the null hypothesis, namely that the true population mean is 67in. So most sample means should fall around 67in, with fewer around 66in or 68in, and fewer still around 65in or 69in.

Our sample mean is 68.5in, denoted by the orange line. We can already tell that our sample mean, when viewed as a random outcome drawn from the normally distributed sampling distribution, has a low probability of being observed. How low? We can use our estimated sampling distribution to give us an estimated probability.

When testing for statistical significance, we want to compute the probability of observing a sample mean as extreme as the one we observed. The probability of observing an outcome as extreme as 68.5 is equivalent to the probability of observing an outcome of 68.5 or greater. This is given by the area under the curve for the range >= 68.5, which is shaded in yellow below.

As you can see, it's a tiny range, so we should expect a tiny probability. We can use R's pnorm function to calculate the probability. Remember that pnorm computes probabilities across ranges. If you give it a single outcome value, then it returns the probability of observing the given outcome or lower. The probability of observing the given outcome or greater is simply 1 minus the former probability.

1 - pnorm(68.5, mean=67, sd=sqrt(9/30))
## [1] 0.003

So the probability of observing an outcome of 68.5 or greater is 0.003. In other words, there's a 0.3% chance of observing a sample mean of 68.5in or greater for a sample of size N=30, given the assumption that the true mean of the population is 67in. This probability is known as the p-value for our significance test, p=0.003.
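The same upper-tail probability can be computed in a couple of equivalent ways. The sketch below uses pnorm's lower.tail argument instead of the "1 minus" step, and then repeats the calculation via the z-score of the sample mean. The variable names are just illustrative.

```r
N <- 30
null_mean <- 67         # population mean under the null hypothesis
sample_mean <- 68.5     # observed sample mean
se_mean <- sqrt(9 / N)  # standard error of the mean

# Upper-tail probability directly, without the "1 -" step
p_upper <- pnorm(sample_mean, mean = null_mean, sd = se_mean, lower.tail = FALSE)

# Same thing via the z-score: how many standard errors above the null mean?
z <- (sample_mean - null_mean) / se_mean
p_from_z <- pnorm(z, lower.tail = FALSE)

round(c(z = z, p_upper = p_upper, p_from_z = p_from_z), 3)
```

The sample mean sits roughly 2.7 standard errors above the null-hypothesis mean, which is why the tail area is so small.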

So, is the result statistically significant? Well, there's no hard and fast rule for determining significance. Normally you'd compare the p-value against some threshold, say p <= 0.05. This threshold is known as the significance level. If the p-value is less than the significance level, the result is significant (and the null hypothesis should be rejected). If the p-value is greater than the significance level, then it's not significant (and the null hypothesis cannot be rejected). p <= 0.05 is a commonly chosen significance level, although some studies may choose a different level, depending on the nature of the experiment. A significance level of 0.05 is basically allotting a 5% chance of making a "false positive" - i.e. detecting an effect that's really not there, and incorrectly rejecting the null hypothesis when it is in fact true.
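To see what that 5% "false positive" allotment means in practice, here is a simulation sketch. It assumes a world where the null hypothesis really is true, with heights normally distributed around 67in with a known standard deviation of 3in. Both of those are simplifying assumptions for illustration, as is the number of simulated samples.

```r
set.seed(7)      # arbitrary seed, for reproducibility

N <- 30
null_mean <- 67
pop_sd <- 3      # assumed known, to keep the sketch simple
alpha <- 0.05    # significance level
se_mean <- pop_sd / sqrt(N)

# Draw many samples from a population where the null hypothesis is true,
# and compute the one-tailed p-value for each sample mean
p_values <- replicate(10000, {
    m <- mean(rnorm(N, mean = null_mean, sd = pop_sd))
    pnorm(m, mean = null_mean, sd = se_mean, lower.tail = FALSE)
})

# Fraction of samples that would (wrongly) be declared significant
false_positive_rate <- mean(p_values <= alpha)
false_positive_rate
```

The fraction should come out close to 0.05: even when there is no real effect, about 1 in 20 samples crosses the threshold by chance alone.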

In general, the smaller the p-value, the more significant the result, since the p-value tells you how likely it is to observe a result at least that extreme under the assumption of the null hypothesis. If that likelihood is very low, then the observed data are hard to reconcile with the null hypothesis, and it should be rejected.

For our Norwegian heights example, p = 0.003, which is (much) less than 0.05, so we would reject the null hypothesis and conclude in favor of the alternative hypothesis -- that Norwegians are, in fact, taller than the global average. Of course, despite an exceedingly small p-value, we cannot absolutely rule out sampling error. All we can say is that it is very unlikely that sampling error would explain this result.

One-tailed vs. two-tailed significance tests

The above is an example of a one-tailed significance test, in that we only considered the probability contained in one tail (the upper tail) of the distribution. In a two-tailed test, we'd also consider the probability of observing such an extreme outcome in the opposite direction (the lower tail). The two-tailed probability is the region shaded in yellow below:

We could use pnorm to calculate the total probability of the two shaded regions. But given the symmetry of the normal distribution, we know that the probability for the two-tailed test is simply twice the probability of the one-tailed test.

pnorm(65.5, mean=67, sd=sqrt(9/30)) 
## [1] 0.003

1 - pnorm(68.5, mean=67, sd=sqrt(9/30))
## [1] 0.003

2 * pnorm(65.5, mean=67, sd=sqrt(9/30)) 
## [1] 0.006

Whether you use the one-tailed or two-tailed test depends on the nature of your experiment. If you only care about whether or not the mean in question is different from the null-hypothesis mean, then you'd use the two-tailed test, since the mean could be different in either direction (lower or higher). If, on the other hand, you expect the mean in question to be different in a specific direction (for example you're running the experiment under the assumption that Norwegians are taller), then you might be interested in only one tail, since you don't expect the result to be skewed in the other direction.
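Both kinds of test can be wrapped in one small function. Note that z_test_p is a hypothetical helper written for this illustration, not something built into R; it assumes the sampling distribution is normal and that you already have the standard error.

```r
# Hypothetical helper: p-value for a sample mean under a normal
# sampling distribution, for an upper-, lower-, or two-tailed test.
z_test_p <- function(sample_mean, null_mean, se, tail = c("two", "upper", "lower")) {
    tail <- match.arg(tail)   # defaults to "two"
    z <- (sample_mean - null_mean) / se
    switch(tail,
        upper = pnorm(z, lower.tail = FALSE),  # P(outcome >= observed)
        lower = pnorm(z),                      # P(outcome <= observed)
        two   = 2 * pnorm(-abs(z))             # both tails, by symmetry
    )
}

se_mean <- sqrt(9 / 30)
p_upper <- z_test_p(68.5, 67, se_mean, tail = "upper")
p_two   <- z_test_p(68.5, 67, se_mean, tail = "two")

round(c(upper = p_upper, two_tailed = p_two), 3)
```

The two-tailed branch uses -abs(z) so that it works whether the sample mean falls above or below the null-hypothesis mean.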

The effect of sample size on significance

Recall that the sample size influences the shape of the sampling distribution; therefore it also influences the p-value of the statistical significance test. To illustrate the effect, let's use the same data from the example above, but this time we'll decrease the sample size to N=10.

As you can see, with a smaller sample size, the probability of observing such an extreme result goes up, since smaller samples are more likely to be skewed away from the true mean of the population (or, thinking about it the other way, larger samples are more likely to closely approximate the true mean).

Again we can use pnorm to compute the probability:

1 - pnorm(68.5, mean=67, sd=sqrt(9/10))
## [1] 0.057

So when the sample size drops from N=30 to N=10, the p-value rises from p=0.003 to p=0.057. Notice the p-value has risen above the significance level of 0.05. This means the sample result is no longer statistically significant (at that level); therefore we cannot reject the null hypothesis.
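To see the trend more broadly, the same calculation can be repeated across several sample sizes while holding the sample mean (68.5in) and the estimated variance (9) fixed. The particular sample sizes below are arbitrary choices for illustration.

```r
null_mean <- 67
sample_mean <- 68.5
sample_var <- 9
sample_sizes <- c(5, 10, 20, 30, 50)   # illustrative sample sizes

# Standard error shrinks as the sample size grows
se <- sqrt(sample_var / sample_sizes)

# One-tailed p-value for each sample size (pnorm is vectorized over sd)
p_values <- pnorm(sample_mean, mean = null_mean, sd = se, lower.tail = FALSE)

data.frame(N = sample_sizes, se = round(se, 3), p_value = round(p_values, 4))
```

The p-value falls steadily as N grows: the same 1.5in difference is unremarkable in a tiny sample but very hard to attribute to chance in a larger one.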

Recap

A random variable is described by the characteristics of its distribution
The expected value, E[X], of a distribution is the weighted average of all outcomes. It's the center of mass of the distribution.
The variance, Var(X), is the "measure of spread" of a distribution.
The standard deviation of a distribution is the square root of its variance
A probability density function for continuous random variables takes an outcome value as input and returns the probability density for the given outcome
The probability of observing an outcome within a given range can be determined by computing the area under the curve of the probability density function within the given range.
A probability mass function for discrete random variables takes an outcome value as input and returns the actual probability for the given outcome
A sample is a subset of a population. Statistical methods and principles are applied to the sample's distribution in order to make inferences about the true distribution -- i.e. the distribution across the population as a whole
A summary statistic is a value that summarizes sample data, e.g. the mean or the variance
A sampling distribution is the distribution of a summary statistic (e.g. the mean) calculated from multiple samples drawn from an underlying random variable distribution
The Central Limit Theorem states that, regardless of the underlying distribution, the sampling distribution of the mean is approximately normally distributed for sufficiently large samples, with mean equal to the underlying population mean and variance equal to the underlying population variance divided by the sample size
An outcome's z-score is calculated by taking the difference between the outcome and the mean, then dividing by the standard deviation. A z-score is in units of standard deviations.
A statistical significance test gives the probability of observing an outcome at least as extreme as the one observed, under the assumption of a null hypothesis. The probability is known as the p-value for the test. A p-value <= 0.05 is typically considered significant.
