Intro to Statistics: Part 16: t-test Significance Testing Between Two Samples

In the previous article we used the t-distribution to conduct a significance test in which we compared our sample mean against a null-hypothesized population mean. This is known as a one-sample t-test: a single sample is compared against some assumed value for the test statistic (in that example, the test statistic was the mean).

In addition to testing against a null-hypothesized value, we can also use the t-distribution to conduct significance testing between two samples.

For example, pharmaceutical studies often involve a test group and a control group, where the test group is given some new experimental drug while the control group is given a placebo.  The two groups are then compared against each other to determine the effectiveness of the drug.  Statistical significance testing is applied to ascertain how likely it is (the p-value) that a difference as large as the one observed between the two groups would arise from random chance alone, as opposed to being caused by a real effect of the drug under study.

 

Types of samples: paired vs. independent

Significance testing between samples depends in part on the nature of the samples.  Samples fall into one of two categories:

  1. Paired samples
  2. Independent samples

Paired samples are two samples created by making two measurements against the same set of subjects.  For example, a study for a medication that reduces blood pressure might measure each subject's blood pressure before and after administering the drug.  This results in two samples, where each measurement in the 'before' sample can be paired up with a measurement in the 'after' sample.  

Independent samples are un-paired -- i.e. the samples contain different subjects. A test group and a control group are an example of independent samples.

Independent samples are further broken down into two sub-categories:

  1. Equal variances - the two samples have (roughly) equal variances
  2. Un-equal variances - the two samples have different variances

The sample types (paired vs independent, equal vs un-equal variances) affect the calculation of the t-statistic.  The t-statistic (aka t-score) is what we plot in the t-distribution in order to determine the p-value for the significance test.  The sample types also affect the choice of degrees of freedom. The degrees of freedom determine which t-distribution curve (from the t-distribution family) we should use when plotting the t-statistic.

 

Calculating the t-statistic and degrees of freedom

We already covered a one-sample t-test in the previous article, but just for completeness I've included its t-statistic and degrees of freedom calculations here:
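In symbols, with \bar{x} the sample mean, s the sample standard deviation, n the sample size, and \mu_0 the null-hypothesized mean:

$$ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad df = n - 1 $$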

 

Paired samples

For a two-sample t-test using paired samples, you first compute the difference between each pair of measurements across the two samples.  The resulting set of diffs becomes a sample itself, and you then proceed as you would for a one-sample t-test.  Typically the null hypothesis assumes that the difference between the two sample means is 0; if that were the case, then the mean of the diffs would also be 0. So the t-statistic is computed by comparing the mean of the diffs against 0:
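In symbols, with \bar{d} the mean of the diffs, s_d their standard deviation, and n the number of pairs:

$$ t = \frac{\bar{d} - 0}{s_d / \sqrt{n}}, \qquad df = n - 1 $$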

 

Independent samples, equal variances

For a two-sample t-test with independent samples that have equal variances, you first must calculate the pooled variance across the two samples.  From there, you can calculate the t-statistic and degrees of freedom:
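In symbols, with sample means \bar{x}_1 and \bar{x}_2, sample variances s_1^2 and s_2^2, sample sizes n_1 and n_2, and pooled variance s_p^2 (these are the same calculations the R example further down carries out):

$$ s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2} $$

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad df = n_1 + n_2 - 2 $$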

Note that, if the sample sizes are the same, then the pooled variance is simply the average of the two variances.
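To see why, set n_1 = n_2 = n in the pooled variance formula:

$$ s_p^2 = \frac{(n - 1)\,s_1^2 + (n - 1)\,s_2^2}{2n - 2} = \frac{s_1^2 + s_2^2}{2} $$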

 

Independent samples, un-equal variances

And finally, for a two-sample t-test with independent samples that have un-equal variances, the t-statistic and degrees of freedom are calculated as follows:
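In symbols, using the same notation as above (the degrees of freedom formula is the Welch-Satterthwaite approximation, which is what the R example further down computes):

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}, \qquad df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}} $$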

If you're interested in learning how these formulas are derived, check out Wikipedia.

 

Example(s) of two-sample t-tests

For these examples we'll need some sample data. Let's create some randomly.  We'll create two samples, both with a sample size of 30, using the rnorm function.  For one of the samples we'll add 1 to each value.  This should give us a real, measurable difference between the two samples.  

set.seed(2)
n1 <- 30
n2 <- 30
samp.1 <- rnorm(n1)
samp.2 <- rnorm(n2) + 1

The significance test tells us how likely it is that a difference as large as the one we observe would arise from random chance alone.  In this example, we know for a fact that there's a real difference, since we created that difference ourselves by shifting up the second sample by a full standard deviation (the standard deviation of the standard normal distribution is 1).  So we expect to see a statistically significant difference between the two samples.
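As a quick sanity check, we can look at the observed difference between the two sample means (the difference we built in was 1, but the observed value will be somewhat off due to sampling noise):

mean(samp.2) - mean(samp.1)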

For the purpose of comparing and contrasting the different types of t-tests, we'll run the same sample data through all three t-tests described above: (1) paired samples, (2) independent samples with equal variances, and (3) independent samples with un-equal variances.  Normally you would not use all three t-tests; you'd use the one t-test most appropriate for your sample data.

 

Example: Paired samples

First we'll treat the data like paired samples.

samp.diffs <- samp.2 - samp.1
t.score <- mean(samp.diffs) / sqrt(var(samp.diffs)/n1)
## 2.293

df = n1 - 1
## 29

p.value <- 2 * (1 - pt(t.score, df=df) )
## 0.0293

Note that the p-value is calculated according to a two-tailed significance test.  The chart below illustrates where that p-value comes from.  It's the area of the yellow-shaded regions.

The vertical orange lines indicate the t-score and its mirror image in the opposite direction.  This is an example of a two-tailed significance test.  Two-tailed significance tests allow the difference between samples to vary in either direction.  If we were only interested in testing the difference in one direction, then we'd only consider the one tail going in that direction.  The p-value for the one-tailed test is always half the p-value of the two-tailed test (because the t-distribution is symmetric).
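If you'd like to draw a chart like that yourself, here's a rough sketch in base R (the styling is only an approximation); it reuses the t.score and df computed above for the paired test:

x <- seq(-4, 4, length.out = 400)
plot(x, dt(x, df = df), type = "l", xlab = "t", ylab = "density")

# shade the two tails beyond +/- t.score; their combined area is the p-value
shade.tail <- function(tail.x) {
  polygon(c(tail.x, rev(tail.x)),
          c(dt(tail.x, df = df), rep(0, length(tail.x))),
          col = "yellow", border = NA)
}
shade.tail(x[x >= t.score])
shade.tail(x[x <= -t.score])

# mark the t-score and its mirror image with vertical lines
abline(v = c(-t.score, t.score), col = "orange")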

 

Example: Independent samples, equal variances

Now let's treat the samples as if they were independent samples with equal variances.

var.pooled <- ( (n1-1) * var(samp.1) + (n2-1) * var(samp.2) ) / 
              (n1 + n2 - 2)
t.score <- (mean(samp.2) - mean(samp.1)) / 
           ( sqrt(var.pooled) * sqrt( 1/n1 + 1/n2 ) )
## 2.457141

df = n1 + n2 - 2
## 58

p.value <- 2 * (1 - pt(t.score, df=df) )
## 0.01701674
 

Example: Independent samples, un-equal variances

And finally, let's treat the samples as if they were independent samples with un-equal variances.

t.score <- (mean(samp.2) - mean(samp.1)) /  
           sqrt(var(samp.1)/n1 + var(samp.2)/n2 ) 
## 2.457141

df <- (var(samp.1)/n1 + var(samp.2)/n2)^2 / 
      ( (var(samp.1)/n1)^2 / (n1-1) + (var(samp.2)/n2)^2 / (n2-1) )
## 57.96975

p.value <- 2 * (1 - pt(t.score, df=df) )
## 0.01701837

Note that the t.score for the equal-variances and un-equal-variances tests is exactly the same.  This is always true when the sample sizes are equal (n1 == n2).  The p-values, however, are slightly different, due to the slight difference in degrees of freedom between the equal-variances and un-equal-variances tests.
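To see why the t-scores coincide, note that when n_1 = n_2 = n the pooled variance is just the average of the two sample variances, so the two denominators reduce to the same expression:

$$ s_p \sqrt{\frac{1}{n} + \frac{1}{n}} = \sqrt{\frac{s_1^2 + s_2^2}{2} \cdot \frac{2}{n}} = \sqrt{\frac{s_1^2}{n} + \frac{s_2^2}{n}} $$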

R provides a convenient function, t.test, that takes two samples and generates the t-statistic, degrees of freedom, and p-value:

t.test(samp.1, samp.2, paired=T)
## t = -2.2935, df = 29, p-value = 0.02926

t.test(samp.1, samp.2, paired=F, var.equal=T)
## t = -2.4571, df = 58, p-value = 0.01702

t.test(samp.1, samp.2, paired=F, var.equal=F)
## t = -2.4571, df = 57.97, p-value = 0.01702

As you can see, the t.test function produces the same results we computed manually above; the only difference is that the t-scores are negated.  That's simply because t.test subtracts the second sample from the first (samp.1 - samp.2), whereas we subtracted in the opposite order.  The sign doesn't really matter if you're conducting a two-tailed test and therefore only care about the absolute value of the difference between samples.
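One more convenience: t.test returns an object whose individual components you can extract, which is handy if you need, say, the p-value programmatically. A small sketch:

res <- t.test(samp.1, samp.2, paired = FALSE, var.equal = FALSE)
res$statistic   # the t-score
res$parameter   # the degrees of freedom
res$p.value     # the p-value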

 

Recap

  1. A random variable is described by the characteristics of its distribution.
  2. The expected value, E[X], of a distribution is the weighted average of all outcomes.  It's the center of mass of the distribution.
  3. The variance, Var(X), is the "measure of spread" of a distribution.
  4. The standard deviation of a distribution is the square root of its variance.
  5. The probability density function (for continuous random variables) takes an outcome value as input and returns the probability density for that outcome.
  6. The probability of observing an outcome within a given range can be determined by computing the area under the curve of the probability density function within the given range.
  7. The probability mass function (for discrete random variables) takes an outcome value as input and returns the actual probability for that outcome.
  8. A sample is a subset of a population. Statistical methods and principles are applied to the sample's distribution in order to make inferences about the true distribution -- i.e. the distribution across the population as a whole.
  9. The sample variance is a biased estimate of the true population variance.  The bias can be adjusted for by dividing the sum of squared diffs by n-1 instead of n, where n is the sample size.
  10. A summary statistic is a value that summarizes sample data, e.g. the mean or the variance
  11. A sampling distribution is the distribution of a summary statistic (e.g. the mean) calculated from multiple samples drawn from an underlying random variable distribution.
  12. The Central Limit Theorem states that, regardless of the underlying distribution, the sampling distribution of the mean is approximately normally distributed (the approximation improves as the sample size grows), with its mean equal to the underlying population mean and its variance equal to the underlying population variance divided by the sample size.
  13. An outcome's z-score is calculated by taking the difference between the outcome and the mean, then dividing by the standard deviation.  A z-score is in units of standard deviations.
  14. A statistical significance test gives the probability of observing a given outcome under the assumption of a null hypothesis.  That probability is known as the p-value for the test.  A p-value <= 0.05 is typically considered significant.
  15. The t-distribution is used for statistical significance tests when the sample size is small and/or when the true population variance is unknown and therefore estimated from the sample.
  16. The sampling distribution of the variance (for a normally distributed population) follows a scaled chi-squared distribution with degrees of freedom equal to n-1, where n is the sample size.
 
