Intro to Statistics: Part 5: A Brief Intro to Experiments, Samples, and Statistical Inference

So far we've looked at several examples of random variables where the set of possible outcomes and their associated probabilities can be calculated ahead of time, without actually having to conduct any experiments. Now let's look at a different kind of random variable, one whose true distribution -- i.e. its complete set of outcomes and probabilities -- is not known ahead of time and therefore must be estimated through experimentation.

Let's say our random variable represents the height of a randomly selected person. There's no way to mathematically work out the complete set of outcomes (heights) and their associated probabilities ahead of time. The only way to determine the characteristics of the distribution is to conduct experiments and actually go out there and measure the heights of randomly selected people. 

Now, measuring the height of just one person doesn't really tell us much about our random variable, for it's just a single data point. If we want to accurately estimate the characteristics of a random variable's distribution, we need multiple data points - the more the better. 

 

Sampling from the population

So let's say we measure the heights of 30 people selected randomly from the population. This serves as our sample.  The sample size is N=30.  We can calculate various statistics about this sample, such as its mean and variance. The sample's distribution provides an estimate of the true distribution of our random variable --  i.e. the true distribution of heights across the entire population. 
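As a minimal sketch of what those sample statistics look like in R: the 30 measurements below are simulated with rnorm() purely for illustration (the vector name measured_heights and the mean/sd values are made up -- real data would come from actually measuring people).

# Hypothetical sample of N=30 heights, simulated only for illustration
set.seed(42)
measured_heights <- rnorm(30, mean = 67, sd = 3)

mean(measured_heights)   # sample mean
var(measured_heights)    # sample variance
sd(measured_heights)     # sample standard deviation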

The true distribution of heights is not known. It could be determined by measuring the height of every single person in the population. However, that might be impossible to actually do, if the population size is large. 

So instead we take a sample from the population, calculate statistics on the sample's distribution, and then apply statistical principles to make informed estimates about the characteristics of the true distribution across the entire population. This process is known as statistical inference.

As our sample size gets bigger, the sample's estimate of the true distribution becomes more accurate. This makes sense: as the sample size grows, it approaches the actual population size, and if you could measure the entire population you'd know the true distribution exactly. So the sample size is a key factor when considering how accurately the sample estimates the true distribution.
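To get a feel for this, here's a quick simulation sketch (not part of the original example -- the "population" is simulated with rnorm() and the numbers are made up) showing the sample mean getting closer to the population mean as N grows:

set.seed(1)
population <- rnorm(1e6, mean = 67, sd = 3)   # hypothetical population of heights
mean(population)                              # the "true" mean we're trying to estimate

for (N in c(10, 100, 1000, 10000)) {
    s <- sample(population, N)
    cat("N =", N, "  sample mean =", round(mean(s), 3), "\n")
}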

 

A sample dataset: father-son heights 

So let's walk through an example of taking a sample and computing some summary statistics on it.  For this example we'll use the father-son heights dataset provided by the UsingR library in R.  The father-son heights dataset contains 1078 measurements of fathers' heights along with their sons' heights.  Let's focus on just the fathers' heights, so our sample size is N=1078.

The chart below plots all the observed heights in the sample.  The heights are listed along the y-axis (the x-axis doesn't represent anything here, other than "observation #1", "observation #2", and so on out to "observation #1078").  The sample mean is depicted by the dashed blue line.  The sample mean is an arithmetic mean, and is effectively equal to the expected value of the sample's distribution, since each outcome (height) is weighted equally.  I've included the R code for generating the plot.

library(UsingR)      # provides the father.son dataset
library(ggplot2)
data(father.son)

# shuffle the fathers' heights so the plotting order is random
heights <- sample(father.son$fheight)

ggplot() + 
    geom_point(aes(x=1:1078, y=heights), size=3, shape=21, fill="yellow") + 
    geom_hline(yintercept=mean(heights), linetype="dashed", colour="blue", size=1) +
    ggtitle("Height Observations") + 
    ylab("Height (inches)") + 
    xlab("") + 
    theme(axis.ticks.x = element_blank(), axis.text.x = element_blank())

Notice how the heights are clustered around the mean. The mean is approximately 67.5in. Most observations fall between 65 - 70in.  Fewer observations fall a bit further out, between 62.5 - 65in on the short side and 70 - 72.5in on the tall side. Still fewer fall even further away, around 60in and 75in.  This kind of dispersion is expected.  We'd expect most people to be close to the average height, with a few people a bit taller or shorter, and even fewer people who are extremely tall or short. 
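If you want to check those eyeballed ranges against the sample itself, a couple of quick proportions will do it (these lines reuse the heights vector defined in the code above):

mean(heights)                          # sample mean
mean(heights >= 65 & heights <= 70)    # proportion of observations between 65 and 70 in
mean(heights < 62.5 | heights > 72.5)  # proportion far out in the tails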

 

Plotting the distribution with a histogram

Another way to chart the distribution of heights in the sample (or generally speaking, the distribution of outcomes for a random variable) is to use a histogram.  In a histogram, the heights (outcomes) are listed along the x-axis (not the y-axis like the plot above). The y-axis of the histogram indicates the number of times a given height is observed.  Since height is a continuous variable, we simplify things by "binning" the heights into ranges (or "bins") and counting up the number of observations that fall into each range.  For this example let's use a bin width of 1 inch.

ggplot() + 
    geom_histogram(aes(x=heights), binwidth=1, fill="yellow", colour="black") +
    ggtitle("Frequencies histogram of observed heights") + 
    ylab("Number of observations within each inch of height") + 
    xlab("Heights (bin width = 1 in)") +
    geom_vline(xintercept=mean(heights), linetype="dashed", size=1, colour="blue")

The blue vertical dashed line depicts the sample mean, which as expected cuts right down the middle of the data and balances the distribution.  The sample mean is equivalent to the expected value of the random variable that represents our sample. 
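To see that equivalence directly, note that weighting each of the N observations equally (weight 1/N) and summing gives exactly the ordinary sample mean:

N <- length(heights)        # 1078
sum(heights * (1/N))        # expected value with equal weights on every outcome...
mean(heights)               # ...is identical to the sample mean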

 

Probability Density Histogram

A density histogram is similar to a frequencies histogram (like the one above).  The frequencies histogram charts counts (the number of observations that fall in each bin), whereas the density histogram plots probability densities.  We can convert the above histogram from a frequencies histogram to a density histogram by simply adding y=..density.. to the aes() mapping in our geom_histogram:

ggplot() + 
    geom_histogram(aes(y=..density.., x=heights), 
                   binwidth=1, fill="yellow", colour="black") +
    ggtitle("Density histogram of observed heights") + 
    ylab("Proportion of observations per unit outcome") + 
    xlab("Heights (bin width = 1 in)") +
    geom_vline(xintercept=mean(heights), linetype="dashed", size=1, colour="blue")

Notice that the shape of the histogram didn't change, nor did the scale of the x-axis (the outcomes).  The only thing that changed is the scale of the y-axis. The y-axis now shows the probability density of each bin, instead of the number of observations in each bin (as shown in the previous chart).  The probability density is the proportion of observations that fall in each bin per unit of outcome:

\text{probability density} = \frac{\left(\frac{\#\text{ of observations in bin}}{\text{total }\#\text{ of observations}}\right)}{\text{bin width}}

The numerator gives you the proportion of observations that fall in a given bin, which is essentially the probability of observing an outcome in that bin.  The denominator normalizes that probability per unit of outcome (which is where the "density" comes from). 

For example, the frequencies histogram of heights shows that about 150 observations fall in the tallest bin, from 67-68in.  The total number of observations is 1078.  The bin width is 1 inch (which also happens to be the "unit of outcome", since the units are inches).  So the probability density of this bin is computed as:

\begin{align*} \text{density} &= \frac{\left(\frac{150}{1078}\right)}{1} \\ &= 0.139 \end{align*}

...which corresponds to the height of the bin in the density histogram.
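You can reproduce that calculation directly from the sample.  Note the ~150 count above was read off the chart, and the bin edges here are assumed to fall on whole inches, so the exact numbers may differ slightly from ggplot's default binning:

n_bin <- sum(heights >= 67 & heights < 68)   # observations in the 67-68 in bin
n_tot <- length(heights)                     # 1078
(n_bin / n_tot) / 1                          # density: proportion per 1 in of outcome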

We can compute the probability associated with each bin by multiplying the density of the bin by its bin width.  This in effect cancels out the division we did in the density calculation above, leaving us with just the proportion of outcomes that fall in each bin.

The probability of a bin tells you how likely it would be, if you selected a height at random from the sample, that the height would fall in that bin.  For example, the bin from 67 - 68in has a density of approximately 0.139, which when multiplied by the bin width (1 in), gives the probability for that bin: 0.139.  That is to say, if you were to draw a height at random from this sample, there's a 13.9% chance it would fall between 67 and 68in.  
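In R terms, multiplying the density by the bin width just undoes the division by bin width, so the bin's probability collapses back to the plain proportion of observations in the bin (again assuming bin edges on whole inches):

bin_width <- 1
density_67_68 <- mean(heights >= 67 & heights < 68) / bin_width
density_67_68 * bin_width              # probability for the 67-68 in bin
mean(heights >= 67 & heights < 68)     # same number, computed directly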

 

Adjusting the bin width

We can see the effect of the bin width more clearly by plotting the histograms again, this time using a bin width of 0.5 inches:

ggplot() + 
    geom_histogram(aes(x=heights), binwidth=0.5, fill="yellow", colour="black") +
    ggtitle("Frequencies histogram of observed heights, bin width = 0.5 in") + 
    ylab("Number of observations within each 0.5in interval of height") + 
    xlab("Heights (bin width = 0.5 in)") +
    geom_vline(xintercept=mean(heights), linetype="dashed", size=1, colour="blue")

ggplot() + 
    geom_histogram(aes(y=..density.., x=heights), 
                   binwidth=0.5, fill="yellow", colour="black") +
    ggtitle("Density histogram of observed heights, bin width = 0.5 in") + 
    ylab("Proportion of observations per unit outcome") + 
    xlab("Heights (bin width = 0.5 in)") +
    geom_vline(xintercept=mean(heights), linetype="dashed", size=1, colour="blue")

According to the frequencies histogram (first chart), approximately 80 observations fall in the tallest bin, 67.5 - 68in.  The bin width is 0.5in.  This corresponds to a density of:

\begin{align*} \text{density} &= \frac{\left(\frac{80}{1078}\right)}{0.5} \\ &= 0.148 \end{align*}

... which matches up with the height of the bin in the density histogram (second chart).  The probability for this bin is the density multiplied by the bin width:

\begin{align*} \text{probability} &= 0.148 \times 0.5 \\ &= 0.074 \end{align*}

Note that the probability is equal to the area of the bar (height * width) for each bin.  Also note that the total area of all bins combined equals 1, which is the same as saying the sum of probabilities for all possible outcomes is 1.  Keep this in mind for the next article, when we go over probability density functions. 
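Here's a quick sanity check of that last point, binning the heights into 0.5in bins ourselves (bin edges assumed on whole half-inches, which may not exactly match ggplot's defaults) and summing up the bar areas:

bin_width <- 0.5
edges     <- seq(floor(min(heights)), ceiling(max(heights)), by = bin_width)
counts    <- table(cut(heights, breaks = edges, include.lowest = TRUE))
densities <- (counts / length(heights)) / bin_width
sum(densities * bin_width)     # total area of all bars = 1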

 

Quick preview: Distribution Curves (a.k.a. Probability Density Functions)

You may have noticed the distribution of our sample resembles the shape of the familiar "bell curve". The bell curve represents the distribution that is formally known as the normal distribution.  The bell curve itself is the probability density function for the normal distribution.  We'll talk about probability density functions and the normal distribution later on, but all you need to know for now is that the shape of a normal distribution curve is governed by two factors: the mean and standard deviation of the distribution.  With these two statistics from our sample, we can construct a normal distribution curve (the red line below) and lay it over the actual data.

ggplot() + 
    geom_histogram(aes(y=..density.., x=heights), 
                   binwidth=1, fill="yellow", colour="black") +
    ggtitle("Density histogram of observed heights") + 
    ylab("Proportion of observations per unit outcome") + 
    xlab("Heights (bin width = 1 in)") +
    geom_vline(xintercept=mean(heights), linetype="dashed", size=1, colour="blue") +
    stat_function(aes(x=heights), fun=dnorm, colour="red", size=1,
                  args=list(mean=mean(heights), sd=sd(heights)))

As you can see, it's a close fit.
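One rough way to quantify the fit, using only the ideas we've already covered: compare the observed proportion of heights in a bin with the area under the red curve over that same bin, approximated as the curve's height at the bin midpoint times the bin width (the bin edges below are again my assumption):

m <- mean(heights)
s <- sd(heights)
mean(heights >= 67 & heights < 68)     # observed probability for the 67-68 in bin
dnorm(67.5, mean = m, sd = s) * 1      # approximate area under the normal curve over that bin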

So what does this tell us about the true distribution of heights for the entire population?  How accurate an estimate does our sample provide?  We need to cover a few more topics before we can answer these questions.  Stay tuned....

 

Recap

  1. A random variable is described by the characteristics of its distribution
  2. Each outcome in the distribution has an associated probability of occurring
  3. The sum of the probabilities of all possible outcomes equals 1
  4. The expected value, E[X], of a distribution is the weighted average of all outcomes, where each outcome is weighted by its probability (see the short sketch after this list)
  5. The variance, Var(X), is the "measure of spread" of a distribution. It's calculated by taking the weighted average of the squared differences between each outcome and the expected value.
  6. The standard deviation of a distribution is the square root of its variance
  7. A random variable's distribution is commonly plotted using a histogram of either frequencies (counts) or density (proportions)
  8. The probability of a bin in a density histogram is equal to its density multiplied by the bin width - which is the same as computing the area of the bin's bar (height * width).
  9. A sample is a subset of a population. Statistical methods and principles are applied to samples in order to make inferences about the population as a whole
  10. The normal distribution, aka the "bell curve", is a common distribution pattern for random variables found in nature (such as people's heights). Its shape is determined by two factors: the mean and variance of the distribution data.
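As a small refresher on items 4-6, here's what those weighted averages look like in R for a simple discrete distribution -- a fair six-sided die (my example, not from the earlier articles):

outcomes <- 1:6
probs    <- rep(1/6, 6)                   # each outcome equally likely; probabilities sum to 1

EX   <- sum(outcomes * probs)             # expected value: weighted average of outcomes
VarX <- sum((outcomes - EX)^2 * probs)    # variance: weighted average of squared differences
sdX  <- sqrt(VarX)                        # standard deviation

c(EX = EX, VarX = VarX, sd = sdX)         # 3.5, 2.9167, 1.7078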
 
