Intro to Statistics: Part 10: Z-scores, Standardizing, and the Standard Normal Distribution

Before going any further in this series, we should quickly go over the topic of z-scoring.  Z-scoring is a very simple scaling operation done to the outcomes of a random variable distribution, for the purpose of standardizing the distribution.   A z-score is calculated by:

\begin{align*}z & = \frac{x-\mu}{\sigma}\\ \\ \text{where,}\\ x & \ \text{is the outcome}\\ \mu & \ \text{is the mean}\\ \sigma & \ \text{is the standard deviation}\end{align*}

The z-score for an outcome is the distance between the outcome and the mean, divided by the standard deviation.  It tells you how far an outcome is from the mean, relative to the standard deviation of the distribution.  If an outcome is equal to the mean, then its z-score is 0.  If an outcome is exactly one standard deviation away from the mean, then its z-score is either 1 or -1, depending on whether it's above (1) or below (-1) the mean. You could say that the units of a z-score are standard deviations.

Note that nothing else about the random variable or its distribution changes.  All we've done is translate the outcomes into this "z-space" or "z-scale".  Z-scores are useful because they standardize the distribution.  Standardized distributions are easier to compare to each other, since the specific units of the random variable(s) drop out and you're left with the common units of standard deviations.
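To make that concrete, here's a quick sketch in R (the numbers are made up purely for illustration): z-scoring a vector of outcomes produces a new vector whose mean is 0 and whose standard deviation is 1, while the shape of the distribution is unchanged.  Base R's scale() function does the same centering and rescaling in one step.

```r
# z-score a made-up vector of outcomes (numbers are just for illustration)
x <- c(12, 15, 9, 18, 21, 14, 10, 17)

z <- (x - mean(x)) / sd(x)

round(mean(z), 10)   # 0 -- z-scores always have mean 0
round(sd(z), 10)     # 1 -- ...and standard deviation 1
```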

 

A quick z-score example

For example, imagine you had two random variables, one representing people's heights, the other representing their weights.  The units of the two are different (inches vs kilograms), which makes it a little awkward to compare the two directly.  By z-scoring the outcomes, we convert them all to the common units of standard deviations, making the comparison a little easier to understand.  

Let's consider a concrete example.  Imagine you measure the heights and weights of a bunch of people and determine that the mean height is μ=67in with a standard deviation of σ=4in.  The mean weight is calculated to be μ=70kg with a standard deviation of σ=10kg (I'm just making up these numbers btw).

Suppose you select a single person at random from your sample whose height=74in and weight=89kg.  What does this tell you?  Well, on the face of it, it just gives you their height and weight, but one thing you might be interested in is how much this person varies from the average height and average weight (if, for instance, you're looking for a correlation between the two variables).  Z-scoring gives us a useful way to compare the two outcomes.  The z-score for height is calculated in the first figure, the z-score for weight in the second figure:

\begin{align*}z_h & = \frac{x-\mu_h}{\sigma_h}\\[8pt]z_h & = \frac{74-67}{4}\\[8pt]z_h & = 1.75\end{align*}
\begin{align*}z_w & = \frac{x-\mu_w}{\sigma_w}\\[8pt]z_w & = \frac{89-70}{10}\\[8pt]z_w & = 1.9\end{align*}

The z-scores give us an easier way to compare the outcomes of the two variables. This particular person has a height that is 1.75 standard deviations above the mean height, and a weight that is 1.9 standard deviations above the mean weight.  The two z-scores can be compared directly, since they're both in the same units (of standard deviations).  Note that they're roughly equal, which suggests that the two random variables, height and weight, might be correlated with each other (i.e people with greater-than-average height might also have greater-than-average weight).
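The same calculation can be done in R, using the made-up means and standard deviations from above:

```r
# z-score the sampled person's height and weight
height <- 74; mu.h <- 67; sd.h <- 4
weight <- 89; mu.w <- 70; sd.w <- 10

z.h <- (height - mu.h) / sd.h    # (74-67)/4
z.w <- (weight - mu.w) / sd.w    # (89-70)/10

z.h   # 1.75
z.w   # 1.9
```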

 

The standard normal distribution

The standard normal distribution is a cornerstone of statistical analysis.  We'll encounter it from time to time as we continue through this series.  It's very simple: the standard normal distribution is a normal distribution with mean=0 and standard deviation=1 (note that the variance is also 1, since variance is equal to the standard deviation squared).

\begin{align*}\text{generic normal distribution} & = \operatorname{N}(\mu,\sigma^2)\\[8pt]\text{standard normal distribution} & = \operatorname{N}(0,1)\end{align*}
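In fact, z-scoring is exactly how you convert a generic normal distribution into the standard normal distribution: if X ~ N(μ, σ²), then (X-μ)/σ ~ N(0,1).  A quick simulation sketch (using the made-up height parameters from earlier):

```r
# simulate draws from a generic normal distribution, N(mean=67, sd=4),
# then standardize them; the result behaves like N(0,1)
set.seed(1)                  # for reproducibility
x <- rnorm(100000, mean=67, sd=4)

z <- (x - 67) / 4            # standardize using the true mean and sd

round(mean(z), 2)   # ~0
round(sd(z), 2)     # ~1
```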

Here's the probability density function for the standard normal distribution:

library(ggplot2)

x <- seq(-5, 5, 0.01)
ggplot() + 
    stat_function(aes(x=x), fun=dnorm, size=1, colour="blue") +
    ggtitle("The Standard Normal Distribution") + 
    geom_hline(yintercept=0, colour="darkgray") +
    geom_vline(xintercept=0, colour="darkgray") +
    ylab("probability density") + 
    xlab("z-scores\n(units of standard deviations)") + 
    scale_x_continuous(breaks=-5:5, labels=-5:5)

The standard normal distribution is centered around a mean of 0, just like z-scores, which by construction always have a mean of 0.  It also has a standard deviation of 1, again matching z-scores, which are scaled such that a z-score of 1 corresponds to being exactly one standard deviation away from the mean.

Recall from the article on common distribution patterns that we can use R functions like dnorm, pnorm, and qnorm to calculate densities, cumulative probabilities, and quantiles of a normal distribution, respectively.  If you don't specify the mean and standard deviation (sd) parameters, these functions operate on the standard normal distribution by default (the default values are mean=0 and sd=1).  For example, if we wanted to know the probability of an outcome being less than or equal to the mean, we'd use pnorm and give it the outcome value 0, since 0 represents the mean:

pnorm(0)
## [1] 0.5

As expected, there's a 0.5 or 50% chance of an outcome being below the mean (remember that the mean/expected value is the "center of mass" of the distribution that splits the proportion of outcomes in half).
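Relatedly, qnorm goes in the opposite direction: it takes a cumulative probability and returns the corresponding outcome (here, a z-score).  A quick sanity check:

```r
# qnorm is the inverse of pnorm: it maps a cumulative probability
# back to an outcome on the standard normal distribution
qnorm(0.5)            # 0 -- half of outcomes fall at or below the mean
pnorm(qnorm(0.975))   # 0.975 -- pnorm and qnorm round-trip
```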

The probability of an outcome being within one standard deviation of the mean can be calculated by first computing the probability of an outcome being less than or equal to +1 standard deviation, then subtracting away the probability of an outcome being less than or equal to -1 standard deviations.  This leaves us with the probability of an outcome being between -1 and +1 standard deviations:

pnorm(1) - pnorm(-1)
## [1] 0.6826895

Recall that the probability of an outcome falling in a given range is equal to the area under the distribution curve across that range.  In the charts below I've highlighted various ranges to give you an idea of how outcomes are distributed in a normal distribution.

The last chart shows that, in a normally distributed population, 99.7% of outcomes - pretty much all of them - fall within three standard deviations of the mean.
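The pattern highlighted in those charts is commonly known as the 68-95-99.7 rule, and we can reproduce it directly with pnorm:

```r
# the "68-95-99.7 rule": the probability of an outcome falling within
# 1, 2, and 3 standard deviations of the mean
round(pnorm(1) - pnorm(-1), 3)   # 0.683
round(pnorm(2) - pnorm(-2), 3)   # 0.954
round(pnorm(3) - pnorm(-3), 3)   # 0.997
```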

 

Z-score the heights dataset

Recall the dataset of people's heights that we looked at in Part 5 of this series.  The probability density histogram and the normal distribution curve that we fit to the data is given in the first chart below.  The probability density function of the z-scored heights is shown in the second chart.

The z-scored heights follow (approximately) a standard normal distribution.  We know from above that 68% of outcomes fall within one standard deviation of the mean.  We can verify this on the original dataset by calculating the proportion of heights that fall within one standard deviation:

within.1.sd <- heights >= mean(heights) - sd(heights) & 
               heights <= mean(heights) + sd(heights)
round( sum(within.1.sd) / length(within.1.sd), 2)
## [1] 0.68

As expected, about 68% of heights fall within one standard deviation of the mean.  This is consistent with the heights being (approximately) normally distributed, a.k.a. Gaussian.
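Since the heights vector from Part 5 isn't reproduced here, here's a self-contained sketch using simulated heights (the mean and sd are made up to resemble the example above).  It z-scores the data with base R's scale() and confirms the same ~68% figure in z-space, where "within one standard deviation of the mean" becomes simply |z| <= 1:

```r
set.seed(1)
# simulated stand-in for the heights data from Part 5 (made-up parameters)
heights <- rnorm(1000, mean=67, sd=4)

# scale() centers (subtracts the mean) and rescales (divides by the sd);
# it returns a matrix, so convert back to a plain vector
z.heights <- as.vector(scale(heights))

round(mean(z.heights), 10)   # 0
round(sd(z.heights), 10)     # 1

# in z-space, "within one sd of the mean" is just |z| <= 1
mean(abs(z.heights) <= 1)    # roughly 0.68
```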

 

Recap

  1. A random variable is described by the characteristics of its distribution
  2. The expected value, E[X], of a distribution is the weighted average of all outcomes.  It's the center of mass of the distribution.
  3. The variance, Var(X), is the "measure of spread" of a distribution.
  4. The standard deviation of a distribution is the square root of its variance
  5. A probability density function for continuous random variables takes an outcome value as input and returns the probability density for the given outcome
  6. The probability of observing an outcome within a given range can be determined by computing the area under the curve of the probability density function within the given range.
  7. A probability mass function for discrete random variables takes an outcome value as input and returns the actual probability for the given outcome
  8. A sample is a subset of a population. Statistical methods and principles are applied to the sample's distribution in order to make inferences about the true distribution -- i.e. the distribution across the population as a whole
  9. A summary statistic is a value that summarizes sample data, e.g. the mean or the variance
  10. A sampling distribution is the distribution of a summary statistic (e.g. the mean) calculated from multiple samples drawn from an underlying random variable distribution
  11. The Central Limit Theorem states that, regardless of the underlying distribution (provided it has finite variance), the sampling distribution of the mean is approximately normal for large sample sizes, with mean equal to the underlying population mean and variance equal to the underlying population variance divided by the sample size
  12. An outcome's z-score is calculated by taking the difference between the outcome and the mean, then dividing by the standard deviation.  A z-score is in units of standard deviations.
 
