43 Variance Parameter

43.1 Defining Variance With Expected Values

In the case of a discrete random variable, the variance is \[\begin{aligned} \sigma^2 &= \sum\limits_{x=0}^{\infty}(x-\mu)^2p(x) \\ &= \sum\limits_{x=0}^{\infty}(x^2-2\mu x+\mu^2)p(x) \\ &= \sum\limits_{x=0}^{\infty}(x^2p(x)-2\mu x\cdot p(x)+\mu^2p(x)) \\ &= \sum\limits_{x=0}^{\infty}x^2p(x)-\sum\limits_{x=0}^{\infty}2\mu x\cdot p(x) + \sum\limits_{x=0}^{\infty}\mu^2p(x) \\ &= \sum\limits_{x=0}^{\infty}x^2p(x)-2\mu\sum\limits_{x=0}^{\infty}x\cdot p(x) + \mu^2\sum\limits_{x=0}^{\infty}p(x) \\ &= \sum\limits_{x=0}^{\infty}x^2p(x)-2\mu\cdot\mu+\mu^2 \\ &= \sum\limits_{x=0}^{\infty}x^2p(x)-\mu^2 \\ &= E(X^2)-E(X)^2\\ \end{aligned}\]
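This shortcut can be checked numerically. A quick sketch in R, using a fair six-sided die as an illustrative discrete distribution (the die is my choice, not part of the derivation above):

```r
# p(x) = 1/6 for x = 1, ..., 6 (fair die)
x <- 1:6
p <- rep(1/6, 6)

mu <- sum(x * p)                  # E(X) = 3.5
var_def   <- sum((x - mu)^2 * p)  # definitional form
var_short <- sum(x^2 * p) - mu^2  # shortcut: E(X^2) - E(X)^2

all.equal(var_def, var_short)     # TRUE; both equal 35/12
```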

In the case of a continuous random variable, the variance is \[\begin{aligned} \sigma^2 &= \int\limits_{-\infty}^{\infty}(x-\mu)^2f(x)dx \\ &= \int\limits_{-\infty}^{\infty}(x^2-2\mu x+\mu^2)f(x)dx \\ &= \int\limits_{-\infty}^{\infty}(x^2f(x)-2\mu x\cdot f(x)+\mu^2f(x))dx \\ &= \int\limits_{-\infty}^{\infty}x^2f(x)dx-\int\limits_{-\infty}^{\infty}2\mu x\cdot f(x)dx + \int\limits_{-\infty}^{\infty}\mu^2f(x)dx \\ &= \int\limits_{-\infty}^{\infty}x^2f(x)dx-2\mu\int\limits_{-\infty}^{\infty}x\cdot f(x)dx + \mu^2\int\limits_{-\infty}^{\infty}f(x)dx \\ &= \int\limits_{-\infty}^{\infty}x^2f(x)dx-2\mu\cdot\mu+\mu^2 \\ &= \int\limits_{-\infty}^{\infty}x^2f(x)dx-\mu^2 \\ &= E(X^2)-E(X)^2 \end{aligned}\]
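The continuous identity can be checked the same way with R's `integrate()`. Here \(X \sim \text{Uniform}(0,1)\), for which \(\mu = \frac{1}{2}\) and \(\sigma^2 = \frac{1}{12}\) (the distribution is an arbitrary choice for illustration):

```r
# X ~ Uniform(0, 1): f(x) = 1 on (0, 1)
f <- function(x) dunif(x)

mu <- integrate(function(x) x * f(x), 0, 1)$value
var_def   <- integrate(function(x) (x - mu)^2 * f(x), 0, 1)$value
var_short <- integrate(function(x) x^2 * f(x), 0, 1)$value - mu^2

c(var_def, var_short)  # both approximately 1/12
```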

In general, these results may be summarized as follows:

\[\begin{aligned} \sigma^2 &= E[(X-\mu)^2] \\ &= E[(X^2-2\mu X+\mu^2)] \\ &= E(X^2) - E(2\mu X) + E(\mu^2) \\ &= E(X^2) - 2\mu E(X) + \mu^2 \\ &= E(X^2) - 2\mu\cdot\mu + \mu^2 \\ &= E(X^2) - 2\mu^2 + \mu^2 \\ &= E(X^2) - \mu^2 \\ &= E(X^2) - E(X)^2 \end{aligned}\]

43.2 Unbiased Estimator

\[\begin{aligned} E\Bigg(\frac{\sum\limits_{i=1}^{n}(x_i-\bar x)^2}{n}\Bigg) &= \frac{1}{n}E\Big(\sum\limits_{i=1}^{n}(x_i-\bar x)^2\Big) \\ &= \frac{1}{n}E\Big(\sum\limits_{i=1}^{n}(x_i^2-2\bar x x_i+\bar x^2)\Big) \\ &= \frac{1}{n}E\Big(\sum\limits_{i=1}^{n}x_i^2 - \sum\limits_{i=1}^{n}2\bar x x_i+\sum\limits_{i=1}^{n}\bar x^2\Big) \\ &= \frac{1}{n} E\Big(\sum\limits_{i=1}^{n}x_i^2-2 \frac{\sum\limits_{i=1}^{n}x_i}{n}\sum\limits_{i=1}^{n}x_i + n\bar x^2\Big) \\ &= \frac{1}{n} E\Big(\sum\limits_{i=1}^{n}x_i^2-2 \frac{\Big(\sum\limits_{i=1}^{n}x_i\Big)^2}{n}+n\bar x^2\Big) \\ &= \frac{1}{n} E\Big(\sum\limits_{i=1}^{n}x_i^2-2 \frac{n\Big(\sum\limits_{i=1}^{n}x_i\Big)^2}{n^2}+n\bar x^2\Big) \\ &= \frac{1}{n}E\Big(\sum\limits_{i=1}^{n}x_i^2-2n\bar x^2+n\bar x^2\Big) \\ &= \frac{1}{n}E\Big(\sum\limits_{i=1}^{n}x_i^2-n\bar x^2\Big) \\ &= \frac{1}{n}\Big[E\Big(\sum\limits_{i=1}^{n}x_i^2\Big)-nE(\bar x^2)\Big] \\ &= \frac{1}{n}\Big[\sum\limits_{i=1}^{n}E(x_i^2)-nE(\bar x^2)\Big] \\ ^{[1]} &= \frac{1}{n}\Big[\sum\limits_{i=1}^{n}\Big(\sigma^2+\mu^2\Big) - nE(\bar x^2)\Big] \\ ^{[2]} &= \frac{1}{n}\Big[\sum\limits_{i=1}^{n}\Big(\sigma^2+\mu^2\Big) - n\Big(\frac{\sigma^2}{n}+\mu^2\Big)\Big] \\ &= \frac{1}{n}(n\sigma^2+n\mu^2-\sigma^2-n\mu^2) \\ &=\frac{1}{n}(n\sigma^2-\sigma^2) \\ &= \frac{1}{n}(n-1)\sigma^2 \\ &= \frac{n-1}{n}\sigma^2 \end{aligned}\]

  1. \(V(X)=E(X^2)-E(X)^2\)
    \(\ \ \ \ \Rightarrow E(X^2)=V(X)+E(X)^2=\sigma^2+\mu^2\)
  2. \(V(\bar X)=E(\bar X^2)-E(\bar X)^2\)
    \(\ \ \ \ \Rightarrow E(\bar X^2)=V(\bar X)+E(\bar X)^2 = \frac{\sigma^2}{n}+\mu^2\), where \(V(\bar X)=\frac{\sigma^2}{n}\) because the \(x_i\) are independent observations from a distribution with variance \(\sigma^2\)
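The \(\frac{n-1}{n}\) factor can also be observed empirically. A minimal simulation sketch in R (the sample size, normal distribution, and replication count are arbitrary choices for illustration):

```r
set.seed(1)
n <- 5
sigma2 <- 4  # true variance (sd = 2)

# average the n-divisor estimator over many simulated samples
biased <- replicate(100000, {
  x <- rnorm(n, mean = 0, sd = 2)
  sum((x - mean(x))^2) / n
})

mean(biased)          # close to (n - 1) / n * sigma2
(n - 1) / n * sigma2  # 3.2
```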

Since \(E\Bigg(\frac{\sum\limits_{i=1}^{n}(x_i-\bar x)^2}{n}\Bigg)\neq\sigma^2\), it is a biased estimator of \(\sigma^2\). Notice, however, that the bias can be eliminated by dividing by \(n-1\) instead of by \(n\):

\[\begin{aligned} E\Bigg(\frac{\sum\limits_{i=1}^{n}(x_i-\bar x)^2}{n-1}\Bigg) &= \frac{1}{n-1}E\Big(\sum\limits_{i=1}^{n}(x_i-\bar x)^2\Big) \\ &= \frac{1}{n-1}E\Big(\sum\limits_{i=1}^{n}(x_i^2-2\bar x x_i+\bar x^2)\Big) \\ &= \frac{1}{n-1}E\Big(\sum\limits_{i=1}^{n}x_i^2 - \sum\limits_{i=1}^{n}2\bar x x_i+\sum\limits_{i=1}^{n}\bar x^2\Big) \\ &= \frac{1}{n-1} E\Big(\sum\limits_{i=1}^{n}x_i^2 - 2\frac{\sum\limits_{i=1}^{n}x_i}{n}\sum\limits_{i=1}^{n}x_i + n\bar x^2\Big) \\ &= \frac{1}{n-1} E\Big(\sum\limits_{i=1}^{n}x_i^2- 2\frac{\Big(\sum\limits_{i=1}^{n}x_i\Big)^2}{n}+n\bar x^2\Big) \\ &= \frac{1}{n-1} E\Big(\sum\limits_{i=1}^{n}x_i^2- 2\frac{n\Big(\sum\limits_{i=1}^{n}x_i\Big)^2}{n^2} + n\bar x^2\Big) \\ &= \frac{1}{n-1}E\Big(\sum\limits_{i=1}^{n}x_i^2-2n\bar x^2+n\bar x^2\Big) \\ &= \frac{1}{n-1}E\Big(\sum\limits_{i=1}^{n}x_i^2-n\bar x^2\Big) \\ &= \frac{1}{n-1}\Big[E\Big(\sum\limits_{i=1}^{n}x_i^2\Big)-nE(\bar x^2)\Big] \\ &= \frac{1}{n-1}\Big[\sum\limits_{i=1}^{n}E(x_i^2)-nE(\bar x^2)\Big] \\ ^{[1]} &= \frac{1}{n-1}\Big[\sum\limits_{i=1}^{n}(\sigma^2+\mu^2)-nE(\bar x^2)\Big] \\ ^{[2]} &= \frac{1}{n-1}\Big[\sum\limits_{i=1}^{n}(\sigma^2+\mu^2) - n\Big(\frac{\sigma^2}{n}+\mu^2\Big)\Big] \\ &= \frac{1}{n-1}(n\sigma^2+n\mu^2-\sigma^2-n\mu^2) \\ &= \frac{1}{n-1}(n\sigma^2-\sigma^2) \\ &= \frac{1}{n-1}(n-1)\sigma^2 \\ &=\sigma^2 \end{aligned}\]

  1. \(V(X)=E(X^2)-E(X)^2\)
    \(\ \ \ \ \Rightarrow E(X^2)=V(X)+E(X)^2=\sigma^2+\mu^2\)
  2. \(V(\bar X)=E(\bar X^2)-E(\bar X)^2\)
    \(\ \ \ \ \Rightarrow E(\bar X^2)=V(\bar X)+E(\bar X)^2 = \frac{\sigma^2}{n}+\mu^2\), where \(V(\bar X)=\frac{\sigma^2}{n}\) because the \(x_i\) are independent observations from a distribution with variance \(\sigma^2\)

Thus \(E\Bigg(\frac{\sum\limits_{i=1}^{n}(x_i-\bar x)^2}{n-1}\Bigg)\) is an unbiased estimator of \(\sigma^2\), and we define the estimator

\[s^2= \frac{\sum\limits_{i=1}^{n}(x_i-\bar x)^2}{n-1}\]
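R's built-in `var()` uses exactly this \(n-1\) divisor, which can be confirmed directly (the vector below is an arbitrary example):

```r
x <- c(4, 8, 15, 16, 23, 42)  # arbitrary sample
n <- length(x)

# sample variance from the definition of s^2
s2 <- sum((x - mean(x))^2) / (n - 1)

all.equal(s2, var(x))  # TRUE
```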

43.3 Computational Formulae

43.3.1 Computational Formula for \(\sigma^2\)

\[\begin{aligned} \sigma^2 &= \frac{\sum\limits_{i=1}^{N}(x_i-\mu)^2}{N} \\ &= \frac{\sum\limits_{i=1}^{N}x_i^2-\frac{\Big(\sum\limits_{i=1}^{N}x_i\Big)^2}{N}}{N} \end{aligned}\]

Proof:

\[\begin{aligned} \frac{\sum\limits_{i=1}^{N}(x_i-\mu)^2}{N} &= \frac{\sum\limits_{i=1}^{N}(x_i^2-2\mu x_i+\mu^2)}{N} \\ &= \frac{\sum\limits_{i=1}^{N} x_i^2-\sum\limits_{i=1}^{N}2\mu x_i + \sum\limits_{i=1}^{N}\mu^2}{N} \\ &= \frac{\sum\limits_{i=1}^{N} x_i^2-2\mu\sum\limits_{i=1}^{N}x_i+N\mu^2}{N} \\ &= \frac{\sum\limits_{i=1}^{N}x_i^2 -2\frac{\sum\limits_{i=1}^{N}x_i}{N}\sum\limits_{i=1}^{N}x_i + N\Big(\frac{\sum\limits_{i=1}^{N}x_i}{N}\Big)^2}{N} \\ &= \frac{\sum\limits_{i=1}^{N} x_i^2-2\frac{\Big(\sum\limits_{i=1}^{N}x_i\Big)^2}{N} + \frac{\Big(\sum\limits_{i=1}^{N}x_i\Big)^2}{N}}{N} \\ &= \frac{\sum\limits_{i=1}^{N}x_i^2-\frac{\Big(\sum\limits_{i=1}^{N}x_i\Big)^2}{N}}{N} \end{aligned}\]
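As a sanity check, the two sides of this identity can be compared in R, treating a small vector as a complete population of size \(N\) (the values are an arbitrary example):

```r
x <- c(3, 7, 7, 19)  # treated as the entire population
N <- length(x)
mu <- mean(x)

var_def  <- sum((x - mu)^2) / N             # definitional form
var_comp <- (sum(x^2) - sum(x)^2 / N) / N   # computational form

all.equal(var_def, var_comp)  # TRUE; both equal 36
```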

43.3.2 Computational Formula for \(s^2\)

\[\begin{aligned} s^2 &= \frac{\sum\limits_{i=1}^{n}(x_i-\bar x)^2}{n-1} \\ &= \frac{\sum\limits_{i=1}^{n}x_i^2-\frac{\Big(\sum\limits_{i=1}^{n}x_i\Big)^2}{n}}{n-1} \end{aligned}\]

Proof:

\[\begin{aligned} \frac{\sum\limits_{i=1}^{n}(x_i-\bar x)^2}{n-1} &= \frac{\sum\limits_{i=1}^{n}(x_i^2-2\bar x x_i+\bar x^2)}{n-1} \\ &= \frac{\sum\limits_{i=1}^{n}x_i^2-\sum\limits_{i=1}^{n}2\bar x x_i + \sum\limits_{i=1}^{n}\bar x^2}{n-1} \\ &= \frac{\sum\limits_{i=1}^{n}x_i^2-2\bar x\sum\limits_{i=1}^{n}x_i+n\bar x^2}{n-1} \\ &= \frac{\sum\limits_{i=1}^{n}x_i^2-2\frac{\sum\limits_{i=1}^{n}x_i}{n}\sum\limits_{i=1}^{n}x_i + n\Big(\frac{\sum\limits_{i=1}^{n}x_i}{n}\Big)^2}{n-1} \\ &= \frac{\sum\limits_{i=1}^{n}x_i^2-2\frac{\Big(\sum\limits_{i=1}^{n}x_i\Big)^2}{n} + \frac{\Big(\sum\limits_{i=1}^{n}x_i\Big)^2}{n}}{n-1} \\ &= \frac{\sum\limits_{i=1}^{n}x_i^2-\frac{\Big(\sum\limits_{i=1}^{n}x_i\Big)^2}{n}}{n-1} \end{aligned}\]

43.3.3 Corollary: Alternative Computational Formula for \(s^2\)

\[\begin{aligned} s^2 &= \frac{1}{n \cdot (n - 1)} \Big(n \cdot \sum\limits_{i=1}^n x_i^2 - \big(\sum\limits_{i = 1}^n x_i\big)^2\Big) \end{aligned}\]

Proof:

Beginning with the result from 43.3.2:

\[\begin{aligned} s^2 &= \frac{\sum\limits_{i=1}^{n}x_i^2-\frac{\Big(\sum\limits_{i=1}^{n}x_i\Big)^2}{n}}{n-1} \\ &= \frac{1}{n-1} \Big(\sum\limits_{i=1}^{n}x_i^2-\frac{\Big(\sum\limits_{i=1}^{n}x_i\Big)^2}{n}\Big) \\ &= \frac{1}{n-1}\Big(\frac{n \cdot \sum\limits_{i=1}^{n}x_i^2}{n} - \frac{\Big(\sum\limits_{i=1}^{n}x_i\Big)^2}{n}\Big) \\ &= \frac{1}{n \cdot (n - 1)} \Big(n \cdot \sum\limits_{i=1}^n x_i^2 - \big(\sum\limits_{i = 1}^n x_i\big)^2\Big) \end{aligned}\]

43.3.3.1 Application

Both this result and the previous one may be used to optimize calculation of the sample variance, since they permit the calculation to occur in a single pass over the data. Consider the following sample:

x <- c(2, 11, 9, 7, 13, 3, 20)

Using the formula

\[s^2 = \frac{\sum\limits_{i=1}^n(x_i - \bar{x})^2}{n - 1}\]

the sample variance is calculated as follows:

n <- length(x)

# first pass: accumulate the sample mean
x_bar <- 0
for (i in seq_along(x)){
  x_bar <- x_bar + x[i] / n
}

# second pass: accumulate the sum of squared deviations
variance <- 0
for (i in seq_along(x)){
  variance <- variance + (x[i] - x_bar)^2 / (n - 1)
}

This calculation requires two for loops, because \(\bar x\) must be computed in full before any squared deviation can be formed. Using the computational formula

\[\begin{aligned} s^2 &= \frac{1}{n \cdot (n - 1)} \Big(n \cdot \sum\limits_{i=1}^n x_i^2 - \big(\sum\limits_{i = 1}^n x_i\big)^2\Big) \end{aligned}\]

the sample variance can be calculated using one for loop.

n <- length(x)

# single pass: accumulate both running sums simultaneously
sum_square <- 0
sum_i <- 0
for (i in seq_along(x)){
  sum_square <- sum_square + x[i]^2
  sum_i <- sum_i + x[i]
}

variance <- 1 / (n * (n - 1)) * (n * sum_square - sum_i^2)

Comparing the processing time of these two approaches shows that the computational formula, utilizing a single for loop, requires less than half the processing time of the definitional formula.

library(microbenchmark)
# sample of 1,000 values between 1 and 100
x <- sample(1:100, 1000, replace = TRUE)
microbenchmark(
  standard = 
  {
    x_bar <- 0
    n <- length(x)
    for (i in seq_along(x)){
      x_bar <- x_bar + x[i] / n
    }
    
    variance <- 0
    for (i in seq_along(x)){
      variance <- variance + (x[i] - x_bar)^2 / (n - 1)
    }
  },
  computational = 
  {
    sum_square <- 0
    sum_i <- 0
    n <- length(x)
    for(i in seq_along(x)){
      sum_square <- sum_square + x[i]^2
      sum_i <- sum_i + x[i]
    }
    
    variance <- 1 / (n * (n - 1)) * (n * sum_square - sum_i^2)
  }
)
## Unit: milliseconds
##           expr       min        lq      mean   median        uq      max
##       standard 10.469167 11.512988 13.097602 12.07349 13.716589 32.62342
##  computational  4.809786  5.295367  6.017872  5.61363  6.135976 14.91976
##  neval cld
##    100   b
##    100  a
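As a correctness check (separate from the benchmark above), both loop-based approaches can be compared against R's built-in `var()`:

```r
set.seed(42)
x <- sample(1:100, 1000, replace = TRUE)
n <- length(x)

# two-pass (definitional) form
v1 <- sum((x - mean(x))^2) / (n - 1)
# one-pass (computational) form
v2 <- (n * sum(x^2) - sum(x)^2) / (n * (n - 1))

all.equal(v1, var(x))  # TRUE
all.equal(v2, var(x))  # TRUE
```

One caveat worth noting: when the data values are large relative to their spread, the one-pass formula subtracts two nearly equal quantities and can lose precision in floating point. For integer-scale data such as this sample, the two forms agree.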