The difficulty with introducing the normal distribution
The normal distribution plays a central role in introductory statistics courses, but it can seem mysterious and intimidating at first sight. The mathematical expression for the density seems completely arbitrary for a distribution of such central importance. Many students calm their nerves once they are presented with the central limit theorem, but only until they realize that it is no clearer why this seemingly arbitrary distribution should satisfy such a useful theorem. The proof of the CLT via characteristic functions may provide an inkling of intuition, but it is easily lost in the "computational trick" nature of characteristic functions, or the students may simply not have met characteristic functions yet.
Most students in introductory statistics courses, however, have a background in calculus and have accepted the idea of approximating a function with a second order polynomial. Why not build on this intuition when first presenting normal distributions in introductory courses?
Approximation of the posterior distribution
Say that we have some likelihood \(p(y \mid \theta)\) and a prior \(p(\theta)\) and assume that this implies a unimodal posterior \(p(\theta \mid y) = p(y \mid \theta)p(\theta) / \alpha \propto p(y \mid \theta)p(\theta)\) where \( \alpha = \int p(y \mid \theta)p(\theta) d\theta \) is a normalization constant.
It can be impossible to analytically calculate the normalization constant for the posterior distribution, so we would like some computational procedure for approximating the posterior. One option is to approximate the posterior only at a discrete grid of points \([\theta_1, \theta_2, \ldots, \theta_N]\), or to use some sort of numerical integration over \(\theta\), but both of these approaches suffer from the problem that \(\theta\) is typically many-dimensional (models with only one or two parameters are usually too simple), so calculating \(p(y \mid \theta)p(\theta)\) on a sufficiently fine grid of \(\theta\) values will most often be too expensive. We will now explore a cheaper alternative consisting of second order Taylor approximation of the log-posterior.
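As a small sketch of the grid idea (an illustration added here, assuming a one-dimensional parameter with an exponential likelihood and a flat prior; the data and names are only for illustration), the grid approach amounts to evaluating the unnormalized posterior on a grid and normalizing numerically:
## Sketch of the grid alternative for a one-dimensional parameter
## (illustrative data; flat prior, exponential likelihood)
y <- rexp(20, rate = 2)
theta_grid <- seq(0.01, 6, length.out = 1000)
log_unnorm <- sapply(theta_grid, \(theta) sum(dexp(y, theta, log = TRUE))) # + log prior (= 0)
step <- diff(theta_grid)[1]
post_grid <- exp(log_unnorm - max(log_unnorm))   # subtract max for numerical stability
post_grid <- post_grid / (sum(post_grid) * step) # normalize with a Riemann sum
## post_grid now approximates the posterior density on theta_grid, but the number
## of grid points needed grows exponentially with the dimension of theta.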
More precisely, we will use numerical optimization (such as optim
in R) to
approximate the peak \(\theta_{\mathrm{MAP}}\) of the log-posterior and then use
Taylor approximation around the numerically approximated peak
\(\hat{\theta}_{\mathrm{MAP}}\) to approximate the log-posterior. Note that
taking the logarithm will flatten the peak a bit but won't change the location
of it, since it is a monotone transformation. Since \(\theta_{\mathrm{MAP}}\) is
the peak, we have that the gradient at this point is zero, i.e.,
\[D_{\theta}(\log[p(y \mid \theta)p(\theta)])|_{\theta =
\hat{\theta}_{\mathrm{MAP}}} \approx D_{\theta}(\log[p(y \mid
\theta)p(\theta)])|_{\theta = \theta_{\mathrm{MAP}}} = 0,\] so we only have to
approximate the Hessian \(D^2_{\theta}(\log[p(y \mid \theta)p(\theta)])|_{\theta
= \hat{\theta}_{\mathrm{MAP}}}\), which can be done numerically (e.g., with
numDeriv::hessian
in R). Let's call this Hessian approximation
\(\hat{H}(\hat{\theta}_{\mathrm{MAP}})\). Since the normalization constant \(\alpha\) only shifts the log-posterior by an additive constant, the gradient and Hessian of \(\log p(\theta \mid y)\) are the same as those of \(\log[p(y \mid \theta)p(\theta)]\). Using second order Taylor approximation around \(\hat{\theta}_{\mathrm{MAP}}\) (written here for a one-dimensional \(\theta\) to keep the notation simple), we now get
\begin{split} \log p(\theta \mid y) \approx \log p(\hat{\theta}_{\mathrm{MAP}} \mid y) + \frac{1}{2}\hat{H}(\hat{\theta}_{\mathrm{MAP}})\left(\theta - \hat{\theta}_{\mathrm{MAP}}\right)^2. \end{split}
We can transform this back using the exponential function to get the following approximation of the posterior distribution:
\begin{split} p(\theta \mid y) \approx p(\hat{\theta}_{\mathrm{MAP}} \mid y)\exp\left(-\frac{1}{2}\left(-\hat{H}(\hat{\theta}_{\mathrm{MAP}})\right)\left(\theta - \hat{\theta}_{\mathrm{MAP}}\right)^2\right). \end{split}
This has exactly the shape of a normal density with mean \(\hat{\theta}_{\mathrm{MAP}}\) and variance \(-[\hat{H}(\hat{\theta}_{\mathrm{MAP}})]^{-1}\); after normalization, the approximating distribution is \(N(\hat{\theta}_{\mathrm{MAP}}, -[\hat{H}(\hat{\theta}_{\mathrm{MAP}})]^{-1})\). Approximating the log-posterior by a second order Taylor polynomial around the mode has given us a Gaussian approximation of the posterior distribution.
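For a multivariate \(\theta\) the same argument goes through, with the square replaced by a quadratic form in the Hessian matrix (a brief aside, following the same steps as the scalar derivation above):
\[ p(\theta \mid y) \approx p(\hat{\theta}_{\mathrm{MAP}} \mid y)\exp\left(-\frac{1}{2}\left(\theta - \hat{\theta}_{\mathrm{MAP}}\right)^{\top}\left(-\hat{H}(\hat{\theta}_{\mathrm{MAP}})\right)\left(\theta - \hat{\theta}_{\mathrm{MAP}}\right)\right), \]
which, up to normalization, is the density of a multivariate normal distribution with mean \(\hat{\theta}_{\mathrm{MAP}}\) and covariance matrix \([-\hat{H}(\hat{\theta}_{\mathrm{MAP}})]^{-1}\).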
Of course, one could have chosen a transformation other than the logarithm before doing the Taylor approximation, and this would no longer lead to a Gaussian approximation after transforming back, so in a sense we have simply moved the question from "why Gaussian?" to "why the log-transformation?". However, this at least provides a starting point and motivates the rather mysterious expression for the Gaussian density as simply being the exponential of a parabola, where the approximating Gaussian density uses the parabola given by a second order Taylor approximation.
Connection to maximum likelihood theory
Assume that we choose a flat prior \(p(\theta) \propto 1\). Then the posterior is proportional to the likelihood, so the approximation above gives us an approximation of the likelihood (if we don't choose a flat prior but have so much data that the prior becomes irrelevant compared to the likelihood, then this would also be the case). In this framework we would call \(\theta_{\mathrm{MAP}}\) the maximum likelihood estimate and \(-H(\theta_{\mathrm{MAP}})\) the observed Fisher information (and \(\hat{\theta}_{\mathrm{MAP}}\) and \(-\hat{H}(\hat{\theta}_{\mathrm{MAP}})\) are numerical approximations of them).
In the Bayesian framework we focused on \(\theta\) as the random variable of interest in the expression and thereby got the approximate conditional distribution of \(\theta\) given \(y\), but this doesn't make sense in the frequentist framework, since \(\theta\) is seen as the single unknown true value of the parameter. However, if we instead focus on the MLE \(\theta_{\mathrm{MAP}}\) as the random variable of interest in the expression, then we notice that the form of the approximate likelihood exactly corresponds to the form of the distribution \(\theta_{\mathrm{MAP}} \sim N(\theta, [-H(\theta_{\mathrm{MAP}})]^{-1})\), which is essentially the classic asymptotic normal distribution used in the maximum likelihood framework. It is more common to use the Fisher information, but the observed Fisher information can be thought of as an estimate of the Fisher information (and under suitable regularity conditions it will be consistent, by the law of large numbers). This is completely unrigorous (in particular, it doesn't make sense that the variance of \(\theta_{\mathrm{MAP}}\) depends on \(\theta_{\mathrm{MAP}}\)), but it is a simple motivation of the asymptotic normal approximation used by frequentists, and it hints at a connection to second order Taylor approximation of the log-likelihood.
Aside from this, it also gives us (Bayesians) an interpretation of maximum likelihood based analyses. We can think of confidence intervals based on the asymptotic normal distribution above as being posterior intervals calculated with quadratic approximation of the posterior under a flat prior (or with another prior that is weak enough to not matter compared to the likelihood for the given amount of data).
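To make the correspondence concrete, here is a small added sketch (with illustrative data) of the standard Wald-type 95% confidence interval for an exponential rate; under a flat prior, the same interval is the 95% posterior interval obtained from the quadratic approximation of the log-posterior:
## Wald interval for an exponential rate (illustrative data)
## MLE: rate_hat = 1 / mean(y); observed Fisher information: n / rate_hat^2
y <- rexp(50, rate = 2)
rate_hat <- 1 / mean(y)
obs_info <- length(y) / rate_hat^2
se <- sqrt(1 / obs_info) # equals rate_hat / sqrt(n)
wald_ci <- rate_hat + c(-1, 1) * qnorm(0.975) * se
## With a flat prior, this is also the 95% interval of the
## Gaussian approximation N(rate_hat, 1 / obs_info) of the posterior.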
Example in R
The code below illustrates second order Taylor approximation of the
log-posterior in a simple example. We draw 20 data points \(y\) from an
exponential distribution with rate parameter 2. As a model, we use a flat prior
for the rate \(\theta\) and an exponential distribution likelihood \(y_i \mid
\theta \stackrel{\mathrm{iid}}{\sim} \mathrm{Exp}(\theta)\). Since we use a flat
prior, the posterior is simply the normalized likelihood, so this also serves as an illustration of the maximum likelihood framework. Note that it is not necessary
to use optim
and numDeriv::hessian
for the simple example below, but this
will allow the code to generalize to more realistic examples where it would not
be feasible to find analytical solutions.
library(ggplot2)
library(patchwork) # to combine plots

## Simulate from exponential distribution with rate 2
true_rate <- 2
obs <- rexp(20, rate = true_rate)

## Define functions for flat prior, exponential likelihood, and (unnormalized) posterior
log_prior <- \(rate) 0
log_likelihood <- function(rate_vec)
  purrr::map_dbl(rate_vec, \(rate) sum(dexp(obs, rate, log = TRUE)))
log_posterior_unnormalized <- \(rate) log_likelihood(rate) + log_prior(rate)

## Use optim to find the mode of the log-posterior
log_posterior_mode <- optim(par = 1,
                            fn = log_posterior_unnormalized,
                            method = "BFGS",
                            control = list(fnscale = -1)) # find max instead of min

## Use numDeriv::hessian to approximate the Hessian at the mode
## (a 1x1 matrix here, so we store it as a number)
H <- as.numeric(numDeriv::hessian(log_posterior_unnormalized,
                                  x = log_posterior_mode$par))

## Define 2nd order Taylor approximation of the log-posterior
approx_log_posterior_unnormalized <- function(rate)
  0.5 * H * (rate - log_posterior_mode$par)^2 + log_posterior_mode$value

## Calculate variance of the approximating normal distribution
var_approx_normal <- -1 / H

## One can then use the approximating normal distribution
## N(log_posterior_mode$par, var_approx_normal). For example, you could sample
## from it or directly calculate probability intervals.
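To make that last comment concrete, here is a small added sketch reusing the objects defined above; for this flat-prior exponential model the exact posterior is a Gamma(n + 1, rate = sum(obs)) distribution, which gives something to compare against:
## Compare a 95% interval from the normal approximation with the exact posterior
## (with the flat prior, the exact posterior is Gamma(n + 1, rate = sum(obs)))
normal_interval <- qnorm(c(0.025, 0.975),
                         mean = log_posterior_mode$par,
                         sd = sqrt(var_approx_normal))
exact_interval <- qgamma(c(0.025, 0.975),
                         shape = length(obs) + 1,
                         rate = sum(obs))
## Or draw samples from the approximating normal distribution
approx_draws <- rnorm(1e4, log_posterior_mode$par, sqrt(var_approx_normal))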
We can now plot the true and approximate log-posterior and posterior.
Code for generating plot
## Set ggplot theme to be transparent
theme_set(
  theme_bw() +
    theme(legend.position = "bottom",
          legend.title = element_blank()) +
    theme(panel.background = element_rect(fill = "transparent"),
          plot.background = element_rect(fill = "transparent", color = NA),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          legend.background = element_rect(fill = "transparent"),
          legend.box.background = element_rect(fill = "transparent"),
          legend.key = element_rect(fill = "transparent"))
)
## Plot of comparison of log-posterior and Taylor approximation
## (dexp gives NaN for non-positive rates, so the "truth" curve only shows for rate > 0)
rate_grid <- seq(-0.5, 5, length.out = 1000)
p_log <- ggplot() +
  geom_line(aes(x = rate_grid, y = log_posterior_unnormalized(rate_grid),
                col = "truth")) +
  geom_line(aes(x = rate_grid, y = approx_log_posterior_unnormalized(rate_grid),
                col = "approximation")) +
  xlab("") +
  ylab("") +
  ggtitle("log-posterior") +
  theme(legend.position = "none")

## Plot of comparison of posterior and exp of Taylor approximation
p_exp <- ggplot() +
  geom_line(aes(x = rate_grid, y = exp(log_posterior_unnormalized(rate_grid)),
                col = "truth")) +
  geom_line(aes(x = rate_grid, y = exp(approx_log_posterior_unnormalized(rate_grid)),
                col = "approximation")) +
  xlab("rate") +
  ylab("") +
  ggtitle("posterior") +
  theme(legend.position = "bottom",
        legend.title = element_blank())

## Combine and save plots
ggsave(file = "posterior.png", plot = p_log / p_exp,
       scale = 0.5, type = "cairo-png", bg = "transparent")