This is a summary of some of the points in the references listed at the bottom, in particular those by Vehtari et al.
Evaluating a model's predictive performance
Increasing model complexity will always give a better fit to observed data, but will at some point lead to overfitting, where the model becomes worse at predicting new unseen observations. We would like a way to assess the predictive performance of a model on unseen data.
Assume that we now have a new data point \(\tilde{y}_i\). The posterior predictive density \(p(\tilde{y}_i \mid y)\) is a possible measure of how well our model predicts this. The lower this probability is, the more surprised the model is to see the new observation. We would prefer a model that is not very surprised by the new observation (i.e., a model with high posterior predictive density in the new point). We can rewrite the posterior predictive density as
\begin{equation*} \begin{split} p(\tilde{y}_i \mid y) &= \int p(\tilde{y}_i, \theta \mid y) d\theta\\ &= \int p(\tilde{y}_i \mid \theta, y)p(\theta \mid y) d\theta \\ &= \int p(\tilde{y}_i \mid \theta)p(\theta \mid y) d\theta, \end{split} \end{equation*}where the last step follows from the assumption that \(\theta\) contains all information about \(p\) that the old observations \(y\) can teach us. As usual, however, we will actually work with the log predictive density \(\log p(\tilde{y}_i \mid y)\) instead.
We are not so interested in the performance on a single new data point, but rather in the expected performance over the distribution of new data points, as given by the expected log predictive density \(E_{\tilde{y}_i}[\log p(\tilde{y}_i \mid y)] = \int p(\tilde{y}_i) \log p(\tilde{y}_i \mid y) d\tilde{y}_i\).
It is typical to assume an entire new data set of iid observations \(\tilde{y}_1, \ldots, \tilde{y}_n\) of the same size as our original data set \(y_1, \ldots, y_n\) and consider the expected log pointwise predictive density
\begin{equation*} \begin{split} \mathrm{elpd} &= E_{\tilde{y}}[\log p(\tilde{y} \mid y)]\\ &= E_{\tilde{y}}\left[\sum_{i = 1}^n \log p(\tilde{y}_i \mid y)\right]\\ &= \sum_{i = 1}^n E_{\tilde{y}_i}[ \log p(\tilde{y}_i \mid y) ]\\ &= \sum_{i = 1}^n \int p(\tilde{y}_i) \log p(\tilde{y}_i \mid y) d\tilde{y}_i. \end{split} \end{equation*}This is also called the elppd.
Since we don't know \(p(\tilde{y}_i)\), we can't actually calculate the elpd, so we will have to come up with a way to approximate it. We will now go through different ways of doing this.
Performance on existing data
A first, naive, idea is to estimate the elpd using the log pointwise predictive density from the observed data: \(\mathrm{lpd} = \sum_{i = 1}^n \log p(y_i \mid y) = \sum_{i = 1}^n \log \int p(y_i \mid \theta)p(\theta \mid y) d\theta\) (sometimes called lppd). This is easy to approximate from our posterior samples \(\theta_1, \ldots, \theta_S\) by \(\widehat{\mathrm{lpd}} = \sum_{i = 1}^n \log \frac{1}{S} \sum_{s = 1}^S p(y_i \mid \theta_s)\). The approximation works since \(p(y_i \mid \theta_1),\ldots,p(y_i \mid \theta_S)\) are samples from the posterior distribution of \(p(y_i \mid \theta)\), so their average approximates the posterior expectation \(E_{\theta \mid y}[p(y_i \mid \theta)] = \int p(y_i \mid \theta)p(\theta \mid y) d\theta\).
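To make this concrete, here is a minimal numpy sketch of the estimate, assuming we have already computed a matrix `log_lik` with entries \(\log p(y_i \mid \theta_s)\) (the name and layout are my own convention, not from any particular library); the average over samples is done in log space for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp

def lpd_hat(log_lik):
    """lpd estimate from a (S, n) matrix with log_lik[s, i] = log p(y_i | theta_s)."""
    S = log_lik.shape[0]
    # log( (1/S) * sum_s p(y_i | theta_s) ) for each i, summed over i
    return np.sum(logsumexp(log_lik, axis=0) - np.log(S))
```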
Since the model has been fit to the same data \(y\) that we now use to evaluate it, it is no surprise that the lpd is an overestimate of the elpd. The model will have a harder time predicting unseen data than predicting data that was used to train it. However, we will see that lpd is used as a building block for the more sophisticated estimates of elpd.
Leave-one-out cross-validation (LOO-CV)
Let \(y_{-i}\) denote all other data points than \(y_i\). Then we can treat \(y_i\) as an unseen data point by only fitting the model to \(y_{-i}\), so we can estimate \(p(\tilde{y}_i \mid y)\) by \(p(y_i \mid y_{-i})\). This leads us to the LOO-CV estimate \(\mathrm{elpd_{loo}} = \sum_{i = 1}^n \log p(y_i \mid y_{-i})\). In practice \(\log p(y_i \mid y_{-i})\) is approximated by sampling \(\theta^{-i}_1, \ldots, \theta^{-i}_S\) from the posterior \(p(\theta \mid y_{-i})\) and then calculating \(\log \frac{1}{S} \sum_{s = 1}^S p(y_i \mid \theta^{-i}_s)\), similar to how we estimated lpd above. This is time-consuming, since we have to draw \(S\) posterior samples of \(\theta\) for each of the \(n\) different ways of leaving out a data point.
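As an illustration of the cost, here is a sketch of the brute-force procedure; `draw_posterior` and `log_lik_point` are hypothetical model-specific helpers standing in for an actual sampler and likelihood:

```python
import numpy as np
from scipy.special import logsumexp

def elpd_loo_brute_force(y, draw_posterior, log_lik_point, S=1000):
    """Naive LOO-CV: refit the model n times, once per held-out point.

    draw_posterior(y, S): S draws from p(theta | y) (hypothetical helper).
    log_lik_point(y_i, thetas): (S,) array of log p(y_i | theta_s) (hypothetical helper).
    """
    total = 0.0
    for i in range(len(y)):
        y_minus_i = np.delete(y, i)            # y_{-i}
        thetas = draw_posterior(y_minus_i, S)  # samples from p(theta | y_{-i})
        log_p = log_lik_point(y[i], thetas)
        total += logsumexp(log_p) - np.log(S)  # log (1/S) sum_s p(y_i | theta_s^{-i})
    return total
```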
Importance sampling
We will now show how to approximate \(\mathrm{elpd_{loo}}\) in a way that only requires \(S\) draws \(\theta_1, \ldots, \theta_S\) from the full data posterior \(p(\theta \mid y)\). In order to do this, we use importance sampling, which is a Monte Carlo method for approximating expectations. We are interested in approximating the expectation \(p(y_i \mid y_{-i}) = E_{\theta \mid y_{-i}}[p(y_i \mid \theta)]\). A typical Monte Carlo approximation would be to sample \(\theta^i_1, \ldots, \theta^i_S\) from the posterior \(p(\theta \mid y_{-i})\) and use \(E_{\theta \mid y_{-i}}[p(y_i \mid \theta)] \approx \frac{1}{S} \sum_{s = 1}^S p(y_i \mid \theta^i_s)\). As explained above, we instead want to do the approximation using samples \(\theta_1, \ldots, \theta_S\) from the full data posterior \(p(\theta \mid y)\), since this allows us to use the same \(S\) posterior samples for all the \(n\) ways of removing a data point. This, however, means that we can only approximate expectations taken with respect to the full data posterior \(p(\theta \mid y)\), so we have to find some way of rewriting the expectation above as an expectation under this distribution instead. It turns out that we can scale \(p(y_i \mid \theta)\) by the ratio \(\frac{p(\theta \mid y_{-i})}{p(\theta \mid y)}\) between the target distribution and our proposal distribution and get the desired expectation under our proposal distribution:
\begin{equation*} \begin{split} E_{\theta \mid y}\left[p(y_i \mid \theta) \frac{p(\theta \mid y_{-i})}{p(\theta \mid y)}\right] &= \int p(y_i \mid \theta)\frac{p(\theta \mid y_{-i})}{p(\theta \mid y)}p(\theta \mid y) d\theta\\ &= \int p(y_i \mid \theta) p(\theta \mid y_{-i}) d\theta \\ &= E_{\theta \mid y_{-i}}[p(y_i \mid \theta)]\\ &= p(y_i \mid y_{-i}). \end{split} \end{equation*}This is the idea behind importance sampling. However, it doesn't immediately help us here, since we would now have to know \(p(\theta \mid y_{-i})\) to be able to calculate the scaling. We have to rewrite the ratio \(\frac{p(\theta \mid y_{-i})}{p(\theta \mid y)}\) to a form that we can actually calculate
\begin{equation*} \begin{split} \frac{p(\theta \mid y_{-i})}{p(\theta \mid y)} &= \frac{\frac{p(y_{-i}, \theta)}{p(y_{-i})}}{\frac{p(y, \theta)}{p(y)}}\\ &= \frac{p(y_{-i}, \theta)}{p(y, \theta)}\frac{p(y)}{p(y_{-i})}\\ &= \left(\frac{p(y, \theta)}{p(y_{-i}, \theta)}\right)^{-1} \frac{p(y)}{p(y_{-i})} \\ &= \left(\frac{p(y_i, y_{-i}, \theta)}{p(y_{-i}, \theta)}\right)^{-1} \frac{p(y)}{p(y_{-i})} \\ &= \left(p(y_i \mid y_{-i}, \theta)\right)^{-1} \frac{p(y)}{p(y_{-i})}\\ &= \left(p(y_i \mid \theta)\right)^{-1} \frac{p(y)}{p(y_{-i})}, \end{split} \end{equation*}where the last equality relies on the assumption that \(y_1, \ldots, y_n\) are conditionally independent given \(\theta\). We have no way of calculating the ratio \(\frac{p(y)}{p(y_{-i})}\), since the marginal likelihoods \(p(y)\) and \(p(y_{-i})\) are intractable, but we will see that, since this ratio is just a constant that doesn't depend on \(\theta\), it will disappear from our calculations. The calculations above show that \((p(y_i \mid \theta))^{-1} = \frac{p(y_{-i})}{p(y)}\frac{p(\theta \mid y_{-i})}{p(\theta \mid y)}\), so
\begin{equation*} \begin{split} \frac{E_{\theta \mid y}\left[p(y_i \mid \theta)(p(y_i \mid \theta))^{-1} \right]}{E_{\theta \mid y}[(p(y_i \mid \theta))^{-1}]} &= \frac{\frac{p(y_{-i})}{p(y)}E_{\theta \mid y}\left[p(y_i \mid \theta) \frac{p(\theta \mid y_{-i})}{p(\theta \mid y)}\right]}{\frac{p(y_{-i})}{p(y)} E_{\theta \mid y}\left[\frac{p(\theta \mid y_{-i})}{p(\theta \mid y)}\right]}\\ &= \frac{E_{\theta \mid y}\left[p(y_i \mid \theta) \frac{p(\theta \mid y_{-i})}{p(\theta \mid y)}\right]} {\int \frac{p(\theta \mid y_{-i})}{p(\theta \mid y)}p(\theta \mid y) d\theta}\\ &= \frac{E_{\theta \mid y}\left[p(y_i \mid \theta) \frac{p(\theta \mid y_{-i})}{p(\theta \mid y)}\right]} {\int p(\theta \mid y_{-i}) d\theta}\\ &= E_{\theta \mid y}\left[p(y_i \mid \theta) \frac{p(\theta \mid y_{-i})}{p(\theta \mid y)}\right]\\ &= p(y_i \mid y_{-i}). \end{split} \end{equation*}This procedure is known as self-normalized importance sampling. We use a scaling that is only proportional to the ratio between the target distribution and the proposal distribution, but, by dividing by the expectation of the scaling, the normalization constant cancels and we again get the correct result. To actually get an estimate, we use the standard Monte Carlo approximation of the two expectations (as we have already done many times above) to get \[ p(y_i \mid y_{-i}) \approx \frac {\frac{1}{S}\sum_{s = 1}^S p(y_i \mid \theta_s)(p(y_i \mid \theta_s))^{-1}} {\frac{1}{S}\sum_{s = 1}^S (p(y_i \mid \theta_s))^{-1}} = \frac {\sum_{s = 1}^S p(y_i \mid \theta_s)r_s} {\sum_{s = 1}^S r_s}, \] where we have introduced the notation \(r_s := (p(y_i \mid \theta_s))^{-1}\) for the \(s\)'th importance weight and where \(\theta_1, \ldots, \theta_S\) are samples from the full data posterior \(p(\theta \mid y)\). Note that the above expression can be simplified further to give \(\frac{1}{\frac{1}{S} \sum_{s = 1}^S r_s}\), but we prefer to keep it as above, since we will later start modifying the weights in a way where this simplification no longer holds.
Plugging this approximation of \(p(y_i \mid y_{-i})\) into the expression for \(\mathrm{elpd_{loo}}\) now gives us \[ \widehat{\mathrm{elpd}}_{\mathrm{is-loo}} = \sum_{i = 1}^n \log \frac {\sum_{s = 1}^S p(y_i \mid \theta_s)r_s^i} {\sum_{s = 1}^S r_s^i}, \] where we use the notation \(r_s^i := (p(y_i \mid \theta_s))^{-1}\).
Notice that, as promised, this approximation of \(\mathrm{elpd_{loo}}\) only uses \(S\) draws \(\theta_1, \ldots, \theta_S\) from the full data posterior \(p(\theta \mid y)\).
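A minimal numpy sketch of this estimator, reusing the `log_lik` matrix convention from above; the `log_weights` argument is included so that the same function can later be reused with modified weights:

```python
import numpy as np
from scipy.special import logsumexp

def elpd_is_loo(log_lik, log_weights=None):
    """IS-LOO estimate from log_lik[s, i] = log p(y_i | theta_s),
    with theta_s drawn from the full data posterior p(theta | y)."""
    if log_weights is None:
        log_weights = -log_lik                          # log r_s^i = -log p(y_i | theta_s)
    log_num = logsumexp(log_lik + log_weights, axis=0)  # log sum_s p(y_i | theta_s) w_s^i
    log_den = logsumexp(log_weights, axis=0)            # log sum_s w_s^i
    return np.sum(log_num - log_den)
```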
A major problem is that the distribution of the weights may have a heavy right tail (a few weights may get very large), and if the right tail is heavy enough, the estimate can even have infinite variance. One reason for the heavy tail is that the full posterior \(p(\theta \mid y)\) is typically narrower, with lighter tails, than \(p(\theta \mid y_{-i})\), since deleting the \(i\)'th observation makes us more uncertain, so the denominator of \(r_s \propto \frac{p(\theta_s \mid y_{-i})}{p(\theta_s \mid y)}\) can be much smaller than the numerator if \(\theta_s\) is in the tail. A single very large weight, or a few of them, can then dominate the final estimate. A simple way to limit the mass of the right tail of the weight distribution is to bound the weights from above, but this introduces bias. An alternative is needed.
Pareto smoothed importance sampling (PSIS)
The overall idea of Pareto smoothed importance sampling is to fit a Pareto distribution to the largest importance weights and replace the weights by quantiles from the fitted Pareto distribution. We will now go through it in more detail. We reuse the notation \(r_s\) for the weights from the previous section, but now assume that they have been ordered such that \(r_1\) is the smallest weight and \(r_S\) is the largest weight.
The generalized Pareto distribution has three parameters and pdf
\begin{equation*} p(y \mid u, \sigma, k) = \begin{cases} \frac{1}{\sigma}\left(1 + k\left(\frac{y - u}{\sigma}\right)\right)^{-\frac{1}{k} - 1}, & k \ne 0\\ \frac{1}{\sigma}\exp\left(\frac{y - u}{\sigma}\right), & k = 0. \end{cases} \end{equation*}- \(u\) gives the smallest value in the support of the distribution.
- \(\sigma\) gives the scale of the distribution.
- \(k\) gives the shape of the distribution and controls how heavy the tail is. When \(k = 0\) it is equivalent to an exponential distribution with mean \(\sigma\), shifted to the right by \(u\). The larger \(k\) is, the heavier the tail. The first moment exists when \(k < 1\), the second when \(k < 1/2\), and, in general, the \(n\)'th moment exists when \(k < 1/n\).
We wish to model the extreme right tail of the distribution of the importance weights. Vehtari et al., 2022, fit to the \(M = \min(0.2S, 3\sqrt{S})\) largest importance weights. This means that we take at most the 20% largest importance weights, while the \(3\sqrt{S}\) term ensures that we don't fit to a large number of weights when \(S\) is very large (\(3\sqrt{S}\) becomes the minimum when \(S \ge 225\)). As the estimate \(\hat{u}\) we simply use the weight \(r_{S - M}\) (that is, the largest weight that is not among the \(M\) largest weights). Afterwards, \(\hat{k}\) and \(\hat{\sigma}\) are estimated by an approximation of their posterior means, using a weakly informative prior for regularization.
We have now fit the generalized Pareto distribution to the \(M\) largest weights \(r_{S - M + 1}, \ldots, r_S\). Instead of using these \(M\) weights for our importance sampling, we replace them with quantiles from the fitted generalized Pareto distribution: \[ w_{S - M + i} = F^{-1}\left(\frac{i - 1/2}{M}\right). \] For instance, if \(M = 1000\), then the weights are replaced by the 0.5/1000, 1.5/1000, 2.5/1000, …, 998.5/1000, 999.5/1000 quantiles. These correspond to approximations of the expected order statistics. The first \(S - M\) weights \(r_1, \ldots, r_{S - M}\) are used unchanged, that is, we set \(w_1 := r_1, \ldots, w_{S - M} := r_{S - M}\); we only replace the \(M\) largest weights.
This procedure is done separately for leaving out each of the \(n\) data points, yielding \(n\) sets of importance weights. Let \(w^i_1, \ldots, w^i_S\) denote the weights for leaving out the \(i\)'th data point. By plugging the new weights into the importance sampling approximation, we get the PSIS-LOO estimate \[ \widehat{\mathrm{elpd}}_{\mathrm{psis-loo}} = \sum_{i = 1}^n \log \left(\frac{\sum_{s = 1}^S w_s^i p(y_i \mid \theta_s)}{\sum_{s = 1}^S w_s^i}\right). \]
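The following is a simplified sketch of the smoothing step for a single held-out point. scipy's maximum likelihood `genpareto.fit` stands in for the regularized estimator of \(\hat{k}\) and \(\hat{\sigma}\) used by Vehtari et al., and details such as the truncation of the smoothed weights are omitted, so read it as an illustration rather than a faithful implementation (in practice one would use the loo R package or ArviZ):

```python
import numpy as np
from scipy.stats import genpareto

def psis_smooth(r):
    """Smooth the M largest of the raw importance ratios r (shape (S,))."""
    S = len(r)
    M = int(min(0.2 * S, 3 * np.sqrt(S)))
    order = np.argsort(r)
    w = np.asarray(r, dtype=float)[order]
    u = w[S - M - 1]  # r_{S-M}: the largest weight not among the M largest
    # Fit the generalized Pareto distribution to the M largest weights,
    # with the location fixed at u.
    k_hat, _, sigma_hat = genpareto.fit(w[S - M:], floc=u)
    # Replace the M largest weights by the (i - 1/2)/M quantiles of the fit.
    probs = (np.arange(1, M + 1) - 0.5) / M
    w[S - M:] = genpareto.ppf(probs, k_hat, loc=u, scale=sigma_hat)
    # Restore the original ordering so the weights align with the draws.
    smoothed = np.empty(S)
    smoothed[order] = w
    return smoothed, k_hat
```

The smoothed weights then take the place of the raw ratios \(r_s^i\) in the self-normalized estimator above (e.g., passed as `log_weights = np.log(smoothed)` to the `elpd_is_loo` sketch).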
Diagnostics
The fitted \(\hat{k}\) value can serve as a diagnostic for the validity of the importance sampling. Vehtari et al., 2022, show that when \(\hat{k} \le 0.5\), PSIS behaves as if the importance ratios have finite variance, so the central limit theorem holds and the convergence rate is \(1/\sqrt{S}\). Even when \(0.5 < \hat{k} \lessapprox 0.7\), PSIS is reliable, with low bias and variance of the importance sampling approximations, but with a slower convergence rate. When \(0.7 \lessapprox \hat{k}\), they don't recommend using PSIS.
Note that, when doing PSIS-LOO, we do a separate importance sampling for each of the \(n\) terms \(p(y_1 \mid y_{-1}), \ldots, p(y_n \mid y_{-n})\) and thus have \(n\) different fitted \(\hat{k}\) values. A large \(\hat{k}\) value when leaving out the \(i\)'th observation indicates that the \(i\)'th observation is highly influential and that there may be something wrong with the data or the model. This is because the distribution of the importance ratio \(\frac{p(\theta \mid y_{-i})}{p(\theta \mid y)}\) has a heavy right tail when the full data posterior \(p(\theta \mid y)\) is very different from the leave-one-out posterior \(p(\theta \mid y_{-i})\). These two posterior distributions will typically differ more the more influential the \(i\)'th observation is.
If only a few \(i\) give \(\hat{k} > 0.7\), one can in principle sample from each of these \(p(\theta \mid y_{-i})\) and use the standard posterior predictive approximation \(p(y_{i} \mid y_{-i}) \approx \frac{1}{S}\sum_{s = 1}^S p(y_{i} \mid \theta^{-i}_s)\), where \(\theta^{-i}_1, \ldots, \theta^{-i}_S\) are draws from \(p(\theta \mid y_{-i})\), instead of using importance sampling, while still using PSIS-LOO for the rest of the terms. If there are so many such observations that refitting the model for each of them is impractical, one may consider K-fold cross-validation instead, with \(K\) chosen small enough that fitting the model \(K\) times is computationally feasible. However, if you encounter these problems, always start by questioning whether something is wrong with your model or your data.
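In practice one rarely implements any of this by hand. For example, ArviZ computes PSIS-LOO together with the pointwise \(\hat{k}\) values (the file name below is hypothetical; any InferenceData object with a log_likelihood group works):

```python
import arviz as az

idata = az.from_netcdf("fit.nc")        # hypothetical saved fit with a log_likelihood group
loo_res = az.loo(idata, pointwise=True)
print(loo_res)                          # elpd_loo estimate with standard error
print((loo_res.pareto_k > 0.7).sum(), "observations with k-hat > 0.7")
```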
Standard errors
Since PSIS-LOO is the sum of \(n\) independent terms of the form \(\widehat{\mathrm{elpd}}_{\mathrm{psis-loo}, i} := \log \left(\frac{\sum_{s = 1}^S w_s^i p(y_i \mid \theta_s)}{\sum_{s = 1}^S w_s^i}\right)\), we can get a variance estimate as \[ \mathrm{\widehat{var}}\left[\widehat{\mathrm{elpd}}_{\mathrm{psis-loo}}\right] = \sum_{i = 1}^n \mathrm{\widehat{var}}\left[\widehat{\mathrm{elpd}}_{\mathrm{psis-loo}, i} \right] = n \mathrm{\widehat{var}}\left[\widehat{\mathrm{elpd}}_{\mathrm{psis-loo}, i}\right], \] where the variance estimate in the last equality is simply the empirical variance of the \(n\) terms. Taking the square root now gives the standard error.
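As a sketch, given the \(n\) pointwise terms as an array (computed, e.g., per data point by the IS-LOO sketch above):

```python
import numpy as np

def elpd_se(pointwise):
    """Standard error from the (n,) array of pointwise elpd estimates."""
    n = len(pointwise)
    return np.sqrt(n * np.var(pointwise, ddof=1))
```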
Note that the \(n\) terms are all computed from the same posterior samples of \(\theta\) and are therefore not truly independent. Vehtari et al., 2017, mention that this should not cause problems except for small \(n\).
One can also compute a Monte Carlo standard error (the uncertainty due to the finite sampling approximation of the posterior), but it is typically enough to check that it is small enough to be negligible and otherwise draw more samples from the posterior.
When comparing two models, \(A\) and \(B\), it is of interest how large the difference \(\widehat{\mathrm{elpd}}^A_{\mathrm{psis-loo}} - \widehat{\mathrm{elpd}}^B_{\mathrm{psis-loo}}\) is. A standard error for the difference helps in assessing its uncertainty. For example, a difference of 10 with a standard error of 50 will not tell us much about which model gives better out-of-sample predictions. The variance estimate of the difference can be obtained as
\begin{equation*} \begin{split} \widehat{\mathrm{var}}\left[\widehat{\mathrm{elpd}}^A_{\mathrm{psis-loo}} - \widehat{\mathrm{elpd}}^B_{\mathrm{psis-loo}}\right] &= \sum_{i = 1}^n \mathrm{\widehat{var}}\left[\widehat{\mathrm{elpd}}^A_{\mathrm{psis-loo}, i} - \widehat{\mathrm{elpd}}^B_{\mathrm{psis-loo}, i} \right]\\ &= n \mathrm{\widehat{var}}\left[\widehat{\mathrm{elpd}}^A_{\mathrm{psis-loo}, i} - \widehat{\mathrm{elpd}}^B_{\mathrm{psis-loo}, i}\right]. \end{split} \end{equation*}Taking the square root gives the standard error.
Note that the same approach can also be used to get standard errors for \(\widehat{\mathrm{elpd}}_{\mathrm{loo}}\).
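A sketch of the comparison, assuming arrays of pointwise terms for the two models evaluated on the same data:

```python
import numpy as np

def elpd_diff_and_se(pointwise_a, pointwise_b):
    """Difference in elpd between models A and B, with its standard error."""
    diff = np.asarray(pointwise_a) - np.asarray(pointwise_b)
    n = len(diff)
    return diff.sum(), np.sqrt(n * np.var(diff, ddof=1))
```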
WAIC
WAIC is given by \[ \widehat{\mathrm{elpd}}_{\mathrm{waic}} = \widehat{\mathrm{lpd}} - \hat{p}_{\mathrm{waic}}, \] where \(p_{\mathrm{waic}}\) is called the effective number of parameters and is theoretically given by \[ p_{\mathrm{waic}} = \sum_{i = 1}^n \mathrm{var}_{\theta \mid y}[\log p(y_i \mid \theta)]. \] Each of the variances can be estimated by taking the empirical variance of \(S\) posterior samples \(\log p(y_i \mid \theta_1), \ldots, \log p(y_i \mid \theta_S)\).
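A minimal numpy sketch, again assuming a `log_lik` matrix of pointwise log-likelihoods over posterior draws:

```python
import numpy as np
from scipy.special import logsumexp

def elpd_waic(log_lik):
    """WAIC estimate from log_lik[s, i] = log p(y_i | theta_s)."""
    S = log_lik.shape[0]
    lpd_i = logsumexp(log_lik, axis=0) - np.log(S)  # pointwise lpd estimates
    p_waic_i = np.var(log_lik, axis=0, ddof=1)      # pointwise effective number of parameters
    return np.sum(lpd_i - p_waic_i)
```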
WAIC is asymptotically equivalent to leave-one-out cross-validation, but Vehtari et al., 2017, argue that even though \(p_{\mathrm{waic}}\) gives the asymptotic bias of using lpd to estimate out-of-sample log predictive density, it may be quite wrong for finite sample sizes.
Since both \(\widehat{\mathrm{lpd}}\) and \(\hat{p}_{\mathrm{waic}}\) are sums of \(n\) independent terms, we can get a standard error of \(\widehat{\mathrm{elpd}}_{\mathrm{waic}}\) by collecting the two sums in one and taking the empirical variance as we did for PSIS-LOO above. In a similar way, one can also obtain standard errors for differences in \(\widehat{\mathrm{elpd}}_{\mathrm{waic}}\) between different models.
Note that WAIC may be reported on deviance scale (multiplied by -2).
AIC, DIC, and BIC
AIC tries to estimate \(E_{\tilde{y}}[\log p(\tilde{y} \mid \hat{\theta}(y))]\): the expected out-of-sample predictive accuracy given the maximum likelihood estimate. It estimates it by \(\log p(y \mid \hat{\theta}_{\mathrm{mle}}) - \#\mathrm{parameters}\).
DIC tries to estimate \(E_{\tilde{y}}[\log p(\tilde{y} \mid \hat{\theta}_{\mathrm{Bayes}}(y))]\): the expected out-of-sample predictive accuracy given the posterior mean \(\hat{\theta}_{\mathrm{Bayes}} := E[\theta \mid y]\). It estimates it by \(\log p(y \mid \hat{\theta}_{\mathrm{Bayes}}) - p_\mathrm{DIC}\) where \(p_\mathrm{DIC} = 2\left( \log p(y \mid \hat{\theta}_{\mathrm{Bayes}}) - E_{\theta \mid y}[\log p(y \mid \theta)] \right)\).
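As a sketch of the DIC penalty, assuming we have the joint log likelihood evaluated at the posterior mean and at each posterior draw (both inputs are hypothetical, model-specific quantities):

```python
import numpy as np

def p_dic(log_lik_at_post_mean, log_lik_draws):
    """log_lik_at_post_mean: scalar log p(y | theta_Bayes);
    log_lik_draws: (S,) array of log p(y | theta_s) over posterior draws."""
    return 2.0 * (log_lik_at_post_mean - np.mean(log_lik_draws))
```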
AIC and DIC are thus based on point estimates, whereas WAIC uses information from the full posterior distribution. It is unsatisfying to throw away information from the estimated posterior. Furthermore, they are not calculated pointwise for each data point, so we can't get standard errors as for PSIS-LOO and WAIC. There seems to be no reason to use AIC and DIC, given that WAIC and PSIS-LOO have been invented and are computable.
AIC and DIC are typically reported on deviance scale (multiplied by -2).
BIC is motivated by \(p(y)\), the marginal probability of the data, and is thus very different from the other measures considered here and should not be used for evaluating out-of-sample predictive accuracy.
Connection to information theory
Let \(p\) be the truth and let \(p^A_\mathrm{post}\) and \(p^B_\mathrm{post}\) be the posterior predictive densities of two models \(A\) and \(B\). Then the difference between their Kullback-Leibler divergences from the truth is
\begin{equation*} \begin{split} D_{\mathrm{KL}}(p, p^A_\mathrm{post}) - D_{\mathrm{KL}}(p, p^B_\mathrm{post}) &= \int p(y) \log p(y) dy - \int p(y) \log p^A_\mathrm{post}(y) dy - \\ &\left(\int p(y) \log p(y) dy - \int p(y) \log p^B_\mathrm{post}(y) dy\right)\\ &= \int p(y) \log p^B_\mathrm{post}(y) dy - \int p(y) \log p^A_\mathrm{post}(y) dy\\ &= \mathrm{elpd}^B_1 - \mathrm{elpd}^A_1, \end{split} \end{equation*}where I use the subscript 1 to denote that it is only the expected log predictive density for a single new data point and not for a full data set of \(n\) points. This means that \(\mathrm{elpd}_1 = \frac{1}{n}\mathrm{elpd}\), so we now have that \[ D_{\mathrm{KL}}(p, p^A_\mathrm{post}) - D_{\mathrm{KL}}(p, p^B_\mathrm{post}) = \frac{\mathrm{elpd}^B - \mathrm{elpd}^A}{n}. \]
This means that the difference in elpd (scaled by \(1 / n\)) directly gives us the reverse difference in Kullback-Leibler divergence. The larger \(\mathrm{elpd}^B\) is relative to \(\mathrm{elpd}^A\), the further \(p^A_{\mathrm{post}}\) is from the true \(p\) relative to how far \(p^B_{\mathrm{post}}\) is from the true \(p\).
Notice that this enables us to estimate the difference in Kullback-Leibler divergence from the truth between two models, even though we can't estimate either of the two divergences themselves, since the entropy term \(\int p(y)\log p(y) dy\) requires knowing the truth \(p(y)\).
Sometimes one reports deviance instead of elpd. Deviance is defined as \(-2 \cdot \mathrm{elpd}\) such that smaller means better (similar to \(D_{\mathrm{KL}}\)). The "-2" is to get an approximate chi-squared distribution of differences in deviance in certain cases.
Which one should you use?
PSIS-LOO and WAIC are both good choices. They both utilize information from the full posterior distribution to estimate the elpd. The \(\hat{k}\) diagnostic from PSIS-LOO, which allows us to diagnose the reliability of the approximation, makes me prefer it over WAIC.
Summary
- elpd (expected log pointwise predictive density)
  - \(\sum_{i = 1}^n E_{\tilde{y}_i}[ \log p(\tilde{y}_i \mid y) ] = \sum_{i = 1}^n \int p(\tilde{y}_i) \log p(\tilde{y}_i \mid y) d\tilde{y}_i\)
  - Larger is better.
  - Is our overall measure of model performance on unseen data.
  - Can't be calculated, since the true model \(p(\tilde{y}_i)\) is unknown, so we have to estimate it with cross-validation or an information criterion.
- lpd (log pointwise predictive density)
  - \(\sum_{i = 1}^n \log p(y_i \mid y) = \sum_{i = 1}^n \log \int p(y_i \mid \theta)p(\theta \mid y) d\theta\)
  - Approximated by \(\sum_{i = 1}^n \log \frac{1}{S} \sum_{s = 1}^S p(y_i \mid \theta_s)\).
  - Overestimate of elpd, since the model predicts seen data better than unseen data.
  - Building block of more sophisticated estimates of elpd.
- \(\mathrm{elpd_{loo}}\) (leave-one-out)
  - \(\sum_{i = 1}^n \log p(y_i \mid y_{-i})\)
  - Approximated by \(\sum_{i = 1}^n \log \frac{1}{S} \sum_{s = 1}^S p(y_i \mid \theta^{-i}_s)\).
  - Time-consuming, since it requires \(S\) samples from each of the \(n\) posteriors \(p(\theta \mid y_{-1}), \ldots, p(\theta \mid y_{-n})\).
  - Calculated pointwise, so we can get standard errors directly from the empirical variance of the terms of the sum.
- \(\widehat{\mathrm{elpd}}_{\mathrm{is-loo}}\) (importance sampling leave-one-out)
  - \(\sum_{i = 1}^n \log \frac {\sum_{s = 1}^S p(y_i \mid \theta_s)r_s^i}{\sum_{s = 1}^S r_s^i}\), where \(r_s^i = (p(y_i \mid \theta_s))^{-1}\).
  - Only requires \(S\) samples \(\theta_1, \ldots, \theta_S\) from the full data posterior \(p(\theta \mid y)\).
  - Problem: the distribution of the \(r_s^i\) may have a heavy right tail, which can even lead to infinite variance of \(\widehat{\mathrm{elpd}}_{\mathrm{is-loo}}\).
  - Calculated pointwise, so we can get standard errors directly from the empirical variance of the terms of the sum.
- \(\widehat{\mathrm{elpd}}_{\mathrm{psis-loo}}\) (Pareto smoothed importance sampling leave-one-out)
  - \(\sum_{i = 1}^n \log \left(\frac{\sum_{s = 1}^S w_s^i p(y_i \mid \theta_s)}{\sum_{s = 1}^S w_s^i}\right)\), where, with \(M = \min(0.2S, 3\sqrt{S})\), the smallest \(S - M\) weights are the same as for IS-LOO, but the largest \(M\) weights are replaced by quantiles from a fitted generalized Pareto distribution to avoid problems with a heavy-tailed weight distribution.
  - The fitted shape parameter \(\hat{k}\) from the generalized Pareto distribution serves as a diagnostic of fit. The larger \(\hat{k}\) is, the more heavy-tailed the fitted generalized Pareto distribution is.
    - \(\hat{k} \le 0.5\): Very good. Behaves as if the importance ratios have finite variance.
    - \(0.5 < \hat{k} \lessapprox 0.7\): Still fine, but the convergence rate may be slower.
    - \(0.7 \lessapprox \hat{k}\): Don't use PSIS.
  - Calculated pointwise, so we can get standard errors directly from the empirical variance of the terms of the sum.
- \(\widehat{\mathrm{elpd}}_{\mathrm{waic}}\)
  - \(\widehat{\mathrm{lpd}} - \sum_{i = 1}^n \mathrm{var}_{\theta \mid y}[\log p(y_i \mid \theta)]\)
  - In practice the variances in the second term (known as the effective number of parameters) are typically estimated by the empirical variance of posterior samples.
  - Asymptotically equivalent to LOO-CV, but may be quite wrong for finite sample sizes.
  - Calculated pointwise, so we can get standard errors directly from the empirical variance of the terms of the sum.
  - May be reported on the deviance scale (multiplied by -2).
- AIC and DIC
  - Don't try to estimate the full Bayesian elpd, but rather the expected log predictive density given a point estimate (the mle and the posterior mean, respectively).
  - Don't use all information from the posterior distribution.
  - No reason to use them over PSIS-LOO and WAIC.
  - AIC estimate of the expected log predictive density given the mle: \(\log p(y \mid \hat{\theta}_{\mathrm{mle}}) - \#\mathrm{parameters}\).
  - DIC estimate of the expected log predictive density given the posterior mean: \(\log p(y \mid \hat{\theta}_{\mathrm{Bayes}}) - p_\mathrm{DIC}\), where \(p_\mathrm{DIC} = 2\left( \log p(y \mid \hat{\theta}_{\mathrm{Bayes}}) - E_{\theta \mid y}[\log p(y \mid \theta)] \right)\).
  - Typically reported on the deviance scale (multiplied by -2).
- BIC
  - Does not try to estimate out-of-sample predictive accuracy, so it is not comparable to the other measures.
- Deviance is \(-2 \cdot \mathrm{elpd}\).
  - The "-2" is to get an approximate chi-squared distribution of deviance differences in certain cases.
  - Smaller is better.
- Connection between elpd and Kullback-Leibler divergence:
  - \(D_{\mathrm{KL}}(p, p^A_\mathrm{post}) - D_{\mathrm{KL}}(p, p^B_\mathrm{post}) = \frac{\mathrm{elpd}^B - \mathrm{elpd}^A}{n}\), where \(A\) and \(B\) are two models and \(p\) is the truth.
References
- Vehtari et al., 2022, Pareto Smoothed Importance Sampling, https://arxiv.org/abs/1507.02646, current version: v8. v1 is from 2015.
- Vehtari et al., 2017, Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC, https://link.springer.com/article/10.1007/s11222-016-9696-4.
- Gelman et al., 2014, Understanding predictive information criteria for Bayesian models, https://link.springer.com/article/10.1007/s11222-013-9416-2. Note that this is essentially the same as Chapter 7 of the book below.
- Gelman et al., 2014, Bayesian Data Analysis, Third Edition, http://www.stat.columbia.edu/~gelman/book/.
- Richard McElreath, 2020, Statistical Rethinking, Second Edition, https://xcelab.net/rm/statistical-rethinking/.
- Alan E Gelfand, 1995, Model determination using sampling based methods. This is chapter 9 from the book Markov Chain Monte Carlo in Practice, 1995, edited by Gilks et al.