8 Model Comparison for Large Data
8.1 Introduction
Comparing Bayesian models on large datasets presents a computational challenge. Full leave-one-out cross-validation (LOO) requires fitting \(n\) posteriors, and brute-force K-fold cross-validation likewise becomes prohibitively expensive as the data size grows.
To overcome these challenges, we will use the approach described in Magnusson et al. (2020) to combine fast approximations with targeted, exact computations. The key idea is to compute cheap approximations, called surrogates, for all \(n\) observations, evaluate exact Pareto smoothed importance sampling (PSIS) LOO-CV on only a small random subsample, and combine the two with a difference estimator that corrects the surrogates for their approximation error across the full dataset.
Predictive performance is summarized by the expected log predictive density estimated via LOO as
\[ \overline{\operatorname{elpd}}_{\text{loo}} = \frac{1}{n} \sum_{i=1}^n \pi_i, \]
where \(\pi_i = \log p(y_i\mid y_{-i})\) is the LOO predictive density for observation \(i\) and \(y_{-i}\) is the dataset with observation \(i\) removed.
In the example for this chapter using the wells data, we demonstrate the complete workflow for model comparison with subsampled PSIS-LOO-CV. We fit a logistic regression model using both MCMC sampling and Laplace approximations to show how posterior approximations can further reduce computational costs. We implement a lightweight log-likelihood helper that enables efficient evaluation on arbitrary observation subsets, demonstrate subsampled PSIS-LOO-CV using different surrogates (PLPD and LPD), and compare multiple models with subsampled LOO.
8.2 Background
The content in this chapter builds on foundational concepts covered in Chapter 7. For detailed explanations of expected log predictive density (ELPD), leave-one-out cross-validation (LOO-CV), Pareto smoothed importance sampling (PSIS), and model comparison principles, please refer to that chapter. Here we focus specifically on the subsampling techniques and difference estimator needed for large datasets.
8.2.1 From LOO to subsampled LOO
Recall from Chapter 7 that PSIS-LOO-CV uses importance sampling to approximate LOO-CV without refitting the model \(n\) times. The Pareto-\(k\) diagnostic flags unreliable importance sampling approximations when \(k > 0.7\). For large \(n\), even this efficient computation becomes expensive.
The key insight for large data is to combine three elements:
- Fast surrogates \(\tilde\pi_i\) computed for all \(n\) observations (e.g., point predictions)
- Exact PSIS-LOO-CV computed only on a small random subsample \(\mathcal{S}\) of size \(m \ll n\)
- The difference estimator to correct the surrogates using the subsample
These elements dramatically reduce computational cost while maintaining statistical accuracy. We can also extend the importance ratios with the correction term \(p(\theta_s \mid y)/q(\theta_s \mid y)\) when using approximate posteriors from variational inference or Laplace approximations to further reduce computational cost.
A surrogate is simply a computationally inexpensive approximation of each observation’s LOO predictive density, such as evaluating the likelihood at the posterior mean, that doesn’t require refitting the model. The difference estimator then uses the subsample to correct for approximation errors across the full dataset. The result is accurate estimates of predictive performance with a fraction of the computational cost, making model comparison feasible even when \(n\) is very large.
A critical component of the difference estimator is using simple random sampling to select the subsample, rather than model-specific importance sampling schemes. This allows the auxiliary information, e.g., the surrogates \(\tilde\pi_i\), to be used in the estimation stage rather than the sampling stage, meaning the same subsample can be efficiently reused for all models and all quantities of interest. Alternative approaches that incorporate model-specific information during sampling, such as the Hansen-Hurwitz estimator, which we will describe later, require drawing separate subsamples for each model and each comparison, multiplying computational costs.
The quality of surrogate approximations \(\tilde\pi_i\) plays a key role in determining the required subsample size. Surrogates that incorporate model complexity through the effective number of parameters \(p_{\text{eff}} = V_\theta(\log p(y_i \mid \theta))\) tend to approximate the true LOO contributions better than simple point predictions, especially for hierarchical or weakly identified models. This improvement can reduce the required subsample size by orders of magnitude. However, computing better surrogates requires more computation per observation, which creates a fundamental trade-off between surrogate quality and subsample size that should be carefully considered in practice.
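For intuition, the per-observation contributions to \(p_{\text{eff}}\) can be read directly off a pointwise log-likelihood array. The sketch below assumes a hypothetical xarray DataArray log_lik with chain, draw, and obs_id dimensions; it is an illustration, not part of the workflow code used later.

# Variance of log p(y_i | theta) across all posterior draws, per observation,
# then summed over observations to give the effective number of parameters.
p_eff_i = log_lik.var(dim=("chain", "draw"))
p_eff = p_eff_i.sum("obs_id")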
8.2.2 Model comparison for large data
When comparing models \(A\) and \(B\), we are interested in the difference in predictive performance
\[ \operatorname{elpd}_D = \operatorname{elpd}_A - \operatorname{elpd}_B, \]
along with its variance
\[ V(\operatorname{elpd}_D) = V(\operatorname{elpd}_A) + V(\operatorname{elpd}_B) - 2\operatorname{Cov}(\operatorname{elpd}_A, \operatorname{elpd}_B). \]
The covariance term is crucial and often overlooked in practice. When models make similar predictions for most observations, which is common when comparing related models, the predictions are highly correlated, making \(\operatorname{Cov}(\operatorname{elpd}_A, \operatorname{elpd}_B)\) large and positive. This correlation dramatically reduces the variance of the difference compared to naively summing individual variances.
This is why using a common subsample across models is essential. Different random subsamples lose this correlation and fail to capture the variance reduction, whereas the same subsample preserves the correlation structure and accurately estimates both \(\operatorname{elpd}_D\) and its much smaller variance, as we will see in the wells example.
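To get a sense of the magnitude, consider the wells comparison later in this chapter: the per-model standard errors are roughly \(15.6\) and \(16.2\), so treating the models as independent would give a difference SE of \(\sqrt{15.6^2 + 16.2^2} \approx 22.5\), whereas accounting for the covariance on a common subsample yields a difference SE of about \(4.4\).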
8.2.2.1 Previous approaches and practical limitations
Before describing the central difference estimator used in LOO-CV for large data, it is important to understand the limitations of previous methods for scaling LOO to large data and model comparison. One previous approach is the Hansen-Hurwitz (HH) estimator (Hansen and Hurwitz 1943), which uses importance sampling with auxiliary information incorporated at the sampling stage. For estimating \(\operatorname{elpd}_{\text{loo}}\), observations are subsampled with probabilities proportional to an approximate LOO contribution, \(p_i \propto \tilde{\pi}_i\) with \(\sum_{i=1}^n p_i = 1\), giving
\[ \widehat{\operatorname{elpd}}_{\text{HH}} = \frac{1}{m} \sum_{j \in \mathcal{S}} \frac{\pi_j}{p_j}, \]
where \(\mathcal{S}\) is the subsample of size \(m\). While the HH estimator works well for individual model evaluation, it has two key limitations for model comparison:
- Since the auxiliary information \(\tilde{\pi}_i\) is model-specific and used in the sampling step, each model requires a different subsample. Comparing models or estimating \(V(\operatorname{elpd}_D)\) requires drawing additional subsamples for each quantity of interest, multiplying computational costs.
- Using point estimates like \(\log p(y_i \mid \hat{\theta})\) as auxiliary information ignores the effective number of parameters \(p_{\text{eff}}\), which can lead to poor approximations for complex models and require larger subsample sizes.
These practical limitations motivate the difference estimator approach described below, which uses the same subsample for all models and incorporates model complexity directly in the surrogate approximation.
8.2.3 Difference estimator for large data
The key to subsampling LOO for large datasets is the difference estimator, which corrects the bias in fast approximations of the full-data LOO. Let \(\pi_i = \log p(y_i\mid y_{-i})\) denote the exact LOO predictive density for observation \(i\), and let \(\tilde\pi_i\) be a computationally efficient approximation. For a simple random subsample \(\mathcal S\) of size \(m\), the difference estimator is given by
\[ \widehat{\operatorname{elpd}}_{\text{diff, loo}} = \sum_{i=1}^n \tilde\pi_i + \frac{n}{m} \sum_{j\in\mathcal S} (\pi_j - \tilde\pi_j). \]
Its subsampling variance is
\[ V\bigl(\widehat{\operatorname{elpd}}_{\text{diff, loo}}\bigr) = n^2\!\left(1-\frac{m}{n}\right) \frac{s_e^2}{m}, \]
where \(s_e^2 = \frac{1}{m-1} \sum_{j\in\mathcal S} (e_j-\bar e)^2\) is the sample variance of the approximation error \(e_j=\pi_j-\tilde\pi_j\), and \(\bar e = \frac{1}{m} \sum_{j\in\mathcal S} e_j\).
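To make these formulas concrete, the following minimal sketch (not the ArviZ implementation) computes the difference estimator and its subsampling variance from hypothetical arrays of surrogate values and exact PSIS-LOO values on a simple random subsample.

import numpy as np

def difference_estimator(pi_tilde, subsample_idx, pi_exact_sub):
    """Sketch of the difference estimator and its subsampling variance.

    pi_tilde      : surrogate log predictive densities for all n observations
    subsample_idx : indices of the m observations in the simple random sample
    pi_exact_sub  : exact PSIS-LOO values for those m observations
    """
    n, m = pi_tilde.size, len(subsample_idx)
    e = pi_exact_sub - pi_tilde[subsample_idx]          # approximation errors e_j
    elpd_hat = pi_tilde.sum() + n / m * e.sum()         # difference estimator
    var_hat = n**2 * (1 - m / n) * e.var(ddof=1) / m    # subsampling variance
    return elpd_hat, var_hat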
The difference estimator has two key properties that make it ideal for large-data model comparison:
- The variance decreases as the surrogate quality improves: as \(\tilde\pi_i \to \pi_i\), the approximation error \(e_j \to 0\) and the subsampling variance vanishes.
- The term \((1-m/n)\) is a finite population correction factor that ensures the variance also vanishes as the subsample size approaches the full dataset size. This means we can start with a small subsample and incrementally increase it until we achieve the desired precision, without wasting computational effort.
Crucially, the same subsample can be reused across all models. To estimate model differences \(\operatorname{elpd}_D = \operatorname{elpd}_A - \operatorname{elpd}_B\), we simply apply the difference estimator with the surrogate differences \(\tilde\pi_{i,D} = \tilde\pi_{i,A} - \tilde\pi_{i,B}\) and the exact differences \(\pi_{j,D} = \pi_{j,A} - \pi_{j,B}\) on the common subsample.
For model comparison, we also need the variability of pointwise ELPD contributions, \(\sigma^2_{\text{loo}} = \tfrac{1}{n}\sum_i (\pi_i-\bar\pi)^2\). Using the same subsample with squared surrogates \(\tilde\pi_i^2\) yields the unbiased estimator
\[ \begin{align} \hat{\sigma}^2_{\text{diff, loo}} = \sum_{i=1}^n \tilde\pi_i^2 &+ \frac{n}{m} \sum_{j\in\mathcal S} (\pi_j^2 - \tilde\pi_j^2) \\ &\quad + \frac{1}{n}\biggl[\Bigl( \frac{n}{m}\sum_{j\in\mathcal S} (\pi_j-\tilde\pi_j) \Bigr)^2 - V\bigl(\widehat{\operatorname{elpd}}_{\text{diff, loo}}\bigr)\biggr] \\ &\quad + \frac{1}{n}\biggl[ 2\Bigl(\sum_{i=1}^n \tilde\pi_i\Bigr)\widehat{\operatorname{elpd}}_{\text{diff, loo}} - \Bigl(\sum_{i=1}^n \tilde\pi_i\Bigr)^2 \biggr]. \end{align} \]
The estimators \(\widehat{\operatorname{elpd}}_{\text{diff, loo}}\) and \(\hat{\sigma}^2_{\text{diff, loo}}\) are unbiased for \(\operatorname{elpd}_{\text{loo}}\) and \(\sigma^2_{\text{loo}}\), respectively (Magnusson et al. 2020). In practice, \(\hat{\sigma}^2_{\text{diff, loo}}\) tends to be optimistic since no general unbiased estimator of the true variability exists in cross-validation (Bengio and Grandvalet 2004).
8.2.4 Fast LOO surrogates
For the difference estimator to be efficient, we need good approximations of the LOO contributions. We want surrogates \(\tilde\pi_i\) that
- Approximate \(\pi_i\) well in finite samples
- Are computationally cheap
- Converge in mean, \(\mathbb{E}|\pi_i - \tilde\pi_i| \to 0\) as \(n \to \infty\)
Notably, the third property ensures favorable scaling characteristics as the data size grows.
8.2.4.1 Practical surrogates: PLPD and LPD
In practice, the function loo_subsample in ArviZ implements two main surrogates.
The first is the point log predictive density (PLPD) and is given by
\[ \tilde\pi_i^{\text{PLPD}} = \log p(y_i \mid \hat\theta), \]
where \(\hat\theta\) is typically the posterior mean. This is computationally cheapest but ignores posterior uncertainty.
The second is the log predictive density (LPD) and is given by
\[ \tilde\pi_i^{\text{LPD}} = \log \left(\frac{1}{S} \sum_{s=1}^S p(y_i \mid \theta_s)\right), \]
which averages over posterior draws before taking the logarithm. The approximation error \(e_j = \pi_j - \tilde\pi_j\) directly affects the subsampling variance, so better surrogates allow smaller subsamples for the same precision.
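Both surrogates are cheap to compute from a pointwise log-likelihood array. The sketch below uses hypothetical NumPy arrays log_lik (an \(S \times n\) array of \(\log p(y_i \mid \theta_s)\)) and log_lik_point (length-\(n\) log densities evaluated at the posterior mean); it is an illustration of the two definitions, not the ArviZ implementation.

import numpy as np
from scipy.special import logsumexp

# PLPD surrogate: a single likelihood evaluation per observation at theta_hat.
plpd_surrogate = log_lik_point

# LPD surrogate: average the density over draws, then take the log,
# using logsumexp for numerical stability.
lpd_surrogate = logsumexp(log_lik, axis=0) - np.log(log_lik.shape[0])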
8.2.5 Asymptotic guarantees and assumptions
The difference estimator has strong theoretical foundations with convergence guarantees under mild regularity conditions. Remarkably, convergence accelerates as data size grows, even with fixed subsample size and fixed number of posterior draws, making the method more efficient precisely when needed most.
The required assumptions are standard regularity conditions satisfied by most common parametric models including regression, GLMs, and hierarchical models. For detailed theoretical treatment of the asymptotic properties and regularity conditions, see Magnusson et al. (2020).
8.2.6 Workflow for model comparison
The workflow for applying subsampled PSIS-LOO-CV to model comparison is straightforward. We compute cheap surrogates for all observations, then refine our estimates using PSIS-LOO-CV on a small subsample, balancing computational efficiency with statistical accuracy:
- Compute posteriors \(p_A(\theta \mid y)\) and \(p_B(\theta \mid y)\) for models \(A\) and \(B\)
- Calculate \(\tilde\pi_i\) for all \(n\) observations using an appropriate approximation
- Use simple random sampling to select \(m\) observations (start with \(m \approx 100\))
- Calculate \(\pi_j\) for the subsample using PSIS-LOO-CV, checking Pareto-\(k\) diagnostics for unreliable importance sampling approximations
- Estimate \(\operatorname{elpd}_A\), \(\operatorname{elpd}_B\), and their difference
- If the subsampling SE is too large relative to the estimated difference, increase \(m\) using update_subsample()
Again, the key advantage is that the same subsample can be reused across all models, making the approach highly efficient for comparing multiple models. The difference \(\operatorname{elpd}_D = \operatorname{elpd}_A - \operatorname{elpd}_B\) and its variance are estimated directly from the common subsample, properly accounting for the correlation between models’ predictions.
This reuse is what makes the difference estimator superior to the Hansen-Hurwitz approach for model comparison tasks. This is also the reason why we recommend using the same subsample for all models, rather than subsampling each model separately.
8.3 Wells data and logistic model
To demonstrate the PSIS-LOO-CV subsampling method, we use data from a household survey in a region of Bangladesh where drinking water was contaminated with arsenic. Households with unsafe arsenic levels in their wells were asked whether they would switch to using a neighbor’s safe well. The goal is to predict this binary switching decision using household characteristics.
We focus on a simple logistic regression with two key predictors: the arsenic concentration in the household’s well and the distance to the nearest safe well. The dataset contains \(n = 3020\) observations. While this is not massive by modern standards, it is large enough that efficient computational methods become valuable. The advantage of this sample size is that individual observations have limited influence on the overall model fit, which helps ensure stable importance sampling computations. The data is available from the R package loo.
We model the indicator of switching water wells using a logistic regression with intercept, arsenic concentration, and distance to the nearest safe well in hundreds of metres. The design matrix is \(X = [1, \text{dist}/100, \text{arsenic}]\) and the likelihood is given by
\[ \Pr(y_i = 1 \mid \beta) = \operatorname{logit}^{-1}(X_i^{\mathsf{T}} \beta), \qquad \beta \sim \mathcal{N}(0, I_3). \]
We can fit the model using PyMC or CmdStanPy, with the posterior draws stored in a DataTree object.

import pymc as pm
import xarray as xr

with pm.Model():
    beta = pm.Normal("beta", mu=0, sigma=1, shape=3)
    logit_p = pm.math.dot(X, beta)
    pm.Bernoulli("y", logit_p=logit_p, observed=y)
    idata = pm.sample(
        draws=1000,
        tune=1000,
        chains=4,
        random_seed=4711,
        progressbar=True,
        idata_kwargs={"log_likelihood": True},
    )

data = azp.convert_to_datatree(idata)
data["constant_data"] = xr.Dataset({"X": (["obs_id", "coef"], X)})

8.3.1 Constructing the log-likelihood helper
Unlike regular loo(), the routine loo_subsample() expects a callable with signature log_lik_fn(observed, data). This lets us compute the log-likelihood for a subset of observations on demand rather than storing log-likelihood values for all observations, which can be computationally expensive for large datasets.
This callable signature is the baseline expectation for loo_subsample(), and we can always wrap auxiliary parameters with tools such as functools.partial so the routine still receives (observed, data).
For logistic regression, the per-draw log probability mass function is given by
\[ \log p(y_i \mid \beta_s, X_i) = y_i \log \pi_{is} + (1 - y_i) \log (1 - \pi_{is}), \qquad \pi_{is} = \operatorname{logit}^{-1}(X_i^{\mathsf{T}} \beta_s). \]
A direct approach is to store the design matrix in the constant_data group so that the helper can directly compute the log-likelihood for whichever subset loo_subsample() hands to it.
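A minimal sketch of one possible helper is shown below. It assumes the data were stored with an obs_id coordinate that the subset handed over by loo_subsample() retains, so the matching rows of the design matrix can be selected from constant_data; the exact subset representation may differ, in which case the indexing needs to be adapted.

import numpy as np
import xarray as xr

def log_lik_fn(observed, data):
    # Posterior draws of beta, with the parameter dimension renamed so it
    # aligns with the "coef" dimension of the stored design matrix.
    beta = data.posterior["beta"]
    param_dim = [d for d in beta.dims if d not in ("chain", "draw")][0]
    beta = beta.rename({param_dim: "coef"})
    # Rows of the design matrix matching the requested observations
    # (assumes a shared obs_id coordinate).
    X_sub = data.constant_data["X"].sel(obs_id=observed["obs_id"])
    # Bernoulli log-pmf for every posterior draw and selected observation.
    logit_p = xr.dot(X_sub, beta, dims=["coef"])
    p = 1 / (1 + np.exp(-logit_p))
    return observed * np.log(p) + (1 - observed) * np.log1p(-p)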
This log-likelihood helper evaluates the Bernoulli log-pmf for the selected subset and returns log probabilities for each posterior draw. No log-likelihood values are stored permanently, which keeps memory usage modest even when \(n\) is large.
8.4 Subsampled PSIS-LOO-CV with the PLPD surrogate
As we’ve discussed previously, a key component of the PSIS-LOO-CV subsampling method is the notion of a fast surrogate for the leave-one-out term \(\pi_i\). One such surrogate is the point log predictive density (PLPD) which Magnusson et al. (2020) motivate by replacing the leave-one-out term \(\pi_i\) with
\[ \tilde\pi_i^{\text{plpd}} = \log p(y_i \mid X_i, \hat{\beta}), \quad \hat{\beta} = \mathbb{E}[\beta \mid y]. \]
The approximation reduces the computational burden to a single evaluation per observation while retaining the shrinkage induced by the posterior mean. Inserting \(\tilde\pi_i^{\text{plpd}}\) into the difference estimator converts the exact correction to a subsample average that only requires re-evaluating the full posterior draws on \(m\) observations.
Here we use the method="plpd" argument to loo_subsample() to use the PLPD surrogate.
loo_plpd = azp.loo_subsample(
    data=data,
    observations=100,
    var_name="y",
    method="plpd",
    log_lik_fn=log_lik_fn,
    param_names=["beta"],
    pointwise=True,
    seed=SEED,
)
loo_plpd

Computed from 4000 by 100 subsampled log-likelihood values from 3020 total observations.

          Estimate     SE  subsampling SE
elpd_loo   -1968.4   15.6             0.3
p_loo          3.1

------

Pareto k diagnostic values:
                          Count    Pct.
(-Inf, 0.70]  (good)        100  100.0%
 (0.70, 1]    (bad)           0    0.0%
 (1, Inf)     (very bad)      0    0.0%
The output inherits from the ELPDData interface with an additional column subsampling_se isolating the extra Monte Carlo error introduced by drawing only \(100\) of the \(3020\) observations. In our wells run with \(m=100\) we obtained \(\widehat{\operatorname{elpd}}_{\text{loo}}\approx-1968.4\) with \(\text{SE}\approx15.6\) and \(\text{subsampling SE} \approx0.3\). All Pareto‑\(k\) values on the subsample were \(\le 0.7\).
8.5 Subsampled PSIS-LOO-CV with the LPD surrogate
When the posterior mass is diffuse or strongly correlated, evaluating the full posterior predictive density can improve the approximation. We can use the log predictive density (LPD) surrogate given by
\[ \tilde\pi_i^{\text{lpd}} = \log \left( \frac{1}{S} \sum_{s=1}^S p(y_i \mid X_i, \beta_s) \right), \]
where \(\beta_s\) are the draws stored in data.posterior. The additional sum over draws raises the cost and often reduces the approximation error. The same subsample size and random seed ensure that the estimator reuses the indices chosen for the PLPD computation, which makes comparisons straightforward.
loo_lpd = azp.loo_subsample(
    data=data,
    observations=100,
    var_name="y",
    method="lpd",
    pointwise=True,
    seed=SEED,
)
loo_lpd

Computed from 4000 by 100 subsampled log-likelihood values from 3020 total observations.

          Estimate     SE  subsampling SE
elpd_loo   -1968.5   15.6             0.4
p_loo          3.2

------

Pareto k diagnostic values:
                          Count    Pct.
(-Inf, 0.70]  (good)        100  100.0%
 (0.70, 1]    (bad)           0    0.0%
 (1, Inf)     (very bad)      0    0.0%
From the output, we can see that the estimated ELPD closely matches the PLPD result, both about \(-1968.5\) with \(\text{SE}\approx15.6\), and the subsampling SEs are very similar (PLPD \(\approx0.3\) vs. LPD \(\approx0.4\)).
For well-identified models with tight posteriors, PLPD can match or even slightly outperform LPD. For more complex models, LPD often gains ground. Taken together, the parallel outputs illustrate the theoretical trade-off: PLPD minimises computation, whereas LPD expends extra work to tighten the estimator. For larger \(m\) the difference between the two surrogates typically diminishes, and the choice becomes a balance between runtime and stability.
8.6 Incremental sampling updates
Magnusson et al. (2020) recommend increasing \(m\) until the subsampling uncertainty is negligible for the task at hand. The update_subsample() routine efficiently extends an existing subsample without discarding work already done. Here we add 200 additional observations (from 100 to 300 total in the subsample) by using the previous subsample as a starting point.
loo_update = azp.update_subsample(
    loo_plpd,
    data,
    observations=200,
    var_name="y",
    method="plpd",
    log_lik_fn=log_lik_fn,
    param_names=["beta"],
    seed=SEED,
)
loo_update

Computed from 4000 by 300 subsampled log-likelihood values from 3020 total observations.

          Estimate     SE  subsampling SE
elpd_loo   -1968.4   15.6             0.2
p_loo          3.1

------

Pareto k diagnostic values:
                          Count    Pct.
(-Inf, 0.70]  (good)        300  100.0%
 (0.70, 1]    (bad)           0    0.0%
 (1, Inf)     (very bad)      0    0.0%
The updated result shows how the estimator stabilises as more observations are included. In our run, the subsampling SE decreased from about \(0.3\) to \(0.2\), while the estimated ELPD and \(p_{\text{loo}}\) remained essentially unchanged.
8.7 Approximate LOO-CV using PSIS-LOO with posterior approximations
Using posterior approximations such as variational inference or Laplace approximations can further accelerate LOO-CV for large data. These methods avoid the computational cost of full MCMC while maintaining reasonable approximation quality for well-behaved models.
We will use the Laplace approximation which fits a multivariate normal distribution centered at the posterior mode \(\hat{\theta}\) with covariance equal to the inverse Hessian of the negative log posterior. For a model with likelihood \(p(y \mid \theta)\) and prior \(p(\theta)\), the Laplace approximation is
\[ q(\theta \mid y) = \mathcal{N}(\theta \mid \hat{\theta}, \Sigma), \]
where
\[ \hat{\theta} = \arg\max_\theta \log p(\theta \mid y) \quad \text{ and } \quad \Sigma = [-\nabla^2 \log p(\theta \mid y)]^{-1}. \]
Draws from \(q(\theta \mid y)\) can be obtained efficiently by sampling from the fitted normal distribution.
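For instance, with SciPy this amounts to sampling from a multivariate normal with the estimated mode and covariance; theta_hat and Sigma below are hypothetical names for those quantities, and this sketch is only meant to illustrate the idea (fit_laplace, used next, handles this internally).

import numpy as np
from scipy.stats import multivariate_normal

# Draw 2000 samples from q(theta | y) = N(theta_hat, Sigma).
draws_q = multivariate_normal.rvs(
    mean=theta_hat, cov=Sigma, size=2000,
    random_state=np.random.default_rng(4711),
)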
The following fits the wells model using a Laplace approximation via pymc_extras.fit_laplace, which returns draws from the approximate posterior along with the mode and covariance estimates.
import pymc as pm
from pymc_extras import fit_laplace

with pm.Model() as model_laplace:
    beta = pm.Normal("beta", mu=0, sigma=1, shape=3)
    logit_p = pm.math.dot(X, beta)
    pm.Bernoulli("y", logit_p=logit_p, observed=y)
    idata_laplace = fit_laplace(
        chains=4,
        draws=2000,
        random_seed=SEED,
        progressbar=True,
    )

The fit_laplace function stores the posterior mode in the fit group as mean_vector and the covariance in covariance_matrix. Posterior draws are stored in the posterior group as usual. We convert the result to a DataTree and attach the observed data and design matrix for use with loo_approximate_posterior().
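The conversion step might look as follows; the group names mirror the MCMC fit above, and attaching the observed data explicitly is an assumption about how the result was prepared (fit_laplace may already include an observed_data group).

data_laplace = azp.convert_to_datatree(idata_laplace)
# Attach the design matrix and observed outcomes so the helper and the
# log density computations below can access them.
data_laplace["constant_data"] = xr.Dataset({"X": (["obs_id", "coef"], X)})
data_laplace["observed_data"] = xr.Dataset({"y": (["obs_id"], y)})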
8.7.1 Computing log density corrections
To use loo_approximate_posterior(), we must supply log_p and log_q, which represent the log density of the true posterior and the approximate posterior, respectively. The importance ratios are then adjusted to account for the approximation quality with
\[ r(\theta_s) \propto \frac{1}{p(y_i \mid \theta_s)} \frac{p(\theta_s \mid y)}{q(\theta_s \mid y)}. \]
For the Laplace approximation, \(\log q(\theta_s \mid y)\) is the log density of the fitted normal distribution, while \(\log p(\theta_s \mid y)\) is the log prior plus the log likelihood. We require both of these quantities to be able to compute the importance ratios.
import numpy as np
import xarray as xr
from scipy.stats import multivariate_normal

beta_samples = data_laplace.posterior["beta"]
X_xr = data_laplace.constant_data["X"]
y_xr = data_laplace.observed_data["y"]

# Rename the parameter dimension so it aligns with the "coef" dimension of X
param_dim = [d for d in beta_samples.dims if d not in ["chain", "draw"]][0]
beta_renamed = beta_samples.rename({param_dim: "coef"})

# Pointwise Bernoulli log-likelihood for every posterior draw
logit_pred = xr.dot(X_xr, beta_renamed, dims=["coef"])
prob = 1 / (1 + xr.ufuncs.exp(-logit_pred))
log_lik_laplace = y_xr * xr.ufuncs.log(prob) + (1 - y_xr) * xr.ufuncs.log(1 - prob)
data_laplace["log_likelihood"] = xr.Dataset({"y": log_lik_laplace})

mean_vals = data_laplace.fit["mean_vector"].values
cov_matrix = data_laplace.fit["covariance_matrix"].values

# This applies the multivariate normal log density function to each draw along
# the last axis, which is the parameter dimension
log_q_vals = np.apply_along_axis(
    lambda b: multivariate_normal.logpdf(b, mean=mean_vals, cov=cov_matrix),
    axis=-1,
    arr=beta_renamed.values,
)
log_q_da = xr.DataArray(
    log_q_vals,
    dims=["chain", "draw"],
    coords={"chain": beta_renamed.chain, "draw": beta_renamed.draw},
)

# Unnormalized log posterior: standard normal log prior plus the
# log likelihood summed over all observations
log_prior = -0.5 * (beta_renamed ** 2).sum("coef") - 1.5 * np.log(2 * np.pi)
log_lik_sum = log_lik_laplace.sum("obs_id")
log_p_da = log_prior + log_lik_sum

The log prior uses the standard normal density \(\mathcal{N}(0, I_3)\) specified in the model. The log likelihood sums over all observations to obtain the joint log density. These quantities are then passed to loo_approximate_posterior() to compute corrected LOO estimates.
8.7.2 Running LOO-CV with approximate posterior
With log_p and log_q in hand, we can compute LOO-CV using the approximate posterior. The loo_approximate_posterior() function adjusts the importance ratios to account for the discrepancy between the true posterior and the Laplace approximation.
loo_ap = azp.loo_approximate_posterior(
    data=data_laplace,
    log_p=log_p_da,
    log_q=log_q_da,
    var_name="y",
    pointwise=True,
)
loo_ap

Computed from 8000 posterior samples and 3020 observations log-likelihood matrix.
Posterior approximation correction used.

          Estimate     SE
elpd_loo  -1968.42  15.59
p_loo         3.17      -

------

Pareto k diagnostic values:
                          Count    Pct.
(-Inf, 0.70]  (good)       3020  100.0%
 (0.70, 1]    (bad)           0    0.0%
 (1, Inf)     (very bad)      0    0.0%
The output shows how the approximate posterior performs relative to MCMC. The Pareto-\(k\) diagnostics remain available and indicate whether the importance sampling correction is reliable. The line “Posterior approximation correction used” confirms that the importance ratios were adjusted for the approximate posterior.
8.8 Combining approximate posterior with subsampling
The approximate posterior correction can be combined with subsampling to further reduce computational cost.
When using loo_subsample() with an approximate posterior, we simply pass log_p and log_q along with the subsample specification. The difference estimator then corrects for both the approximation error and the subsampling uncertainty.
loo_ap_ss = azp.loo_subsample(
    data=data_laplace,
    observations=100,
    var_name="y",
    method="plpd",
    log_lik_fn=log_lik_fn,
    param_names=["beta"],
    log_p=log_p_da,
    log_q=log_q_da,
    pointwise=True,
    seed=SEED,
)
loo_ap_ss

Computed from 8000 by 100 subsampled log-likelihood values from 3020 total observations.
Posterior approximation correction used.

          Estimate     SE  subsampling SE
elpd_loo   -1968.6   15.6             0.4
p_loo          3.3

------

Pareto k diagnostic values:
                          Count    Pct.
(-Inf, 0.70]  (good)        100  100.0%
 (0.70, 1]    (bad)           0    0.0%
 (1, Inf)     (very bad)      0    0.0%
The combined approach leverages the strengths of both methods. The Laplace approximation avoids expensive MCMC sampling, while subsampling reduces the cost of evaluating the full dataset. The correction terms ensure the estimates remain unbiased despite the double approximation.
In our wells example, the subsampled approximate posterior LOO produces estimates very close to both the full MCMC and the full approximate posterior results, with negligible subsampling SE. This demonstrates that for well-behaved models, aggressive approximation can yield accurate predictive performance estimates at minimal computational cost.
8.9 Comparing models with subsampled LOO
We can compare models that are estimated via subsampling just like we would for regular LOO using the compare() function. We will fit a second model using log(arsenic) instead of arsenic as a predictor and compare the two models.
X_log = X.copy()
X_log[:, 2] = np.log(X[:, 2])

with pm.Model():
    beta2 = pm.Normal("beta", mu=0, sigma=1, shape=3)
    logit_p2 = pm.math.dot(X_log, beta2)
    pm.Bernoulli("y", logit_p=logit_p2, observed=y)
    idata2 = pm.sample(
        draws=1000,
        tune=1000,
        chains=4,
        random_seed=SEED,
        progressbar=True,
        idata_kwargs={"log_likelihood": True},
    )

data2 = azp.convert_to_datatree(idata2)
data2["constant_data"] = xr.Dataset({"X": (["obs_id", "coef"], X_log)})

8.9.1 Comparison with different subsamples
We first compute subsampled LOO for the second model using a different random seed than the first model. This means the two models will be evaluated on different subsets of observations.
loo_ss_2_diff = azp.loo_subsample(
    data=data2,
    observations=100,
    var_name="y",
    method="plpd",
    log_lik_fn=log_lik_fn,
    param_names=["beta"],
    pointwise=True,
    seed=315,
)
loo_ss_2_diff

Computed from 4000 by 100 subsampled log-likelihood values from 3020 total observations.

          Estimate     SE  subsampling SE
elpd_loo   -1952.3   16.2             0.3
p_loo          3.1

------

Pareto k diagnostic values:
                          Count    Pct.
(-Inf, 0.70]  (good)        100  100.0%
 (0.70, 1]    (bad)           0    0.0%
 (1, Inf)     (very bad)      0    0.0%
Now we compare the models using different subsamples.
comparison_diff = azp.compare({
    "model1_arsenic": loo_plpd,
    "model2_log_arsenic": loo_ss_2_diff,
}).round(1)
comparison_diff

/opt/hostedtoolcache/Python/3.11.13/x64/lib/python3.11/site-packages/arviz_stats/loo/compare.py:265: UserWarning: Different subsamples used in 'model2_log_arsenic' and 'model1_arsenic'. Naive diff SE is used.
  warnings.warn(

|                    | rank |    elpd |   p | elpd_diff | weight |   se |  dse | warning | subsampling_dse |
|--------------------|------|---------|-----|-----------|--------|------|------|---------|-----------------|
| model2_log_arsenic |    0 | -1952.3 | 3.1 |       0.0 |    0.5 | 16.2 |  0.0 | False   |             0.0 |
| model1_arsenic     |    1 | -1968.4 | 3.1 |      16.1 |    0.5 | 15.6 | 22.5 | False   |             0.4 |
The output is very similar to the regular compare() output, with an additional column, subsampling_dse, for the subsampling uncertainty. Note that dse already accounts for the subsampling uncertainty, so subsampling_dse should not be added to it; it is reported to show exactly how much of the uncertainty comes from subsampling.
Because the two models were evaluated on different subsamples, compare() emits a warning. In this situation the comparison cannot account for the correlation between the models’ predictions, so the standard error of the difference, dse, is computed naively by treating the two models as independent. This substantially inflates the uncertainty because we lose the benefit of the covariance term \(\operatorname{Cov}(\operatorname{elpd}_A, \operatorname{elpd}_B)\) discussed earlier.
8.9.2 Comparison with the same subsample
To properly compare models, we reuse the exact same observations by passing the first LOO subsample object to the second loo_subsample() call. This ensures both models are evaluated on the same subset of observations, which is critical for accurate model comparison.
loo_ss_2 = azp.loo_subsample(
    data=data2,
    observations=loo_plpd.loo_subsample_observations,
    var_name="y",
    method="plpd",
    log_lik_fn=log_lik_fn,
    param_names=["beta"],
    pointwise=True,
)
loo_ss_2

Computed from 4000 by 100 subsampled log-likelihood values from 3020 total observations.

          Estimate     SE  subsampling SE
elpd_loo   -1952.2   16.2             0.2
p_loo          3.0

------

Pareto k diagnostic values:
                          Count    Pct.
(-Inf, 0.70]  (good)        100  100.0%
 (0.70, 1]    (bad)           0    0.0%
 (1, Inf)     (very bad)      0    0.0%
Now we compare the models using the same subsample.
comparison_same = azp.compare({
    "model1_arsenic": loo_plpd,
    "model2_log_arsenic": loo_ss_2,
}).round(1)
comparison_same

|                    | rank |    elpd |   p | elpd_diff | weight |   se |  dse | warning | subsampling_dse |
|--------------------|------|---------|-----|-----------|--------|------|------|---------|-----------------|
| model2_log_arsenic |    0 | -1952.2 | 3.0 |       0.0 |    0.5 | 16.2 |  0.0 | False   |             0.0 |
| model1_arsenic     |    1 | -1968.4 | 3.1 |      16.2 |    0.5 | 15.6 |  4.4 | False   |             0.1 |
When using the same subsample, we can see that the standard error of the difference is dramatically reduced compared to the previous comparison. This is because the correlation between models’ predictions on the common subsample is properly accounted for in the variance calculation. The SE reduction is substantial, making it much easier to detect meaningful performance differences.
In this example, the second model (with log-transformed arsenic) has higher ELPD than the first model, indicating better predictive performance. The subsampling SE is very small relative to the difference in ELPD, indicating that the subsample size of 100 is sufficient for accurate comparison. If the subsampling SE were large, we could increase the subsample size using update_subsample() as demonstrated earlier.
8.10 Comparing full LOO to subsampled LOO
We can also compare a subsampled LOO object to a full LOO object that was computed on all \(n = 3020\) observations.
The following computes full PSIS-LOO-CV for the first model using the loo() function.
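The call itself is not shown here; a minimal version using the same data object would look like this (assuming the regular loo() signature from the same namespace).

loo_full = azp.loo(data, var_name="y", pointwise=True)
loo_full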
Computed from 4000 posterior samples and 3020 observations log-likelihood matrix.

          Estimate     SE
elpd_loo  -1968.47  15.60
p_loo         3.24      -

------

Pareto k diagnostic values:
                          Count    Pct.
(-Inf, 0.70]  (good)       3020  100.0%
 (0.70, 1]    (bad)           0    0.0%
 (1, Inf)     (very bad)      0    0.0%
When comparing a full LOO calculation to a subsampled calculation, the compare() function automatically uses only the observations included in both calculations.
comparison_mixed = azp.compare({
    "model1_full_loo": loo_full,
    "model2_subsampled_loo": loo_ss_2,
}).round(1)
comparison_mixed

/opt/hostedtoolcache/Python/3.11.13/x64/lib/python3.11/site-packages/arviz_stats/loo/compare.py:308: UserWarning: Estimated elpd_diff using observations included in loo calculations for all models.
  warnings.warn(

|                       | rank |    elpd |   p | elpd_diff | weight |   se |  dse | warning | subsampling_dse |
|-----------------------|------|---------|-----|-----------|--------|------|------|---------|-----------------|
| model2_subsampled_loo |    0 | -1952.2 | 3.0 |       0.0 |    0.5 | 16.2 |  0.0 | False   |             0.0 |
| model1_full_loo       |    1 | -1968.5 | 3.2 |      16.3 |    0.5 | 15.6 |  4.4 | False   |             0.2 |
Notice that this comparison triggers a warning indicating that only observations included in both LOO calculations are used for the difference estimate. Since the subsampled LOO evaluated only 100 observations, the comparison is based on those 100 observations rather than the full dataset. We also see a small increase in subsampling_dse compared to the previous comparison, which is due to a technical detail that we omit from this example.
8.11 Practical considerations
For moderately sized models, the PLPD surrogate typically suffices because the posterior mean encapsulates most of the predictive information. When the model has hierarchical components or weakly identified parameters, Magnusson et al. (2020) recommend richer surrogates such as the LPD or diagnostics-driven truncation of importance weights to reduce the approximation error \(e_j\) and therefore the subsampling variance. Truncated importance sampling plays a similar role by clipping large \(r_{is}\) values before applying the Pareto smoothing step. Regardless of the surrogate, increasing \(m\) under the difference estimator guarantees convergence because the finite population correction shrinks the variance as \(m\) approaches \(n\).
When comparing multiple models via subsampling, it is recommended to reuse a single subsample across all fits. In this case, the difference estimator produces \(\widehat{\operatorname{elpd}}_{\text{diff}}\) for each model, and the estimated difference \(\widehat{\operatorname{elpd}}_{A} - \widehat{\operatorname{elpd}}_{B}\) inherits the same subsampling indices. Consistency results in Magnusson et al. (2020) ensure that, even with a fixed \(m\) and a finite number of posterior draws, the estimator converges as the data size grows and the surrogate approximation error vanishes.