8  Model Comparison for Large Data

In this chapter, we demonstrate how to efficiently compare Bayesian models on large datasets using subsampled PSIS-LOO-CV. We apply the methods introduced in Section 7.8 to a real dataset containing thousands of observations, showing how to balance computational efficiency with statistical accuracy. The workflow combines fast surrogate approximations with exact computations on a small subsample, and can be further accelerated using approximate posteriors from variational inference or Laplace approximations. For theoretical background and details of the method, we recommend you read Section 7.8 and references therein.

8.1 Workflow for model comparison

The workflow for applying subsampled PSIS-LOO-CV to model comparison is straightforward. We compute cheap surrogates for all observations, then refine our estimates using PSIS-LOO-CV on a small subsample, balancing computational efficiency with statistical accuracy:

  1. Compute posteriors \(p_A(\theta \mid y)\) and \(p_B(\theta \mid y)\) for models \(A\) and \(B\)
  2. Calculate \(\tilde\pi_i\) for all \(n\) observations using an appropriate approximation
  3. Use simple random sampling to select \(m\) observations (start with \(m \approx 100\))
  4. Calculate \(\pi_j\) for the subsample using PSIS-LOO-CV, checking Pareto-\(k\) diagnostics for unreliable importance sampling approximations
  5. Estimate \(\operatorname{elpd}_A\), \(\operatorname{elpd}_B\), and their difference
  6. If subsampling SE is too large relative to the estimated difference, increase \(m\) using update_subsample()

Again, the key advantage is that the same subsample can be reused across all models, making the approach highly efficient for comparing multiple models. The difference \(\operatorname{elpd}_D = \operatorname{elpd}_A - \operatorname{elpd}_B\) and its variance are estimated directly from the common subsample, properly accounting for the correlation between models’ predictions.

This reuse is what makes the difference estimator superior to the Hansen-Hurwitz approach for model comparison, and it is why we recommend using the same subsample for all models rather than subsampling each model separately.
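Before turning to a concrete example, the following minimal sketch shows the shape of this workflow with the functions used later in this chapter (azp.loo_subsample(), azp.compare(), and azp.update_subsample()). Here data_A, data_B, and the log-likelihood helpers are placeholders, SEED is a fixed random seed, and azp is the ArviZ namespace used throughout this chapter; the wells example starting in the next section fills in the details.

# Sketch only: data_A and data_B are DataTree objects holding the posteriors
# of models A and B; log_lik_fn_A and log_lik_fn_B are helpers as in Section 8.2.1
loo_A = azp.loo_subsample(
    data_A, observations=100, var_name="y", method="plpd",
    log_lik_fn=log_lik_fn_A, param_names=["beta"], pointwise=True, seed=SEED,
)

# Reuse the exact observations selected for model A when evaluating model B
loo_B = azp.loo_subsample(
    data_B, observations=loo_A.loo_subsample_observations, var_name="y",
    method="plpd", log_lik_fn=log_lik_fn_B, param_names=["beta"], pointwise=True,
)

# Estimate elpd_A, elpd_B, and their difference from the common subsample
comparison = azp.compare({"model_A": loo_A, "model_B": loo_B})

# If the subsampling SE is too large, enlarge the subsample and compare again
loo_A = azp.update_subsample(
    loo_A, data_A, observations=200, var_name="y", method="plpd",
    log_lik_fn=log_lik_fn_A, param_names=["beta"], seed=SEED,
)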

8.2 Wells data and logistic model

To demonstrate the PSIS-LOO-CV subsampling method, we use data from a household survey in a region of Bangladesh where drinking water was contaminated with arsenic. Households with unsafe arsenic levels in their wells were asked whether they would switch to using a neighbor’s safe well. The goal is to predict this binary switching decision using household characteristics.

We focus on a simple logistic regression with two key predictors: the arsenic concentration in the household’s well and the distance to the nearest safe well. The dataset contains \(n = 3020\) observations. While this is not massive by modern standards, it is large enough that efficient computational methods become valuable. The advantage of this sample size is that individual observations have limited influence on the overall model fit, which helps ensure stable importance sampling computations. The data is available from the R package loo.

We model the indicator of switching water wells using a logistic regression with intercept, arsenic concentration, and distance to the nearest safe well in hundreds of metres. The design matrix is \(X = [1, \text{dist}/100, \text{arsenic}]\) and the likelihood is given by

\[ \Pr(y_i = 1 \mid \beta) = \operatorname{logit}^{-1}(X_i^{\mathsf{T}} \beta), \qquad \beta \sim \mathcal{N}(0, I_3). \]

import pandas as pd
import numpy as np

SEED = 4711  # random seed reused in the sampling and subsampling calls below (value assumed here)

wells = pd.read_csv("../data/wells.csv")
wells["dist100"] = wells["dist"] / 100

X = np.column_stack([
    np.ones(len(wells)),         # intercept
    wells["dist100"].values,     # distance to nearest safe well (hundreds of metres)
    wells["arsenic"].values,     # arsenic concentration in the household's well
])
y = wells["switch"].values

We can fit the model using PyMC or CmdStanPy and store the posterior draws in a DataTree object.

import pymc as pm
import xarray as xr

with pm.Model():
    beta = pm.Normal("beta", mu=0, sigma=1, shape=3)
    logit_p = pm.math.dot(X, beta)
    pm.Bernoulli("y", logit_p=logit_p, observed=y)

    idata = pm.sample(
        draws=1000,
        tune=1000,
        chains=4,
        random_seed=4711,
        progressbar=True,
        idata_kwargs={'log_likelihood': True}
    )

data = azp.convert_to_datatree(idata)
data["constant_data"] = xr.Dataset({"X": (["obs_id", "coef"], X)})
# CmdStanPy implementation will be added in future

8.2.1 Constructing the log-likelihood helper

Unlike regular loo(), the routine loo_subsample() can take a callable with signature log_lik_fn(observed, data) (needed for some approximation methods, such as the PLPD surrogate used below). This lets us compute the log-likelihood for a subset of observations on demand rather than storing it for all observations, which can be costly in both memory and computation for large datasets.

This callable signature is what loo_subsample() expects, and we can always wrap auxiliary parameters with tools such as functools.partial so that the routine still receives a callable of the form (observed, data), as sketched after the helper below.

For logistic regression, the per-draw log probability mass function is given by

\[ \log p(y_i \mid \beta_s, X_i) = y_i \log \pi_{is} + (1 - y_i) \log (1 - \pi_{is}), \qquad \pi_{is} = \operatorname{logit}^{-1}(X_i^{\mathsf{T}} \beta_s). \]

A direct approach is to store the design matrix in the constant_data group so that the helper can compute the log-likelihood for whichever subset loo_subsample() hands to it.

from scipy import stats
from scipy.special import expit

def log_lik_fn(obs, data):
    beta = data.posterior["beta"].values       # beta values provided via the data tree
    X = data.constant_data["X"].values         # design matrix from constant_data
    logit_pred = X @ beta                      # linear predictor
    prob = expit(logit_pred)                   # probability of switching
    return stats.bernoulli.logpmf(obs, prob)   # Bernoulli log-pmf for the requested observations

This log-likelihood helper evaluates the Bernoulli log-pmf for the selected subset and returns log probabilities for each posterior draw. No log-likelihood values are stored permanently, which keeps memory usage modest even when \(n\) is large.
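As an example of wrapping auxiliary arguments, suppose a variant of this helper took the design matrix explicitly instead of reading it from constant_data. The names log_lik_fn_X and log_lik_fn_bound below are hypothetical, the imports from the helper above are reused, and functools.partial binds the extra argument so that loo_subsample() still receives the (observed, data) signature.

from functools import partial

def log_lik_fn_X(obs, data, X):
    # Hypothetical variant of log_lik_fn above: same computation, but the
    # design matrix is passed explicitly rather than read from constant_data
    beta = data.posterior["beta"].values
    prob = expit(X @ beta)
    return stats.bernoulli.logpmf(obs, prob)

# Bind the extra argument so the callable again has the form (observed, data)
log_lik_fn_bound = partial(log_lik_fn_X, X=X)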

8.3 Subsampled PSIS-LOO-CV with the PLPD surrogate

As we’ve discussed previously, a key component of the PSIS-LOO-CV subsampling method is a fast surrogate for the leave-one-out term \(\pi_i\). One such surrogate is the point log predictive density (PLPD), which Magnusson et al. (2020) motivate by replacing \(\pi_i\) with

\[ \tilde\pi_i^{\text{plpd}} = \log p(y_i \mid X_i, \hat{\beta}), \quad \hat{\beta} = \mathbb{E}[\beta \mid y]. \]

The approximation reduces the computational burden to a single evaluation per observation while retaining the shrinkage induced by the posterior mean. Inserting \(\tilde\pi_i^{\text{plpd}}\) into the difference estimator converts the exact correction to a subsample average that only requires re-evaluating the full posterior draws on \(m\) observations.
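To make the surrogate concrete, the following sketch computes \(\tilde\pi_i^{\text{plpd}}\) for the wells model directly from the posterior mean. The variable names are ours and the sketch is for illustration only; loo_subsample() performs the equivalent computation internally when method="plpd" is selected and param_names is given.

from scipy.special import expit
from scipy import stats

# Posterior mean of beta, averaged over chains and draws
beta_hat = data.posterior["beta"].mean(dim=("chain", "draw")).values

# PLPD surrogate: one plug-in log density per observation
prob_hat = expit(X @ beta_hat)
plpd_tilde = stats.bernoulli.logpmf(y, prob_hat)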

Here we pass method="plpd" to loo_subsample() to select the PLPD surrogate.

loo_plpd = azp.loo_subsample(
    data=data,
    observations=100,
    var_name="y",
    method="plpd",
    log_lik_fn=log_lik_fn,
    param_names=["beta"],
    pointwise=True,
    seed=SEED,
)

loo_plpd
Computed from 4000 by 100 subsampled log-likelihood
values from 3020 total observations.

         Estimate   SE subsampling SE
elpd_loo   -1968.4 15.6            0.3
p_loo          3.1

------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.70]   (good)      100  100.0%
   (0.70, 1]   (bad)         0    0.0%
    (1, Inf)   (very bad)    0    0.0%

The output inherits from the ELPDData interface with an additional column subsampling_se isolating the extra Monte Carlo error introduced by drawing only \(100\) of the \(3020\) observations. In our wells run with \(m=100\) we obtained \(\widehat{\operatorname{elpd}}_{\text{loo}}\approx-1968.4\) with \(\text{SE}\approx15.6\) and \(\text{subsampling SE} \approx0.3\). All Pareto‑\(k\) values on the subsample were \(\le 0.7\).

8.4 Subsampled PSIS-LOO-CV with the LPD surrogate

When the posterior mass is diffuse or strongly correlated, averaging the predictive density over all posterior draws can improve the approximation. This is the log predictive density (LPD) surrogate, given by

\[ \tilde\pi_i^{\text{lpd}} = \log \left( \frac{1}{S} \sum_{s=1}^S p(y_i \mid X_i, \beta_s) \right), \]

where \(\beta_s\) are the draws stored in data.posterior. The additional sum over draws raises the cost and often reduces the approximation error. The same subsample size and random seed ensure that the estimator reuses the indices chosen for the PLPD computation, which makes comparisons straightforward.
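Written out by hand, the LPD surrogate is a log-mean-exp of the stored pointwise log-likelihood over the \(S\) draws. The sketch below (with variable names of our choosing) is for illustration only; loo_subsample() computes this internally when method="lpd" is selected.

import numpy as np
from scipy.special import logsumexp

# Pointwise log-likelihood stored by pm.sample(), shape (chain, draw, obs)
log_lik = data.log_likelihood["y"].values
S = log_lik.shape[0] * log_lik.shape[1]

# LPD surrogate: log of the draw-averaged predictive density per observation
lpd_tilde = logsumexp(log_lik, axis=(0, 1)) - np.log(S)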

loo_lpd = azp.loo_subsample(
    data=data,
    observations=100,
    var_name="y",
    method="lpd",
    pointwise=True,
    seed=SEED,
)

loo_lpd
Computed from 4000 by 100 subsampled log-likelihood
values from 3020 total observations.

         Estimate   SE subsampling SE
elpd_loo   -1968.5 15.6            0.4
p_loo          3.2

------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.70]   (good)      100  100.0%
   (0.70, 1]   (bad)         0    0.0%
    (1, Inf)   (very bad)    0    0.0%

From the output, we can see that the estimated ELPD closely matches the PLPD result (\(-1968.5\) vs. \(-1968.4\)) with the same \(\text{SE}\approx15.6\), and the subsampling SEs are very similar (PLPD \(\approx0.3\) vs. LPD \(\approx0.4\)).

For well‑identified models with tight posteriors, PLPD can match or even slightly outperform LPD, while for more complex models LPD often gains ground. Taken together, the parallel outputs illustrate the theoretical trade-off: PLPD minimises computation, whereas LPD expends extra work to tighten the estimator. For larger \(m\) the difference between the two surrogates typically diminishes, and the choice becomes a balance between runtime and stability.

8.5 Incremental sampling updates

Magnusson et al. (2020) recommend increasing \(m\) until the subsampling uncertainty is negligible for the task at hand. The update_subsample() routine efficiently extends an existing subsample without discarding work already done. Here we add 200 additional observations (from 100 to 300 total in the subsample) by using the previous subsample as a starting point.

loo_update = azp.update_subsample(
    loo_plpd,
    data,
    observations=200,
    var_name="y",
    method="plpd",
    log_lik_fn=log_lik_fn,
    param_names=["beta"],
    seed=SEED,
)

loo_update
Computed from 4000 by 300 subsampled log-likelihood
values from 3020 total observations.

         Estimate   SE subsampling SE
elpd_loo   -1968.4 15.6            0.2
p_loo          3.1

------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.70]   (good)      300  100.0%
   (0.70, 1]   (bad)         0    0.0%
    (1, Inf)   (very bad)    0    0.0%

The updated result shows how the estimator stabilises as more observations are included. In our run, the subsampling SE decreased from about \(0.3\) to \(0.2\), while the estimated ELPD and \(p_{\text{loo}}\) remained essentially unchanged.

8.6 Approximate LOO-CV using PSIS-LOO with posterior approximations

Using posterior approximations such as variational inference or Laplace approximations can further accelerate LOO-CV for large data. These methods avoid the computational cost of full MCMC while maintaining reasonable approximation quality for well-behaved models.

We will use the Laplace approximation, which fits a multivariate normal distribution centered at the posterior mode \(\hat{\theta}\) with covariance equal to the inverse Hessian of the negative log posterior evaluated at the mode. For a model with likelihood \(p(y \mid \theta)\) and prior \(p(\theta)\), the Laplace approximation is

\[ q(\theta \mid y) = \mathcal{N}(\theta \mid \hat{\theta}, \Sigma), \]

where

\[ \hat{\theta} = \arg\max_\theta \log p(\theta \mid y) \quad \text{ and } \quad \Sigma = [-\nabla^2 \log p(\theta \mid y)]^{-1}. \]

Draws from \(q(\theta \mid y)\) can be obtained efficiently by sampling from the fitted normal distribution.
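To illustrate the idea, here is a minimal sketch of a Laplace approximation for the wells model using scipy alone; neg_log_post and the other names are ours. The chapter itself uses pymc_extras.fit_laplace below, which performs the equivalent steps and returns the draws, mode, and covariance in a familiar format.

import numpy as np
from scipy import optimize
from scipy.stats import multivariate_normal

def neg_log_post(beta):
    # Negative log posterior for the logistic model with a N(0, I_3) prior
    eta = X @ beta
    log_lik = np.sum(y * eta - np.logaddexp(0.0, eta))
    log_prior = -0.5 * np.sum(beta ** 2)
    return -(log_lik + log_prior)

# Mode and inverse Hessian define q(theta | y) = N(beta_hat, Sigma)
res = optimize.minimize(neg_log_post, x0=np.zeros(3), method="BFGS")
beta_hat, Sigma = res.x, res.hess_inv  # BFGS estimate of the inverse Hessian

# Draws from the approximate posterior
laplace_draws = multivariate_normal(mean=beta_hat, cov=Sigma).rvs(size=2000, random_state=SEED)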

The following fits the wells model using a Laplace approximation via pymc_extras.fit_laplace, which returns draws from the approximate posterior along with the mode and covariance estimates.

import pymc as pm
from pymc_extras import fit_laplace

with pm.Model() as model_laplace:
    beta = pm.Normal("beta", mu=0, sigma=1, shape=3)
    logit_p = pm.math.dot(X, beta)
    pm.Bernoulli("y", logit_p=logit_p, observed=y)

    idata_laplace = fit_laplace(
        chains=4,
        draws=2000,
        random_seed=SEED,
        progressbar=True
    )
# CmdStanPy implementation will be added in future

The fit_laplace function stores the posterior mode in the fit group as mean_vector and the covariance in covariance_matrix. Posterior draws are stored in the posterior group as usual. We convert the result to a DataTree and attach the observed data and design matrix for use with loo_approximate_posterior().

data_laplace = azp.convert_to_datatree(idata_laplace)

data_laplace["observed_data"] = xr.Dataset({
    "y": (["obs_id"], y, {"obs_id": range(len(y))})
})
data_laplace["constant_data"] = xr.Dataset({"X": (["obs_id", "coef"], X)})

8.6.1 Computing log density corrections

To use loo_approximate_posterior(), we must supply log_p and log_q, which represent the log density of the true posterior and the approximate posterior, respectively. The importance ratios are then adjusted to account for the approximation quality with

\[ r(\theta_s) \propto \frac{1}{p(y_i \mid \theta_s)} \frac{p(\theta_s \mid y)}{q(\theta_s \mid y)}. \]

For the Laplace approximation, \(\log q(\theta_s \mid y)\) is the log density of the fitted normal distribution, while \(\log p(\theta_s \mid y)\) is the log prior plus the log likelihood. We require both of these quantities to be able to compute the importance ratios.

from scipy.stats import multivariate_normal

beta_samples = data_laplace.posterior["beta"]
X_xr = data_laplace.constant_data["X"]
y_xr = data_laplace.observed_data["y"]

param_dim = [d for d in beta_samples.dims if d not in ["chain", "draw"]][0]
beta_renamed = beta_samples.rename({param_dim: "coef"})

logit_pred = xr.dot(X_xr, beta_renamed, dims=["coef"])
prob = 1 / (1 + xr.ufuncs.exp(-logit_pred))
log_lik_laplace = y_xr * xr.ufuncs.log(prob) + (1 - y_xr) * xr.ufuncs.log(1 - prob)

data_laplace["log_likelihood"] = xr.Dataset({"y": log_lik_laplace})

mean_vals = data_laplace.fit["mean_vector"].values
cov_matrix = data_laplace.fit["covariance_matrix"].values

# This applies the multivariate normal log density function to each draw along the last axis
# which is the parameter dimension
log_q_vals = np.apply_along_axis(
    lambda b: multivariate_normal.logpdf(b, mean=mean_vals, cov=cov_matrix),
    axis=-1,
    arr=beta_renamed.values
)
log_q_da = xr.DataArray(
    log_q_vals,
    dims=["chain", "draw"],
    coords={"chain": beta_renamed.chain, "draw": beta_renamed.draw}
)

log_prior = -0.5 * (beta_renamed ** 2).sum("coef") - 1.5 * np.log(2 * np.pi)
log_lik_sum = log_lik_laplace.sum("obs_id")
log_p_da = log_prior + log_lik_sum

The log prior uses the standard normal density \(\mathcal{N}(0, I_3)\) specified in the model. The log likelihood sums over all observations to obtain the joint log density. These quantities are then passed to loo_approximate_posterior() to compute corrected LOO estimates.
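For reference, the prior term in the code is the log density of the \(\mathcal{N}(0, I_3)\) prior evaluated at each draw,

\[ \log p(\beta_s) = -\tfrac{3}{2}\log(2\pi) - \tfrac{1}{2}\beta_s^{\mathsf{T}}\beta_s, \]

which is where the constant \(-1.5\log(2\pi)\) comes from. The resulting log_p_da is the joint log density \(\log p(\beta_s) + \sum_{i=1}^n \log p(y_i \mid \beta_s, X_i)\), which equals the log posterior up to an additive constant that cancels in the self-normalised importance ratios.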

8.6.2 Running LOO-CV with approximate posterior

With log_p and log_q in hand, we can compute LOO-CV using the approximate posterior. The loo_approximate_posterior() function adjusts the importance ratios to account for the discrepancy between the true posterior and the Laplace approximation.

loo_ap = azp.loo_approximate_posterior(
    data=data_laplace,
    log_p=log_p_da,
    log_q=log_q_da,
    var_name="y",
    pointwise=True
)

loo_ap
Computed from 8000 posterior samples and 3020 observations log-likelihood matrix.
Posterior approximation correction used.

         Estimate       SE
elpd_loo -1968.42    15.59
p_loo        3.17        -
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.70]   (good)     3020  100.0%
   (0.70, 1]   (bad)         0    0.0%
    (1, Inf)   (very bad)    0    0.0%

The output shows that the approximate posterior performs well relative to MCMC: the ELPD estimate is essentially identical to the subsampled MCMC results above. The Pareto-\(k\) diagnostics remain available and indicate whether the importance sampling correction is reliable, and the line “Posterior approximation correction used” confirms that the importance ratios were adjusted using log_p and log_q.

8.7 Combining approximate posterior with subsampling

The approximate posterior correction can be combined with subsampling to further reduce computational cost.

When using loo_subsample() with an approximate posterior, we simply pass log_p and log_q along with the subsample specification. The difference estimator then corrects for both the approximation error and the subsampling uncertainty.

loo_ap_ss = azp.loo_subsample(
    data=data_laplace,
    observations=100,
    var_name="y",
    method="plpd",
    log_lik_fn=log_lik_fn,
    param_names=["beta"],
    log_p=log_p_da,
    log_q=log_q_da,
    pointwise=True,
    seed=SEED
)

loo_ap_ss
Computed from 8000 by 100 subsampled log-likelihood
values from 3020 total observations. Posterior approximation correction used.

         Estimate   SE subsampling SE
elpd_loo   -1968.6 15.6            0.4
p_loo          3.3

------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.70]   (good)      100  100.0%
   (0.70, 1]   (bad)         0    0.0%
    (1, Inf)   (very bad)    0    0.0%

The combined approach leverages the strengths of both methods. The Laplace approximation avoids expensive MCMC sampling, while subsampling reduces the cost of evaluating the full dataset. The correction terms ensure the estimates remain unbiased despite the double approximation.

In our wells example, the subsampled approximate posterior LOO produces estimates very close to both the full MCMC and the full approximate posterior results, with negligible subsampling SE. This demonstrates that for well-behaved models, aggressive approximation can yield accurate predictive performance estimates at minimal computational cost.

8.8 Comparing models with subsampled LOO

We can compare models whose ELPD was estimated via subsampling just as we would with regular LOO, using the compare() function. We will fit a second model that uses log(arsenic) instead of arsenic as a predictor and compare the two models.

X_log = X.copy()
X_log[:, 2] = np.log(X[:, 2])  # replace the arsenic column with log(arsenic)

with pm.Model():
    beta2 = pm.Normal("beta", mu=0, sigma=1, shape=3)
    logit_p2 = pm.math.dot(X_log, beta2)
    pm.Bernoulli("y", logit_p=logit_p2, observed=y)

    idata2 = pm.sample(
        draws=1000,
        tune=1000,
        chains=4,
        random_seed=SEED,
        progressbar=True,
        idata_kwargs={'log_likelihood': True}
    )

data2 = azp.convert_to_datatree(idata2)
data2["constant_data"] = xr.Dataset({"X": (["obs_id", "coef"], X_log)})
# CmdStanPy implementation will be added in future

8.8.1 Comparison with different subsamples

We first compute subsampled LOO for the second model using a different random seed than the first model. This means the two models will be evaluated on different subsets of observations.

loo_ss_2_diff = azp.loo_subsample(
    data=data2,
    observations=100,
    var_name="y",
    method="plpd",
    log_lik_fn=log_lik_fn,
    param_names=["beta"],
    pointwise=True,
    seed=315
)

loo_ss_2_diff
Computed from 4000 by 100 subsampled log-likelihood
values from 3020 total observations.

         Estimate   SE subsampling SE
elpd_loo   -1952.3 16.2            0.3
p_loo          3.1

------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.70]   (good)      100  100.0%
   (0.70, 1]   (bad)         0    0.0%
    (1, Inf)   (very bad)    0    0.0%

Now we compare the models using different subsamples.

comparison_diff = azp.compare({
    "model1_arsenic": loo_plpd,
    "model2_log_arsenic": loo_ss_2_diff,
}).round(1)

comparison_diff
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/arviz_stats/loo/compare.py:279: UserWarning: Different subsamples used in 'model2_log_arsenic' and 'model1_arsenic'. Naive diff SE is used.
  warnings.warn(
                       rank     elpd    p  elpd_diff  weight    se   dse  warning  subsampling_dse
model2_log_arsenic        0  -1952.3  3.1        0.0     0.5  16.2   0.0    False              0.0
model1_arsenic            1  -1968.4  3.1       16.1     0.5  15.6  22.5    False              0.4
Note on dse and subsampling_dse

The output is very similar to the regular compare() output, with an additional column for subsampling uncertainty, subsampling_dse. Note that dse already accounts for the subsampling uncertainty, so subsampling_dse should not be added on top of it; it is reported separately to show how much of the uncertainty in the difference comes from subsampling.

When different subsamples are used, we get a warning message indicating this. The warning means that the comparison cannot account for the correlation between the models’ predictions: the standard error of the difference dse is computed naively by treating the two models as independent, which substantially inflates the uncertainty because we lose the benefit of the covariance term \(\operatorname{Cov}(\operatorname{elpd}_A, \operatorname{elpd}_B)\) discussed earlier.
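The inflation follows directly from the variance decomposition of a difference,

\[ \operatorname{Var}(\widehat{\operatorname{elpd}}_A - \widehat{\operatorname{elpd}}_B) = \operatorname{Var}(\widehat{\operatorname{elpd}}_A) + \operatorname{Var}(\widehat{\operatorname{elpd}}_B) - 2\operatorname{Cov}(\widehat{\operatorname{elpd}}_A, \widehat{\operatorname{elpd}}_B). \]

With different subsamples the covariance term cannot be estimated and is dropped, giving \(\text{dse} \approx \sqrt{16.2^2 + 15.6^2} \approx 22.5\) here; with a shared subsample the positive covariance is retained and dse shrinks accordingly, as we see in Section 8.8.2.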

8.8.2 Comparison with the same subsample

To properly compare models, we reuse the exact same observations by passing the first LOO subsample object to the second loo_subsample() call. This ensures both models are evaluated on the same subset of observations, which is critical for accurate model comparison.

loo_ss_2 = azp.loo_subsample(
    data=data2,
    observations=loo_plpd.loo_subsample_observations,
    var_name="y",
    method="plpd",
    log_lik_fn=log_lik_fn,
    param_names=["beta"],
    pointwise=True,
)

loo_ss_2
Computed from 4000 by 100 subsampled log-likelihood
values from 3020 total observations.

         Estimate   SE subsampling SE
elpd_loo   -1952.2 16.2            0.2
p_loo          3.0

------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.70]   (good)      100  100.0%
   (0.70, 1]   (bad)         0    0.0%
    (1, Inf)   (very bad)    0    0.0%

Now we compare the models using the same subsample.

comparison_same = azp.compare({
    "model1_arsenic": loo_plpd,
    "model2_log_arsenic": loo_ss_2,
}).round(1)

comparison_same
                       rank     elpd    p  elpd_diff  weight    se   dse  warning  subsampling_dse
model2_log_arsenic        0  -1952.2  3.0        0.0     0.5  16.2   0.0    False              0.0
model1_arsenic            1  -1968.4  3.1       16.2     0.5  15.6   4.4    False              0.1

When using the same subsample, we can see that the standard error of the difference is dramatically reduced compared to the previous comparison. This is because the correlation between models’ predictions on the common subsample is properly accounted for in the variance calculation. The SE reduction is substantial, making it much easier to detect meaningful performance differences.

In this example, the second model (with log-transformed arsenic) has higher ELPD than the first model, indicating better predictive performance. The subsampling SE is very small relative to the difference in ELPD, indicating that the subsample size of 100 is sufficient for accurate comparison. If the subsampling SE were large, we could increase the subsample size using update_subsample() as demonstrated earlier.

8.9 Comparing full LOO to subsampled LOO

We can also compare a subsampled LOO object to a full LOO object that was computed on all \(n = 3020\) observations.

The following computes full PSIS-LOO-CV for the first model using the loo() function.

loo_full = azp.loo(data, var_name="y", pointwise=True)
loo_full
Computed from 4000 posterior samples and 3020 observations log-likelihood matrix.

         Estimate       SE
elpd_loo -1968.47    15.60
p_loo        3.24        -
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.70]   (good)     3020  100.0%
   (0.70, 1]   (bad)         0    0.0%
    (1, Inf)   (very bad)    0    0.0%

When comparing a full LOO calculation to a subsampled calculation, the compare() function automatically uses only the observations included in both calculations.

comparison_mixed = azp.compare({
    "model1_full_loo": loo_full,
    "model2_subsampled_loo": loo_ss_2
}).round(1)

comparison_mixed
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/arviz_stats/loo/compare.py:325: UserWarning: Estimated elpd_diff using observations included in loo calculations for all models.
  warnings.warn(
                          rank     elpd    p  elpd_diff  weight    se   dse  warning  subsampling_dse
model2_subsampled_loo        0  -1952.2  3.0        0.0     0.5  16.2   0.0    False              0.0
model1_full_loo              1  -1968.5  3.2       16.3     0.5  15.6   4.4    False              0.2

Notice that this comparison triggers a warning message indicating that only observations included in both LOO calculations are used for the difference estimate. Since the subsampled LOO evaluated only 100 observations, the comparison is based on those 100 observations rather than the full dataset. We also see a small increase in subsampling_dse compared to the previous comparison, which is due to a technical detail that we omit from this example.

8.10 Practical considerations

For moderately sized models, the PLPD surrogate typically suffices because the posterior mean encapsulates most of the predictive information. When the model has hierarchical components or weakly identified parameters, Magnusson et al. (2020) recommend richer surrogates such as the LPD, or diagnostics-driven truncation of the importance weights, to reduce the approximation error \(e_j\) and therefore the subsampling variance. Truncated importance sampling plays a similar stabilising role by clipping the largest \(r_{is}\) values instead of Pareto-smoothing them (sketched below). Regardless of the surrogate, increasing \(m\) under the difference estimator guarantees convergence because the finite population correction shrinks the variance as \(m\) approaches \(n\).
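As an illustration of the truncation idea, the following sketch clips the raw importance ratios for a single observation at a common bound (the mean ratio times \(\sqrt{S}\)) before self-normalising. The function name and the bound are our choices, and this is not what loo_subsample() does internally; PSIS replaces this simple clipping with Pareto smoothing of the largest weights.

import numpy as np

def truncated_weights(log_ratios):
    # log_ratios: shape (S,) log importance ratios for one observation.
    # Clip each raw ratio at r_bar * sqrt(S), then self-normalise.
    r = np.exp(log_ratios - log_ratios.max())   # rescaled ratios for numerical stability
    bound = r.mean() * np.sqrt(r.size)          # common truncation bound
    w = np.minimum(r, bound)                    # clip the largest ratios
    return w / w.sum()                          # truncated, self-normalised weights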

When comparing multiple models via subsampling, it is recommended to reuse a single subsample across all fits. In this case, the difference estimator produces an \(\widehat{\operatorname{elpd}}\) estimate for each model, and the estimated difference \(\widehat{\operatorname{elpd}}_{A} - \widehat{\operatorname{elpd}}_{B}\) inherits the same subsampling indices. Consistency results in Magnusson et al. (2020) ensure that even with fixed \(m\) and a finite number of posterior draws, the estimator converges as the quality of the surrogate improves.