7  Model Comparison

7.1 Overview

Models are designed as approximations to help us understand a specific problem or a related class of problems. They are not intended to be exact replicas of the real world. In this sense, all models are “wrong”. However, not all models are equally wrong; some are better suited to describing a particular problem. Moreover, one model may explain some of the observations better while another model explains others better. For example, a model that works well for high-temperature observations may not work so well for low-temperature observations.

During a typical data analysis, it’s common to develop multiple models to fit the same data. These models may differ in aspects such as priors, likelihoods, the inclusion of only linear terms versus adding smooth or interaction terms, and so on. When faced with multiple models, a natural question arises: how should we choose between them?

Understanding the problem at hand can often provide significant guidance. This is usually the case because we, as modellers, have knowledge about the problem and about the goals of the analysis that we did not include in the models we are comparing. For example, if we are interested in making predictions, we may prefer a model that is better at predicting new data, even if we are not able to interpret its parameters. Instead, if we are interested in understanding the underlying processes, we may prefer a model that is simpler and easier to interpret, even if it does not fit the data as well.

As we already saw in Chapter 5, prior and posterior predictive checks can also assist in evaluating different models. In some cases, computational considerations, like the convergence issues discussed in Chapter 4, may influence our choice. Often even tradition and convention play a role in the decision-making process. For example, in some fields certain models are more commonly used than others, so you may feel pressured to use a model that is more widely known. Of course, we should strive to make scientific decisions based on the data and the problem at hand, rather than on tradition or convention. However, there is no benefit in denying the human nature of our practices.

In this section, we will explore formal methods that complement other tools we have seen to compare models. Hopefully, these tools will help us make more informed decisions.

7.2 The balance between simplicity and accuracy

When choosing between alternative explanations, there is a principle known as Occam’s razor. In very general terms, this principle establishes that given two or more equivalent explanations for the same phenomenon, the simplest one is the preferred explanation. When the explanations are models, a common criterion for simplicity is the number of parameters in a model.

Another factor we generally need to consider when comparing models is their accuracy, that is, how good a model is at fitting the data. According to this criterion, if we have two (or more) models and one of them explains the data better than the other, then that is the preferred model.

Intuitively, it seems that when comparing models, we tend to prefer those that fit the data best and those that are simpler. But what should we do if these two principles conflict? Or, more generally, is there a quantitative way to consider both contributions? The short answer is yes. In fact, there is more than one way to do it.

7.3 Predictive accuracy measures

If we evaluate a model solely by its capacity to reproduce the data used to fit it, for instance by measuring the mean squared error between observations and predictions, we will likely be overoptimistic about the performance of the model when predicting unobserved data. Additionally, for a flexible enough model (and without proper regularization), we may tweak its parameters until we fit the data perfectly. Thus, instead of computing the within-sample accuracy, that is, the accuracy measured with the same data used to fit the model, we prefer to compute the out-of-sample accuracy, that is, the accuracy measured with data not used to fit the model.

Note

Bayesian models, through the use of priors and the fact that the posterior is computed by marginalizing over them, are usually less prone to overfitting than alternative methods. This is less true for very vague priors, and more true for priors that induce a prior predictive distribution compatible with the domain knowledge. But overall, it is not true that we cannot overfit Bayesian models. Thus, in practice we often end up needing to compare Bayesian models using some measure of predictive performance or predictive accuracy.

The easiest way to compute out-of-sample accuracy is to have at least two datasets: one we use to fit the models and one we use to evaluate them. There are even more complex ways to partition the data, but the main point for our current discussion is that this is usually a luxury. Data is valuable, and not using all of it to fit models can be a waste of information and resources. As this is a pervasive situation in data analysis, many different methods have been developed to evaluate the predictive accuracy of models without wasting data.

We are going to discuss two families of methods:

  • Cross-validation: This is an empirical strategy based on dividing the available data into separate subsets that are used alternately to fit and evaluate the models. It is a way to simulate having a hold-out dataset for model evaluation, while actually using all the available data for inference.

  • Information criteria: This is a general term used to refer to various expressions that approximate out-of-sample accuracy as in-sample accuracy plus a term that penalizes model complexity.

7.4 Information criteria

Information criteria are a collection of closely related tools used to compare models in terms of goodness of fit and model complexity. In other words, information criteria formalize the intuition we developed at the beginning of the chapter. The exact way these quantities are derived has to do with a field known as Information Theory.

An intuitive way to measure how well a model fits the data is to calculate the mean squared error between the data and the predictions made by the model:

\[ \frac{1}{N} \sum _{i}^{N} (y_i - \operatorname{E} (y_i \mid \theta))^2 \]

\(\operatorname{E} (y_i \mid \theta)\) is the predicted value given the estimated parameters. Note that this is essentially the average of the squared differences between the observed and predicted data. Squaring the errors ensures that differences do not cancel out and emphasizes larger errors compared to other alternatives, such as taking the absolute value.

The mean squared error may be familiar to us since it is very popular. But if we stop and reflect on this quantity, we will see that in principle there is nothing special about it and we could well come up with more general expressions.
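As a concrete reference point, here is a minimal NumPy sketch of this quantity, with made-up observations and predictions (not data from the book):

import numpy as np

# hypothetical observations and the corresponding model predictions
y_obs = np.array([2.1, 3.4, 1.8, 4.0, 2.9])
y_pred = np.array([2.0, 3.0, 2.2, 3.7, 3.1])

# mean squared error: the average of the squared differences
mse = np.mean((y_obs - y_pred) ** 2)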

7.5 Entropy

For a probability distribution with \(N\) possible events, each with probability \(p_i\), the entropy is defined as:

\[ H(p) = - \mathbb{E}[\log{p}] = -\sum_i^N p_i \log{p_i} \]

Entropy is a measure of the uncertainty of a distribution: the expected value of the negative log-probability of the events. If only one event is possible the entropy will be 0; if all events have the same probability the entropy will be maximal. The concept of entropy can be extended to continuous distributions, but we will not go into those details. Figure 7.1 shows the entropy of a Bernoulli distribution for four different values of the probability of success. We can see that the entropy is maximal when the probability of success is 0.5 and approaches 0 as the probability of success approaches 0 (or 1).

Code
_, axes = plt.subplots(2, 2, sharex=True, sharey=True)

for p, ax in zip([0.5, 0.1, 0.9, 0.0001], axes.ravel()):
    dist = pz.Bernoulli(p=p)
    dist.plot_pdf(ax=ax, legend=False)
    ax.set_title(f"Entropy={dist.entropy():.2f}")
    ax.set_ylim(-0.05, 1.05)
Figure 7.1: Entropy of a Bernoulli distribution as a function of the probability of success

The concept of entropy appears many times in statistics. It can be useful, for example, when defining priors: in general, we want to use a prior that has maximum entropy given our knowledge (see for example PreliZ’s maxent function). It is also useful when comparing models, as we will see in the next section.
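For instance, assuming the PreliZ API, a sketch of how maxent can be asked for the maximum-entropy Normal with 90% of its mass between -1 and 1 looks like this:

import preliz as pz

# find the Normal with maximum entropy subject to having 90% of its
# mass between -1 and 1; the distribution is updated in place with
# the optimized parameters (and its PDF is plotted by default)
dist = pz.Normal()
pz.maxent(dist, lower=-1, upper=1, mass=0.9)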

7.6 KL divergence

The Kullback-Leibler (KL) divergence is a measure of how one probability distribution diverges from a second, expected probability distribution. Suppose we have a target distribution \(p\) that we cannot work with directly, and we only have access to a different distribution that we will call \(q\). We want to evaluate how well \(q\) approximates \(p\). One way to do this is to measure the Kullback-Leibler divergence between \(p\) and \(q\). If \(q\) is a parametric family, we can find the parameters making \(q\) as close to \(p\) as possible by minimizing the KL divergence. The KL divergence is defined as:

\[ \mathbb{KL}(p \parallel q) = \overbrace{-\sum_i^N p_i \log{q_i}}^{H(p, q)} - \overbrace{\left(-\sum_{i}^N p_i \log{p_i}\right)}^{H(p)} \]

Notice that it has two components, the entropy of \(p\), \(H(p)\) and the cross entropy \(H(p, q)\), that is, the entropy of \(q\) but evaluated according to \(p\). This may seem somewhat abstract, but if we think that we have \(N\) samples that we assume come from an unknown distribution \(p\) and we have a model described by \(q(y \mid \theta)\), then we will see that we are describing a typical situation in data analysis.

According to this expression, the KL divergence represents the “extra” entropy that we introduce when approximating \(p\) by \(q\). It is common to find it written in other ways, such as:

\[ \mathbb{KL}(p \parallel q) \quad=\quad- \sum_i^N p_i (\log{q_i} - \log{p_i}) \quad=\quad \mathbb{E}_p[\log{p}] - \mathbb{E}_p[\log{q}] \quad=\quad \sum_i^N p_i \log{\frac{p_i}{q_i}} \]

Here \(p\) represents the data-generating process, the population, or the true distribution, and \(q\) represents our model. It may seem that these expressions are all useless because we do not know \(p\); that is the reason we are trying to fit a model in the first place. But if our goal is to compare \(m\) models, represented by \(q_0, q_1 \cdots q_m\), we can still use the KL divergence to compare them! The reason is that even when we do not know \(p\), its entropy is a constant term for all comparisons.

\[ \begin{split} \mathbb{KL}(p \parallel q_0) =&\; \mathbb{E}[\log{p}] - \mathbb{E}[\log{q(y \mid \theta_0)}] \\ \mathbb{KL}(p \parallel q_1) =&\; \mathbb{E}[\log{p}] - \mathbb{E}[\log{q(y \mid \theta_1)}] \\ &\cdots \\ \mathbb{KL}(p \parallel q_m) =&\; \mathbb{E}[\log{p}] - \mathbb{E}[\log{q(y \mid \theta_m)}] \end{split} \]

This tells us that the best model, from the set of compared models, will be the one with the largest expected (log-)likelihood. In other words, minimizing the KL divergence is equivalent to maximizing the expected log-likelihood.
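To make this concrete, here is a small numeric sketch (the discrete distributions are made up for illustration) showing that ranking two candidate models by KL divergence gives the same ordering as ranking them by \(\mathbb{E}_p[\log q]\):

import numpy as np

p = np.array([0.1, 0.3, 0.6])    # "true" distribution, unknown in practice
q1 = np.array([0.2, 0.3, 0.5])   # candidate model 1
q2 = np.array([0.4, 0.4, 0.2])   # candidate model 2

def kl(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))

# the KL divergence and -E_p[log q] differ only by the entropy of p,
# which is the same constant for every candidate, so the rankings agree
for q in (q1, q2):
    print(kl(p, q), -np.sum(p * np.log(q)))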

In practice we do not really have \(\mathbb{E}[\log{q}]\), but we can estimate this quantity from our sample. However, we have to be careful: if we use the sample to estimate the parameters of a model and then use those same parameters to estimate \(\mathbb{E}[\log{q}]\), we will introduce a bias and be overconfident in the ability of our model to explain the data. In the next sections we will see three strategies to avoid this problem.

7.7 Akaike information criterion

The Akaike information criterion (Akaike 1974) (AIC) is a very well-known and widely used information criterion and is defined as:

\[ AIC = -2 \sum_i^N \log p(y_i \mid \hat{\theta}_{mle}) + 2 k \]

where \(k\) is the number of model parameters and \(\hat{\theta}_{mle}\) is the maximum likelihood estimate of \(\theta\). For the rest of our discussion we will omit the constant -2 and write

\[ AIC = \sum_i^N \log p(y_i \mid \hat{\theta}_{mle}) - k \]

In this way it is easier to see that the Akaike criterion is a penalized maximum likelihood: it becomes smaller the more parameters a model has. Furthermore, this version without the -2 has a clearer correspondence with other expressions we will see below.

Note

That the number of parameters is a valid penalty criterion follows our intuition, a model with a greater number of parameters is, in general, more flexible. But it is interesting to note that the Akaike’s criterion has a theoretical justification, it is not that Akaike simply thought that using \(k\) was a good idea.
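As a minimal sketch (not one of the book’s examples), this is how the standard -2 form of AIC could be computed for a Normal model fitted by maximum likelihood with SciPy:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
y = rng.normal(0, 1, size=100)       # synthetic data

mu_mle, sd_mle = stats.norm.fit(y)   # maximum likelihood estimates
log_lik = stats.norm(mu_mle, sd_mle).logpdf(y).sum()
k = 2                                # number of parameters (mu and sigma)
aic = -2 * log_lik + 2 * k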

The AIC criterion is very useful, but it can be quite limited for Bayesian models. One reason is that it uses a point estimate of \(\theta\) instead of the posterior distribution, hence discarding potentially useful information. Furthermore, from a Bayesian perspective, AIC assumes that priors are flat and is therefore incompatible with informative and/or weakly informative priors. Also, the number of parameters in a model is not always a good measure of its complexity. In general, a regularized model will have a smaller effective number of parameters. For example, when using informative priors or in hierarchical models, parameters become interrelated and thus the effective number of parameters can be smaller than the actual number of parameters. AIC has no way to account for this.

Can we find something like the Bayesian version of AIC? Yes, we can.

7.8 ELPD

As we already saw in the Akaike criterion, the goodness of fit is given by:

\[ \sum_i^N \log p(y_i \mid \hat{\theta}_{mle}) \]

But in Bayesian statistics we do NOT have a point estimate of \(\theta\); we have a whole distribution. To account for this we could instead compute:

\[ \sum_i^N \log \int \ p(y_i \mid \theta) \; p(\theta \mid y) d\theta \]

In general we do not have an analytical expression for the posterior \(p(\theta \mid y)\); instead, we usually work with samples (such as those obtained by MCMC), so we can approximate the above integral by a sum over the \(S\) posterior samples:

\[ \sum_i^N \log \left(\frac{1}{S} \sum _{j}^S p(y_i \mid \theta^j) \right) \]

We will call this quantity the ELPD, which is short for expected log-predictive density. When the likelihood is discrete, we should use “probability” instead of “density”, but it is a common practice to avoid pedantry.

The ELPD is a more Bayesian way to measure goodness of fit than the term used in AIC, but we are still missing one element: the penalization term.
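The following sketch computes this quantity from an \((S, N)\) matrix of pointwise log-likelihoods (one row per posterior sample, one column per observation); using logsumexp for numerical stability is an implementation detail, not part of the formula above:

import numpy as np
from scipy.special import logsumexp

def elpd(log_lik):
    """log_lik: array of shape (S, N) with log p(y_i | theta^j)."""
    S = log_lik.shape[0]
    # log( (1/S) * sum_j p(y_i | theta^j) ) per observation, then sum over i
    return np.sum(logsumexp(log_lik, axis=0) - np.log(S))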

7.9 WAIC

The Widely applicable information criterion (WAIC) uses the ELPD plus a penalization term (Watanabe 2013).

\[ WAIC = \sum_i^N \log \left(\frac{1}{S} \sum _{j}^S p(y_i \mid \theta^j) \right) - \sum_i^N \operatorname{V}_{j}^S \left( \log p(y_i \mid \theta^j) \right) \]

We can see that the penalization term is given by the variance of the log-likelihood over the \(S\) posterior samples, computed per observation. Justifying this term requires a bit more work, but the intuition is that this variance measures how much variability there is in the predictions made by the model. The more variability, the more flexible the model is, and therefore the more we should penalize it. Let’s look at a linear model as an example:

\[ Y = \alpha + \beta X \]

A model where \(\beta=0\) will be less flexible, since it is equivalent to a model with only one parameter, \(\alpha\). In a slightly more subtle way, a model where \(\beta\) varies in a narrow range will be less flexible (more regularized) than a model where \(\beta\) can take any value. WAIC properly formalizes this intuition.
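Using the same \((S, N)\) log-likelihood matrix convention as in the ELPD sketch above, WAIC could be sketched as:

import numpy as np
from scipy.special import logsumexp

def waic(log_lik):
    """log_lik: array of shape (S, N) with log p(y_i | theta^j)."""
    S = log_lik.shape[0]
    lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(S))
    # penalization: variance of the log-likelihood over the posterior draws
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return lppd - p_waic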

7.10 Cross validation

There is an alternative route to penalizing the ELPD: what if we just compute it for unobserved data? This is sometimes done when the dataset is very large, but even then we risk wasting relevant information. A common practice, instead, is to use a method called cross-validation.

Cross-validation is a simple and, in most cases, effective solution for comparing models. We take our data and divide it into \(K\) slices, trying to keep the portions more or less equal (in size, and sometimes also in other characteristics, such as an equal number of classes). We then use \(K-1\) portions to fit the model and the remaining one to evaluate it. This process is systematically repeated, leaving out a different portion at each iteration and using it as the evaluation set, until we have completed \(K\) fit-evaluate rounds. The accuracy of the model will be the average over the \(K\) rounds. This is known as K-fold cross-validation. Finally, once we have cross-validated, we use all the data to fit our model, and this is the model used to make predictions or for any other purpose.
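Here is a minimal sketch of the K-fold bookkeeping with NumPy; the “model” is just the training mean and the score is the mean squared error, standing in for an actual model fit and evaluation:

import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(0, 1, size=100)        # synthetic data
K = 5

# shuffle the indices and split them into K folds
folds = np.array_split(rng.permutation(len(y)), K)

scores = []
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # "fit": the training mean stands in for an actual model
    # "evaluate": mean squared error on the held-out fold
    y_hat = y[train_idx].mean()
    scores.append(np.mean((y[test_idx] - y_hat) ** 2))

cv_score = np.mean(scores)            # average over the K rounds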

When \(K\) is equal to the number of data points, we get what is known as leave-one-out cross-validation (LOO-CV).

Cross-validation is routine practice in machine learning, and we have barely described its most essential aspects. For more information you can read The Hundred-Page Machine Learning Book or Python Machine Learning, by Sebastian Raschka.

One downside of cross-validation is that it is computationally expensive: we need to fit the model \(K\) times, which for a large dataset can be very costly. But luckily for us, by being Bayesian we can compute an approximation to LOO-CV in a very fast way.

7.11 ELPD and cross validation

The LOO-CV ELPD can be computed as:

\[ \sum_i^N \log \left( \frac{1}{S}\sum_j^S \mathbin{\color{#E9692C}{p(y_i \mid \theta _{-i}^j)}} \right) \]

where \(_{-i}\) means that we leave observation \(i\) out. A naive implementation of this estimator requires estimating as many posterior distributions as we have observations, since for each of them we eliminate one observation. However, this is not necessary, as it is possible to estimate \(\color{#E9692C}{p(y_i \mid \theta _{-i}^j})\) using importance sampling.

Before continuing with our agenda, we need to take a short detour.

7.11.1 Importance Sampling

This is a technique for estimating properties of a distribution of interest \(f\) when we only have samples from a distribution \(g\). Using importance sampling makes sense, for example, when it is simpler to sample from \(g\) than from \(f\).

If we have a set of samples of the random variable \(X\) and we can evaluate \(g\) and \(f\) point-wise, we can calculate the importance weights as:

\[\begin{equation} w_i = \frac{f(x_i)}{g(x_i)} \end{equation}\]

Computationally it looks like this:

  • Extract \(N\) samples \(x_i\) from \(g\)
  • Evaluate the density \(g\) at each sample, \(g(x_i)\)
  • Evaluate \(f\) on the \(N\) samples \(f(x_i)\)
  • Calculate the importance weights \(w_i = \frac{f(x_i)}{g(x_i)}\)

Once the weights \(w_i\) are obtained, we can use them to estimate properties of \(f\), its density, moments, quantiles, etc.

In the code block below \(g\) is a Normal distribution and \(f\) is a Gamma, and we use importance sampling to estimate the PDF of \(f\). This is just a pedagogical example, since we actually have a very direct way to calculate the PDF of a Gamma, but in practice \(f\) can be a much more complex object.

g = pz.Normal(0, 10)                       # proposal distribution we can sample from
samples = g.rvs(1000, random_state=SEED)   # draw samples from g
f = pz.Gamma(mu=4, sigma=1.5)              # target distribution

w = f.pdf(samples) / g.pdf(samples)        # importance weights

ax = f.plot_pdf()
ax.hist(samples, bins=100, density=True, weights=w, 
        alpha=0.6, color='C2', label='Weighted samples')
ax.set_xlim(0, 15);

When doing importance sampling, the more similar \(g\) and \(f\) are, the better the results will be. In practice, inferences are more reliable when \(g\) has a larger support than \(f\), that is, when it is “wider”. Intuitively, we need the samples from \(g\) to cover the entire support of \(f\), or at least to ensure we are not missing any of its high-density regions.

7.11.2 Coming back to our discussion

Now that we have a better idea of importance sampling, let’s see how we can use it. The distribution we know is the posterior, and the one we want to approximate by importance sampling is the leave-one-out posterior \(p(\theta \mid y_{-i})\). Therefore, the importance weights that we are interested in calculating are:

\[ w_i^j \propto \frac{p(\theta^j \mid y_{-i} )}{p(\theta^j \mid y)} \propto \frac{p(\theta^j) \prod_{k \neq i}^N p(y_k \mid \theta^j)}{p(\theta^j) \prod_k^N p(y_k \mid \theta^j)} \propto \frac{1}{p(y_i \mid \theta^j) } \]

That is, all terms in the numerator and the denominator cancel out except for one: the likelihood of the observation we want to remove, which remains in the denominator. Note that these weights are not normalized, so to use them we need to divide each weight by the total sum of the weights. Then we can use the weights to estimate the ELPD as:

\[ \sum_i^N \log \left( \frac{\sum_j^S w_i^j \; p(y_i \mid \theta^j)}{\sum_j^S w_i^j} \right) \]

This result is fantastic news: it tells us that we can calculate the leave-one-out cross-validation ELPD without having to refit the model \(N\) times.
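A sketch of the plain importance-sampling version of this estimator (without the Pareto smoothing discussed below), again from an \((S, N)\) matrix of pointwise log-likelihoods:

import numpy as np
from scipy.special import logsumexp

def loo_is(log_lik):
    """Naive importance-sampling LOO from log p(y_i | theta^j), shape (S, N)."""
    log_w = -log_lik                                  # w_i^j proportional to 1 / p(y_i | theta^j)
    log_w -= logsumexp(log_w, axis=0, keepdims=True)  # self-normalize per observation
    # sum_j w_i^j p(y_i | theta^j) per observation, then log, then sum over i
    return np.sum(logsumexp(log_w + log_lik, axis=0))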

The catch (because there is always a catch) is that the target \(p(\theta^j \mid y_{-i} )\) is expected to be “wider” than \(p(\theta^j \mid y)\), since it is a posterior distribution estimated with one less observation. This is the opposite of the ideal case in importance sampling. In many cases the difference may not be relevant, since eliminating an observation can lead to a practically equivalent posterior distribution. But in some cases the difference can be relatively large. When? Well, the more “influential” the observation, the larger the difference when we remove it.

In terms of importance sampling, this translates into weights with greater relative importance, which therefore tend to dominate the estimation. One way to correct this problem is to simply truncate the “too high” weights, but this brings other problems that we are not going to discuss. Another way is to rely on theory. The theory indicates that, under certain conditions, the high weights are distributed according to a Pareto distribution. So instead of truncating them, we can fit them to a Pareto distribution and then replace them with values obtained from that distribution. This is a form of smoothing that, within a certain range, stabilizes the importance sampling estimate, since it makes some “very large” values not so large.

When we combine all these ideas we get a method called Pareto Smoothed Importance Sampling Leave-One-Out Cross-Validation (Vehtari, Gelman, and Gabry 2017; Yao et al. 2018), which is abbreviated as PSIS-LOO-CV. Since the name and acronym are horribly long and difficult to pronounce, we will call it LOO.

7.11.3 Calculating LOO

After all this introduction, calculating LOO may seem somewhat disappointing. We just need to call ArviZ’s loo function and pass it a DataTree object containing a log-likelihood group.

For the following example we are using a DataTree distributed with ArviZ. More details about the model are given in Chapter 8. For the moment we only need to know that it is a model with two sets of observations, home_points and away_points. We can compute LOO for each one of them.

dt_rugby = azp.load_arviz_data('rugby')
elpd = azp.loo(dt_rugby, var_name="home_points")
elpd
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/arviz_stats/loo/helper_loo.py:857: UserWarning: Estimated shape parameter of Pareto distribution is greater than 0.70 for one or more samples. You should consider using a more robust model, this is because importance sampling is less likely to work well if the marginal posterior and LOO posterior are very different. This is more likely to happen with a non-robust model and highly influential observations.
  warnings.warn(
Computed from 2000 posterior samples and 60 observations log-likelihood matrix.

         Estimate       SE
elpd_loo  -282.09    26.49
p_loo       25.16        -

There has been a warning during the calculation. Please check the results.
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.70]   (good)       57   95.0%
   (0.70, 1]   (bad)         3    5.0%
    (1, Inf)   (very bad)    0    0.0%

We can see that we get the estimated ELPD value using LOO and its standard error. p_loo can be roughly interpreted as the effective number of parameters. For some models this number should be close to the actual number of parameters; for models with regularization, like hierarchical models, it should be less than the actual number of parameters.

7.11.4 Comparing models with LOO

target = pz.StudentT(nu=4, mu=0, sigma=1).rvs(200, random_state=SEED)

with pm.Model() as model_n:
    μ = pm.Normal("μ", 0, 1)
    σ = pm.HalfNormal("σ", 1)
    pm.Normal("y", μ, σ, observed=target)
    idata_n = pm.sample(idata_kwargs={"log_likelihood":True})
    
with pm.Model() as model_t:
    μ = pm.Normal("μ", 0, 1)
    σ = pm.HalfNormal("σ", 1)
    ν = pm.Exponential("ν", scale=30)
    pm.StudentT("y", nu=ν, mu=μ, sigma=σ, observed=target)
    idata_t = pm.sample(idata_kwargs={"log_likelihood":True})
## coming soon
cmp_df = azp.compare({'model_n':idata_n, 'model_t':idata_t})
cmp_df
          rank        elpd         p  elpd_diff        weight         se       dse  warning
model_t      0 -327.519119  3.406163   0.000000  1.000000e+00  13.470384  0.000000    False
model_n      1 -334.187731  3.187292   6.668612  6.217249e-15  15.124893  4.510585    False

In the rows we have the compared models and in the columns we have

  • rank: the order of the models (from best to worst)
  • elpd: the point estimate of the ELPD using LOO
  • p: the effective number of parameters
  • elpd_diff: the difference between the ELPD of the best model and the other models
  • weight: the relative weight of each model. If we wanted to make predictions by combining the different models, instead of choosing just one, this would be the weight we should assign to each model. In this case we see that model_t takes all the weight.
  • se: the standard error of the ELPD
  • dse: the standard error of the differences
  • warning: a warning about whether there is at least one high k value
  • scale: the scale on which the ELPD is calculated

We can obtain similar information, but graphically, using the azp.plot_compare function

azp.plot_compare(cmp_df);

  • The open circles represent the ELPD values and the black lines their standard errors.
  • The highest ELPD value is indicated with a vertical dashed gray line for easy comparison with the other values.
  • For all models except the best one, we also get a triangle indicating the ELPD difference between that model and the best model, with a gray error bar indicating the standard error of the difference.

The simplest way to use information criteria is to choose a single model: simply pick the one with the highest ELPD value. If we follow this rule, we will have to accept that the Student-t model (model_t) is the best. We should also look at the uncertainty of this comparison: here the ELPD difference (about 6.7) is larger than its standard error (about 4.5), which gives us some confidence that the models are indeed different from each other. If, instead, the difference were small compared to its standard error, we should provide a more nuanced answer.

7.11.5 Pareto k and LOO diagnostics

In the output of loo we can also see a table with the title “Pareto k diagnostic values”. We previously said that we used a method involving a Pareto distribution to regularize the estimation of the importance weights. One of the parameters of that fit is called \(k\); sometimes we call it \(\hat k\) (because it is an estimate of \(k\)). Since we have one Pareto fit per observation, we have one \(k\) value per observation. This parameter is useful because it tells us two sides of the same story: it tells us when an observation is “very influential”, and it tells us that the approximation used by LOO could be failing for that observation.

As a general rule, if \(k\) is less than 0.7 there are no problems, if it is between 0.7 and 1 it is very likely that we are in trouble, and if it is greater than 1 we are doomed. The cut-off value of 0.7 is not fixed; strictly, it can be lower and depends on the total number of posterior samples (2000 in this example), but with around 2000 draws we are almost exactly at 0.7. In practice it is common to use 2000 samples or more. Increasing the number of posterior samples may reduce the value of \(k\), and so we could remove some of these warnings, but in general the number needed will be too large to make any practical sense.

It is possible to visualize the values of \(k\) using plot_khat:

# coming soon
#azp.plot_khat(elpd, threshold=0.7);

While the main use of LOO is to compare models, the values of \(k\) can be useful even if we have only one model. For example, we could have extra knowledge that tells us why these observations are influential; perhaps there was a problem in data collection and the values are incorrect. Or perhaps the values are correct, but from the perspective of our model they are influential, “strange”, “surprising”.

If \(k > 0.7\), the value of p_loo can give us some more information. In what follows, \(p\) is the total number of parameters in the model.

  • If \(p_{\text{loo}} \ll p\), the model is likely misspecified. This should also show up in posterior predictive checks. One solution is to use an overdispersed model (such as changing a Poisson for a NegativeBinomial, a ZeroInflatedPoisson, or a HurdlePoisson, or changing a Normal for a Student’s T, etc.). It may also be that the model needs more structure or complexity; perhaps we need a non-linear term, etc.

  • If \(p_{\text{loo}} < p\) and the observations are relatively few compared to \(p\) (say \(p>N/5\)), it is likely that we have a model that is too flexible and/or priors that are too vague. This can happen for hierarchical models with very few observations per group, or for splines with many knots, or Gaussian processes with very short length scales.

  • If \(p_{\text{loo}} > p\), the model has very serious problems. If \(p \ll N\), posterior predictive checks should also report problems. If, however, \(p\) is relatively large (say \(p>N/5\)), posterior predictive checks may not reflect these problems.

7.12 LOO and WAIC

LOO and WAIC converge asymptotically and are based on the same set of assumptions, so theoretically they are equivalent. However, in practice LOO is more robust, and it also offers a diagnostic that indicates when it could be failing (thanks to the Pareto fit). So in practice we prefer LOO.

7.13 Other information criteria

Another widely used information criterion is DIC. If we use the bayes-o-meter™, DIC is more Bayesian than AIC but less Bayesian than WAIC. Although still popular, WAIC and mainly LOO have proven to be more useful both theoretically and empirically than DIC. Therefore we DO NOT recommend its use.

Another widely used criterion is BIC (Bayesian Information Criterion); like logistic regression and my mother’s dry soup, this name can be misleading. BIC was proposed as a way to correct some of the problems with AIC, and its author proposed a Bayesian rationale for it. But BIC is not really Bayesian, in the sense that, like AIC, it assumes flat priors and uses a maximum likelihood estimate.

But more importantly, BIC differs from AIC and WAIC in its objective. AIC and WAIC try to reflect which model generalizes better to other data (predictive accuracy), while BIC tries to identify which is the correct model, and is therefore more closely related to Bayes factors than to WAIC. Later we will discuss Bayes factors and see how they differ from criteria such as WAIC and LOO.

7.14 Bayes factors

Bayes factors are an alternative to cross-validation, approximate cross-validation with LOO, and information criteria. It is common for Bayes factors to show up in the literature as a Bayesian alternative to frequentist hypothesis testing.

We can compare \(K\) models by computing their marginal likelihood, \(p(y \mid M_k)\), i.e., the probability of the observed data \(Y\) given the model \(M_k\). The marginal likelihood is the normalization constant of Bayes’ theorem. We can see this if we write Bayes’ theorem and make explicit the fact that all inferences depend on the model.

\[ p (\theta \mid Y, M_k ) = \frac{p(Y \mid \theta, M_k) p(\theta \mid M_k)}{p(Y \mid M_k)} \]

where \(Y\) is the data, \(\theta\) the parameters, and \(M_k\) one of the \(K\) competing models.

If our main objective is to choose only one model, the best from a set of models, we can choose the one with the largest value of \(p(y \mid M_k)\). This is fine if we assume that all models have the same prior probability. Otherwise, we must calculate:

\[ p(M_k \mid y) \propto p(y \mid M_k) p(M_k) \]

If, instead, our main objective is to compare models to determine which are more likely and to what extent, this can be achieved using the Bayes factors:

\[ BF_{01} = \frac{p(y \mid M_0)}{p(y \mid M_1)} \]

That is, the ratio of the marginal likelihoods of two models. The higher the value of \(BF_{01}\), the better the model in the numerator (\(M_0\) in this example). To facilitate the interpretation of Bayes factors, and to put numbers into words, Harold Jeffreys proposed a scale for their interpretation, with levels of support or strength; see the following table.

Bayes Factor   Support
1–3            Anecdotal
3–10           Moderate
10–30          Strong
30–100         Very Strong
>100           Extreme

Keep in mind that if you get numbers below 1, then the support is for \(M_1\), i.e., the model in the denominator. Tables are also available for those cases, but notice that you can simply take the inverse of the obtained value.

It is very important to remember that these rules are just conventions, simple guides at best. Results should always be put in the context of our problems and should be accompanied by enough detail so that others can assess for themselves whether they agree with our conclusions. The evidence needed to claim something in particle physics, or in court, or to decide to carry out an evacuation in the face of a looming natural catastrophe is not the same.

7.14.1 Some observations

We will now briefly discuss some key facts about the marginal likelihood:

  • The good: Occam’s razor included. Models with lots of parameters have a higher penalty than models with few parameters. The intuitive reason is that the greater the number of parameters, the more spread out the prior is with respect to the likelihood. An example where it is easy to see this is with nested models: for example, a polynomial of order 2 “contains” the polynomial models of order 1 and order 0.
  • The bad: For many problems, the marginal likelihood cannot be calculated analytically. Also, approximating it numerically is usually a difficult task that in the best of cases requires specialized methods and, in the worst case, yields estimates that are either impractical or unreliable. In fact, part of the popularity of MCMC methods is that they allow us to obtain the posterior distribution without the need to calculate the marginal likelihood.
  • The ugly: The marginal likelihood depends very sensitively on the prior distribution of the parameters in each model \(p(\theta_k \mid M_k)\).

It is important to note that the good and the ugly points are related. Using marginal likelihood to compare models is a good idea because it already includes a penalty for complex models (which helps us prevent overfitting), but at the same time, a change in the prior will affect the marginal likelihood calculation. At first, this sounds a bit silly; we already know that priors affect calculations (otherwise we could just avoid them). But the point is that there are changes in the prior that would have a small effect on the posterior yet a great impact on the value of the marginal likelihood.

The use of Bayes factors is often a watershed among Bayesians. The difficulty of their calculation and the sensitivity to the priors are some of the arguments against them. Another reason is that, like p-values and hypothesis testing in general, Bayes factors favor dichotomous thinking over the estimation of the “effect size”. In other words, instead of asking questions like “How many more years of life can a cancer treatment provide?”, we end up asking whether the difference between treating and not treating a patient is “statistically significant”. Note that this last question can be useful in some contexts. The point is that in many other contexts this type of question is not the one that interests us; it is only the one that we were taught to answer.

7.14.2 Calculation of Bayes factors

The marginal likelihood (and the Bayes factors derived from it) is generally not available in closed form, except for a few models. For this reason, many numerical methods have been devised for its calculation. Some of these methods are so simple and naive that they work very poorly in practice. We are going to discuss only one way to compute Bayes factors, one that can be applied in some particular cases.

7.14.3 Savage–Dickey ratio

There are times when we want to compare a null hypothesis \(H_0\) (or null model) against an alternative hypothesis \(H_1\). For example, to answer the question “Is this coin biased?”, we could compare the value \(\theta = 0.5\) (representing no bias) with the output of a model in which we allow \(\theta\) to vary. For this type of comparison, the null model is nested within the alternative, which means that the null is a particular value of the model we are building. In those cases, calculating the Bayes factor is very easy and does not require any special methods. We only need to compare the prior and posterior evaluated at the null value (for example, \(\theta = 0.5\)) under the alternative model. We can see that this is true from the following expression:

\[ BF_{01} = \frac{p(y \mid H_0)}{p(y \mid H_1)} = \frac{p(\theta=0.5 \mid y, H_1)}{p(\theta=0.5 \mid H_1)} \]

This is true only when \(H_0\) is a particular case of \(H_1\).

Let’s do it. We only need to sample the prior and posterior for a model. Let’s try the BetaBinomial model with a Uniform prior:

y = np.repeat([1, 0], [50, 50])  # 50 heads, 50 tails
with pm.Model() as model_uni:
    a = pm.Beta("a", 1, 1)
    yl = pm.Bernoulli("yl", a, observed=y)
    idata_uni = pm.sample(2000, random_seed=42)
    idata_uni.extend(pm.sample_prior_predictive(8000))
## coming soon

And now we call azp.plot_bf

azp.plot_bf(idata_uni, var_names="a", ref_val=0.5);

In the previous Figure we can see one KDE for the prior (black) and one for the posterior (gray). The two black dots show that we evaluated both distributions at the value 0.5. We can see that the Bayes factor in favor of the null hypothesis, BF_01, is \(\approx 8\), which we can interpret as moderate evidence in favor of the null hypothesis.
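Because the Beta prior is conjugate to the Bernoulli likelihood, we can cross-check this number analytically: with a Beta(1, 1) prior and 50 heads and 50 tails, the posterior is Beta(51, 51), so the Savage–Dickey ratio is just the ratio of two Beta densities evaluated at 0.5. A sketch using PreliZ (assuming its pdf method):

import preliz as pz

# Savage-Dickey: BF_01 = posterior density / prior density, both at theta = 0.5
posterior = pz.Beta(51, 51)   # conjugate update of Beta(1, 1) with 50 heads, 50 tails
prior = pz.Beta(1, 1)
bf_01 = posterior.pdf(0.5) / prior.pdf(0.5)   # approximately 8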

As we have already discussed, Bayes factors measure which model, as a whole, is better at explaining the data. This includes the prior, even for models where the prior has a relatively low impact on the computation of the posterior. We can also see this prior effect by comparing a second model to the null model.

If, instead, our model were a BetaBinomial with a Beta(30, 30) prior, the BF_01 would be lower (anecdotal on the Jeffreys scale). This is because, according to this model, the value of \(\theta=0.5\) is much more likely a priori than under a Uniform prior, and therefore the prior and posterior will be much more similar. That is, it is not very surprising to see that the posterior is concentrated around 0.5 after collecting data. Don’t just believe me, let’s calculate it:

with pm.Model() as model_conc:
    a = pm.Beta("a", 30, 30)
    yl = pm.Bernoulli("yl", a, observed=y)
    idata_conc = pm.sample(2000, random_seed=42)
    idata_conc.extend(pm.sample_prior_predictive(8000))
## coming soon
azp.plot_bf(idata_conc, var_names=["a"], ref_val=0.5);

We can see that the BF_01 is \(\approx 1.6\), which we can interpret as anecdotal evidence in favor of the null hypothesis (see the Jeffreys’ scale, discussed earlier).

7.14.4 Bayes factors vs the alternatives

We could say that Bayes factors measure which model, as a whole, is better at explaining the data. This includes the details of the prior, no matter how similar the model predictions are. In many scenarios, this is not what interests us when comparing models. For many real problems, priors are not intended to be an accurate description of the true prior distribution of the parameters; instead, in many problems priors are chosen using some information and with the goal of providing some regularization. In these cases, we prefer to evaluate models in terms of how similar their predictions are, and for that we can use LOO.

7.15 Absolute metrics

In the previous sections we have seen how to compare models using relative metrics. By relative we mean that we are comparing models against each other. For example, we can say that model A is better than model B because it has a higher ELPD value. But we cannot, in general, judge a single model by its ELPD alone. In contrast, sometimes we are interested in absolute metrics. Some common absolute metrics are the root mean squared error (RMSE), the mean absolute error (MAE), the coefficient of determination (\(R^2\)), etc. Interestingly, we can use the PSIS-LOO-CV procedure to compute the leave-one-out cross-validation version of these metrics.

7.15.1 LOO expectations and metrics

From the PSIS smoothing procedure we obtain a set of weights. The loo function uses these weights to estimate the ELPD we would have obtained had we performed actual leave-one-out cross-validation. Furthermore, these weights can be used to estimate other quantities, such as the mean, standard deviation, and quantiles, as if we had performed leave-one-out cross-validation. For example, to compute the 25th and 75th percentiles we use:

azp.loo_expectations(dt_rugby, kind="quantile", probs=[0.25, 0.75])
(<xarray.DataArray 'home_points' (quantile: 2, match: 60)> Size: 960B
 array([[44.        , 12.        , 25.        , 18.        , 14.        ,
         32.        , 21.        , 15.        , 16.        , 38.        ,
         16.        , 20.        , 10.        , 23.        , 10.        ,
         18.        ,  9.        , 21.        , 45.        , 20.        ,
         13.        , 33.        , 15.        , 17.        , 14.        ,
         30.        , 14.        , 11.        , 10.        , 22.        ,
         33.        , 13.        , 18.        , 11.        , 26.        ,
         10.        , 22.        , 15.        , 16.        , 37.        ,
         21.        , 14.        , 37.        , 24.        , 13.        ,
         10.        , 24.        , 12.        ,  8.        , 18.        ,
         20.        , 13.        , 20.        , 47.        , 14.        ,
         12.        , 25.        , 31.        , 14.        , 17.        ],
        [54.        , 18.        , 32.        , 24.        , 20.        ,
         41.        , 28.        , 21.        , 23.        , 47.        ,
         22.        , 28.        , 15.        , 31.        , 15.        ,
         25.        , 13.        , 28.        , 56.        , 27.        ,
         19.        , 42.        , 21.        , 23.        , 20.        ,
         39.        , 19.        , 16.        , 16.        , 28.        ,
         41.        , 19.        , 25.        , 16.        , 35.        ,
         16.        , 29.        , 21.        , 22.        , 46.        ,
         28.        , 20.        , 47.        , 31.        , 19.        ,
         14.        , 31.        , 17.        , 13.        , 25.        ,
         27.        , 18.67750933, 27.        , 57.        , 20.        ,
         17.        , 33.        , 39.        , 20.        , 23.        ]])
 Coordinates:
   * match      (match) <U16 4kB 'Wales Italy' ... 'Ireland England'
     home_team  (match) <U8 2kB 'Wales' 'France' 'Ireland' ... 'France' 'Ireland'
     away_team  (match) <U8 2kB 'Italy' 'England' ... 'Wales' 'England'
   * quantile   (quantile) float64 16B 0.25 0.75,
 <xarray.DataArray 'home_points' (match: 60)> Size: 480B
 array([7.50028788e-01, 2.62256698e-01, 1.93223302e-01, 1.52904820e-01,
        3.95326201e-01, 2.98279344e-01, 1.44368719e-01, 1.65049209e-01,
        3.52574986e-01, 1.01372700e-01, 8.96101935e-02, 1.95232417e-01,
        8.98952875e-02, 6.42184415e-01, 1.52603769e-01, 2.16621010e-01,
        2.70423313e-01, 2.17483737e-01, 1.12000824e-01, 2.34065694e-01,
        3.44365149e-01, 5.87751125e-01, 6.67693874e-04, 3.16837206e-02,
        3.19593057e-01, 2.86046661e-01, 3.39140134e-01, 9.68906809e-02,
        1.53910692e-01, 7.44099976e-01, 4.22163994e-01, 1.94504042e-01,
        1.87586438e-01, 2.38470840e-01, 1.49711167e-01, 1.11869872e-01,
        8.62751899e-02, 1.65049209e-01, 1.15173164e-01, 2.60963738e-01,
        2.63234717e-01, 2.95219301e-01, 4.70363709e-01, 1.38191562e-01,
        1.60060986e-01, 2.91765576e-01, 1.48658671e-01, 2.48645465e-01,
        1.58467190e-01, 2.16621010e-01, 1.38101342e-01, 5.00290966e-01,
        2.20843413e-01, 2.26785153e-01, 2.98698343e-01, 1.53507672e-01,
        7.74426370e-01, 3.84868225e-01, 1.56081239e-03, 7.33808169e-02])
 Coordinates:
   * match      (match) <U16 4kB 'Wales Italy' ... 'Ireland England'
     home_team  (match) <U8 2kB 'Wales' 'France' 'Ireland' ... 'France' 'Ireland'
     away_team  (match) <U8 2kB 'Italy' 'England' ... 'Wales' 'England')

Similarly, we may be interested in computing estimates of leave-one-out predictive metrics given a set of predictions and observations. For instance, to compute the root mean squared error we can do:

azp.loo_metrics(dt_rugby, kind="rmse", var_name="home_points")
rmse(mean=11.0, se=1.2)

Notice that for loo_metrics, if we have more than one observed variable, we must specify the one we care about using the var_name argument, as in the previous example.

7.15.2 LOO-PIT

Another quantity of interest that we can obtain via PSIS-LOO-CV is the PIT values. As already mentioned in Section 5.2.4, we are often interested in computing:

\[ p(\tilde y_i \le y_i \mid y_{-i}) \]

That is, we are evaluating the model’s ability to predict an observation when we remove that observation from the observed data. We can use PSIS-LOO-CV to estimate this from a single model fit.

azp.plot_loo_pit(dt_rugby, var_names="home_points")
Figure 7.2: LOO-PIT plot for the rugby model