5  Prior and Posterior predictive checks

Models are simplifications of reality, sometimes even very crude simplifications. Thus, we can never fully trust them. While hoping they are good enough is an option, we should try to do better. One general approach to criticizing models is to judge them by their predictions. If a model is robust, its predictions should align well with observations, domain knowledge, or some other benchmark. There are at least four avenues to explore:

As we can see, there are plenty of options for evaluating models. But there is one additional ingredient to add to the mix: so far we have omitted the fact that there are different types of predictions. An attractive feature of Bayesian models is that they are generative. This means that we can simulate synthetic data from them as long as the parameters are assigned a proper probability distribution; computationally, all we need is a distribution from which we can generate random samples. We can take advantage of this feature to check models before or after fitting the data:

  • Prior predictive distributions: synthetic data generated using only the priors, before seeing the observations.
  • Posterior predictive distributions: synthetic data generated from the posterior, after fitting the model to the observations.

Additionally, for models like linear regression, where we have a set of covariates, we can generate synthetic data evaluated at the observed covariates (our “Xs”) or at different values (“X_new”). In the first case we talk about in-sample predictions, and in the second case about out-of-sample predictions.
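
To make the distinction concrete, here is a rough PyMC sketch for a simple linear regression, with hypothetical arrays x_obs, y_obs, and x_new (the key ingredients are pm.Data and pm.set_data):

import numpy as np
import pymc as pm

# hypothetical data, for illustration only
rng = np.random.default_rng(793)
x_obs = np.linspace(0, 1, 50)
y_obs = 2 + 3 * x_obs + rng.normal(0, 0.5, size=50)
x_new = np.linspace(1, 2, 20)

with pm.Model() as lin_model:
    x = pm.Data("x", x_obs)
    alpha = pm.Normal("alpha", 0, 10)
    beta = pm.Normal("beta", 0, 10)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("y", mu=alpha + beta * x, sigma=sigma, observed=y_obs, shape=x.shape)
    idata_lin = pm.sample(random_seed=793)
    # in-sample: predictions evaluated at the observed covariates
    pm.sample_posterior_predictive(idata_lin, extend_inferencedata=True)
    # out-of-sample: swap in new covariate values before predicting
    pm.set_data({"x": x_new})
    pm.sample_posterior_predictive(idata_lin, predictions=True, extend_inferencedata=True)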

With so many options, it is easy to feel overwhelmed. Which ones to use will depend on what we want to evaluate, and we can combine several of them for different purposes. In the next sections, we will see how to implement some of these checks.

5.1 Prior predictive checks

The idea behind prior predictive checks is very general and simple: if a model is good, it should be able to generate data resembling our prior knowledge. We call these checks prior predictive because we are generating synthetic data before we have seen the actual data.

The general algorithm for prior predictive checks is:

  1. Draw \(N\) realizations from a prior distribution.
  2. For each draw, simulate new data from the likelihood.
  3. Plot the results.
  4. Use domain knowledge to assess whether simulated values reflect prior knowledge.
  5. If simulated values do not reflect prior knowledge, change the prior distribution, likelihood, or both and repeat the simulation from step 1.
  6. If simulated values reflect prior knowledge, compute the posterior.

Notice that in step 4 we use domain knowledge, NOT observed data!

In steps 1 and 2 what we are doing is approximating this integral: \[ p(y^\ast) = \int_{\Theta} p(y^\ast \mid \theta) \; p(\theta) \; d\theta \]

where \(y^\ast\) represents unobserved but potentially observable data. Notice that to compute \(y^\ast\) we are evaluating the likelihood over all possible values of the prior. Thus, we are effectively marginalizing out the values of \(\theta\), the parameters.
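
For concreteness, here is a minimal NumPy sketch of steps 1 and 2, using the same priors as the first model below:

import numpy as np

rng = np.random.default_rng(793)
n_draws = 500
mu_prior = rng.normal(0, 10, size=n_draws)             # mu ~ Normal(0, 10)
sigma_prior = np.abs(rng.normal(0, 10, size=n_draws))  # sigma ~ HalfNormal(10)
# one simulated observation per prior draw; together they approximate p(y*)
y_star = rng.normal(mu_prior, sigma_prior)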

To exemplify a prior predictive check, let’s try a super simple example. Say we want to model the height of humans. We know that heights are positive numbers, so we should use a distribution that assigns zero mass to negative values. But we also know that, at least for adults, a normal distribution can be a good approximation. So we create the following model without too much thought, and then draw 500 samples from the prior predictive distribution.

with pm.Model() as model: 
    # Priors for unknown model parameters
    mu = pm.Normal('mu', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=10)
    # Likelihood (sampling distribution) of observations
    y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=y)
    # draw 500 samples from the prior predictive
    idata = pm.sample_prior_predictive(samples=500, random_seed=SEED)
Sampling: [Y_obs, mu, sigma]

In the following plot, we can see samples from the prior predictive distribution (blue solid lines). If we aggregate all the individual samples into a single large sample we get the dashed cyan line. To help us interpret this plot we have added two reference values: the average length/height of a newborn and the average (male) adult height. Reference values are values that are meaningful to the problem at hand and that we obtain from domain knowledge (not from the observed data). We use them to get a sense of the scale of the expected data. Expected or typical values could also be used.

We can see that our model is bananas: not only can heights be negative, but the bulk of the prior predictive distribution is outside of our reference values.

ax = az.plot_ppc(idata, group="prior")
ax.axvline(50, color="0.5", linestyle=":", label="newborn")
ax.axvline(175, color="0.5", linestyle="--", label="adult")
plt.legend()
Figure 5.1: The prior predictive check for the model of heights. We can see that the bulk of the samples are outside the reference values.

We can tighten up the priors. There is no general rule for doing this. For most problems, it is usually a good idea to set priors using some broad domain knowledge, in such a way that we get a prior predictive distribution that allocates most of its mass in a reasonable region. These priors are called weakly informative priors. While there isn’t a formal definition of what a weakly informative prior is, we can think of them as priors that generate a prior predictive distribution with none to very little mass in disallowed or unlikely regions. For instance, we can use a normal distribution with a mean of 175 and a standard deviation of 10. This distribution doesn’t exclude negative values, but it assigns very little mass to them. It is also broad enough to allow for a wide range of values.

with pm.Model() as model: 
    # Priors for unknown model parameters
    mu = pm.Normal('mu', mu=175, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=10)
    # Likelihood (sampling distribution) of observations
    y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=y)
    # draw 500 samples from the prior predictive
    idata = pm.sample_prior_predictive(samples=500, random_seed=SEED)
Sampling: [Y_obs, mu, sigma]

We repeat the previous plot with the new prior predictive distribution. We can see that the bulk of the prior predictive distribution is within the reference values. The model predicts values above 200 cm and below 150 cm, which are indeed possible but less likely. You are free to pick other priors and other reference values. Maybe you could use the historical records for the tallest and shortest persons in the world as reference values.

ax = az.plot_ppc(idata, group="prior")
ax.axvline(50, color="0.5", linestyle=":", label="newborn")
ax.axvline(175, color="0.5", linestyle="--", label="adult")
plt.legend()
Figure 5.2: The prior predictive check for the model of heights with a narrower prior than in Figure 5.1. Predictions are closer to our domain knowledge about human heights.

Weakly informative priors can have practical advantages over very vague priors, because adding some prior information is usually better than adding none. By adding some information we reduce the chances of spurious or nonsensical results. Additionally, weakly informative priors can provide computational advantages, such as faster sampling.

Weakly informative priors can also have a practical advantage over informative priors. To be clear, if you have trustworthy informative priors you should use them; it doesn’t make sense to ignore information that you have. But in many settings informative priors can be hard to set and very time-consuming to obtain. That’s why, in practice, weakly informative priors can be a good compromise.

ax = az.plot_ppc(idata, group="prior", kind="cumulative")
ax.axvline(50, color="0.5", linestyle=":", lw=3, label="newborn")
ax.axvline(175, color="0.5", linestyle="--", lw=3, label="adult")
plt.legend()
Figure 5.3: The prior predictive check for the model of heights. Same as Figure 5.2 but using empirical CDFs instead of KDEs.

When plotting many distributions, where each one spans a narrow range of values compared to the range spanned by the collection of distributions, it is usually a good idea to plot the cumulative distribution, as in the previous figure. The cumulative distribution is easier to interpret in these cases than plots that approximate the densities, like KDEs, histograms, or quantile dot plots.

Note

One aspect that is often overlooked is that even if the priors have little effect on the actual posterior, by doing prior predictive checks and playing with the priors we can get a better understanding of our models and problems. If you want to learn more about prior elicitation you can check the PreliZ library.

5.2 Posterior predictive checks

The idea behind posterior predictive checks is very general and simple: if a model is good, it should be able to generate data resembling the observed data. We call these checks posterior predictive because we are generating synthetic data after seeing the data.

The general algorithm for posterior predictive checks is:

  1. Draw \(N\) realizations from the posterior distribution.
  2. For each draw, simulate new data from the likelihood.
  3. Plot the results.
  4. Use observed data to assess whether simulated values agree with observed values.
  5. If simulated values do not agree with observations, change the prior distribution, likelihood, or both and repeat the simulation from step 1.
  6. If simulated values agree with the observed values, the model passes this check and we can continue with our analysis.

Notice that, in contrast with prior predictive checks, we use observations here. Of course, we can also include domain knowledge to assess whether the simulated values are reasonable, but because we are using observations we can perform more stringent evaluations.

In steps 1 and 2 what we are doing is approximating this integral: \[ p(\tilde y) = \int_{\Theta} p(\tilde y \mid \theta) \; p(\theta \mid y) \; d\theta \]

where \(\tilde y\) represents new observations, according to our model. The data generated is predictive since it is the data that the model expects to see.

Notice that what we are doing is marginalizing the likelihood over all possible values of the posterior. Therefore, from the perspective of our model, we are describing the marginal distribution of the data, that is, regardless of the values of the parameters.
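
To see what this marginalization amounts to in code, here is a minimal sketch, assuming idata already holds posterior draws of mu and sigma for the height model (as produced by the next code block):

import numpy as np

rng = np.random.default_rng(793)
mu_post = idata.posterior["mu"].values.ravel()      # flatten chains and draws
sigma_post = idata.posterior["sigma"].values.ravel()
# one simulated observation per posterior draw; together they approximate p(y~ | y)
y_tilde = rng.normal(mu_post, sigma_post)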

Continuing with our height example, we can generate synthetic data from the posterior predictive distribution.

with model: 
    idata = pm.sample(idata_kwargs={"log_likelihood": True}, random_seed=SEED)
    pm.sample_posterior_predictive(idata, random_seed=SEED, extend_inferencedata=True)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [mu, sigma]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 1 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
Sampling: [Y_obs]



And then we use ArviZ to plot the comparison. We can see that the model is doing a good job at predicting the data. The observed data (black line) is within the bulk of the posterior predictive distribution (blue lines).

The dashed orange line, labelled as “posterior predictive mean”, is the aggregated posterior predictive distribution. If you combine all the individual densities into a single density, that’s what you would get.

az.plot_ppc(idata, num_pp_samples=200, colors=["C0", "k", "C2"]);
Figure 5.4: Posterior predictive check for the model of heights.

Other common visualizations for comparing observed and predicted values are empirical CDFs, histograms, and, less often, quantile dot plots. As with other types of visualizations, you may want to try different options to be sure the visualizations are not misleading, and you may also want to adapt the visualization to your audience.

5.2.1 Using summary statistics

Besides directly comparing observations and predictions in terms of their densities, we can do comparisons in terms of summary statistics. Which ones we decide to use can vary from one data-analysis problem to another, and ideally they should be informed by the data-analysis goals. Because in posterior predictive checks we use the data twice, first for fitting the model and then for checking it, it is advisable to select test statistics that are orthogonal to the model parameters (Gabry et al. 2019). For example, in a Normal model with a location parameter, the mean should be easy to recover, so a posterior predictive check using the mean as a test statistic would not be a particularly stringent test. Since many common models have a location parameter, the mean is usually not a good test statistic.

The following plot shows a comparison in terms of the mean, median, and interquartile range (IQR). The dots at the bottom of each subplot correspond to the summary statistic computed for the observed data, and the KDE is for the model’s predictions.

_, ax = plt.subplots(1, 3, figsize=(12, 3))

def iqr(x, a=-1):
    """interquartile range"""
    return np.subtract(*np.percentile(x, [75, 25], axis=a))

az.plot_bpv(idata, kind="t_stat", t_stat="mean", ax=ax[0])
az.plot_bpv(idata, kind="t_stat", t_stat="median", ax=ax[1])
az.plot_bpv(idata, kind="t_stat", t_stat=iqr, ax=ax[2])
ax[0].set_title("mean")
ax[1].set_title("median")
ax[2].set_title("IQR");
Figure 5.5: Posterior predictive check for the model of heights using summary statistics.

The numerical values labelled as bpv are known as Bayesian p-values, or posterior predictive p-values (Gelman et al. 2013). If you want to use plot_bpv but prefer to omit the Bayesian p-values, pass bpv=False.

The Bayesian p-values correspond to the following probability: \[ p(T_{\text{sim}} \le T_{\text{obs}} \mid y) \]

Where \(T\) is the summary statistic of our choice, computed for both the observed data \(T_{\text{obs}}\) and the simulated data \(T_{\text{sim}}\).
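
ArviZ reports these values for us, but a minimal sketch of the computation by hand, assuming idata holds the posterior predictive samples of Y_obs from the height model and using the mean as the test statistic, looks like this:

import numpy as np

y_pred = idata.posterior_predictive["Y_obs"].values  # shape: (chains, draws, observations)
y_obs = idata.observed_data["Y_obs"].values

t_sim = y_pred.reshape(-1, y_pred.shape[-1]).mean(axis=1)  # T for each simulated dataset
t_obs = y_obs.mean()                                       # T for the observed data
bpv = (t_sim <= t_obs).mean()  # fraction of simulated T at or below the observed T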

A posterior predictive p-value of 0.5 indicates that half of the predictions are below the observed values and half above. Posterior predictive p-values do not in general have uniform distributions under the null hypothesis but instead tend to have distributions more concentrated near 0.5 (Gelman 2013). For instance, we already mentioned that the mean is easy to recover for many models and thus the posterior predictive p-value for the mean is often concentrated around 0.5.

The term “Bayesian p-values” may sound like an oxymoron or paradoxical (Meng 1994). Bayesian p-values are defined similarly to their frequentist cousins, hence the name, but they are used in a very different way. We use posterior predictive p-values as a diagnostic tool to assess potential mismatches between model and data, rather than as a measure of “statistical significance” or as a dichotomous decision tool. The null hypothesis is that the predictions from the model and the observed data are drawn from the same data-generating process, but in practice we are not interested in rejecting this hypothesis. We already know it is not true! Instead, we are interested in understanding how well the model is doing at predicting the data, detecting potential problems, and, if possible or desirable, improving the model.

5.2.2 PIT-ECDFs

Instead of using a summary statistic, as before, we can directly compare observations and predictions by computing: \[ p(\tilde y_i \le y_i \mid y) \]

This is often called the marginal p-value and the ideal distribution is the standard uniform distribution. The intuition is that if the model can generate predictions from the same distribution as the observed data, then the observed data can be thought of as just one random sample from the posterior predictive distribution. In this case, the observed data point is equally likely to appear anywhere within the range of the predicted values. This means there’s no systematic bias in where the observation falls, and the p-values derived from comparing the observed data to the predictions will be uniformly distributed.

A more formal justification for this result is provided by the Probability Integral Transform (PIT). This property, also known as the universality of the Uniform distribution, states that if \(Y\) is a random variable with a continuous distribution and cumulative distribution function (CDF) \(F_Y\), then the transformed variable

\[ U = F_Y(Y) \]

follows a standard Uniform distribution. A proof of this result can be found in The Book of Statistical Proofs.

In other words, if we apply the CDF of any continuous distribution to a random variable with that distribution, the result will be a random variable with a standard uniform distribution. This is a very powerful result, as it allows us to use the standard uniform distribution as a reference distribution for many statistical tests, including posterior predictive checks.
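
A quick numerical check of this property, using SciPy for the CDF of a (for instance) Normal distribution:

import numpy as np
from scipy import stats

rng = np.random.default_rng(793)
y = rng.normal(3, 2, size=10_000)  # samples from a Normal(3, 2)
u = stats.norm(3, 2).cdf(y)        # apply the CDF of that same distribution
print(np.quantile(u, [0.25, 0.5, 0.75]))  # close to (0.25, 0.5, 0.75), as for a Uniform(0, 1)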

As mentioned earlier, the marginal p-value is given by

\[ p(\tilde y_i \leq y_i \mid y). \]

If the observed data and predictions are drawn from the same distribution, this expression is then equivalent to the definition of the CDF:

\[ F_Y(y) = \mathrm{Pr}(Y \leq y). \]

Thus, we can see the computation of the marginal p-value as an application of the Probability Integral Transform.
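
In code, these marginal p-values (u-values) are just the fraction of posterior predictive draws at or below each observation. A minimal sketch, again assuming the height-model idata:

import numpy as np

y_pred = idata.posterior_predictive["Y_obs"].values
y_obs = idata.observed_data["Y_obs"].values

y_pred = y_pred.reshape(-1, y_obs.size)  # (simulations, observations)
pit = (y_pred <= y_obs).mean(axis=0)     # one u-value per observation
# if model and data agree, these values should look like draws from a Uniform(0, 1)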

In practice we don’t have the CDF, but this is not a problem, as we have samples from the posterior predictive distribution and hence we can compute the empirical CDF (ECDF). The CDF of the standard Uniform distribution is a diagonal line that goes from (0, 0) to (1, 1), as shown in Figure 5.6. Deviations from this line may indicate problems with the model. This plot is very simple to interpret.

pz.Uniform(0, 1).plot_cdf()
Figure 5.6: The CDF of the standard Uniform distribution.

The disadvantage of such a plot is that all the “action” is close to the diagonal line and most of the plot is just blank space. A simple trick to improve the data-ink ratio is to plot the difference between the observed and expected cumulative distribution functions, the \(\Delta\)-ECDF, as shown in Figure 5.7. The last ingredient to improve this visual diagnostic is to add a confidence band. Due to the finite sample size we should expect deviations from uniformity, so a confidence band gives us an idea of how much deviation is expected by chance.
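
The \(\Delta\)-ECDF itself is straightforward to compute. A self-contained sketch with stand-in PIT values (in practice you would use the u-values from the previous sketch):

import numpy as np

rng = np.random.default_rng(793)
pit = rng.uniform(size=200)                   # stand-in for the per-observation u-values
pit_sorted = np.sort(pit)
ecdf = np.arange(1, pit.size + 1) / pit.size  # empirical CDF of the PIT values
delta_ecdf = ecdf - pit_sorted                # deviation from the Uniform(0, 1) CDF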

azp.plot_ppc_pit(idata);
Figure 5.7: Posterior predictive check for the model of heights using marginal Bayesian p-values, also known as u-values.

In ArviZ, we use the simultaneous confidence bands described by Säilynoja, Bürkner, and Vehtari (2022). Simultaneous confidence bands take into account the probability of observing deviations of the entire curve, as opposed to independent pointwise deviations. The band or envelope has an oval shape because the probability of observing a deviation is zero at 0 and 1 (all ECDFs must start at 0 and end at 1) and higher in the middle of the curve.

To build intuition on how to interpret PIT-ECDF plots, we are going to explore four common patterns using synthetic data. The following three plots show these four scenarios; in all of them the observed data follows a standard normal distribution (\(\mu=0, \sigma^2=1\)). In each case, we compare the observed data to predictions where:

  • The mean of the predictions is shifted to the right. The model is overpredicting the data.
  • The mean of the predictions is shifted to the left. The model is underpredicting the data.
  • The predictions have a wider spread. The predictions are too uncertain.
  • The predictions have a narrower spread. The predictions are too certain.

First we show the KDEs of the observed data and the predictions.

observed = pz.Normal(0, 1).rvs(500)

predictions = {}
for i, (mu, sigma) in enumerate([
                                (0.5, 1),  # shifted to the right
                                (-0.5, 1), # shifted to the left
                                (0, 2),    # wider 
                                (0, 0.5),  # narrower
                                ]):
    predictions[f"y{i}"] =  pz.Normal(mu, sigma).rvs((4, 500, 100))

idata_i = azb.from_dict({
    "posterior_predictive":predictions,
    "observed_data": {f"y{i}": observed for i in range(len(predictions))}
})

azp.plot_ppc_dist(idata_i,
                  kind="kde",  
                  plot_kwargs={"remove_axis":False},
                  pc_kwargs={"plot_grid_kws":{"sharey":True}},             
                 );
Figure 5.8: Posterior predictive check with KDEs showing four alternative scenarios.

Then we show the ECDFs of the observed data and the predictions.

azp.plot_ppc_dist(idata_i,
                  kind="ecdf",
                  pc_kwargs={"plot_grid_kws":{"sharey":True}},        
                 );
Figure 5.9: Posterior predictive check with ECDFs showing four alternative scenarios.

Finally, we show the PIT-ECDFs.

azp.plot_ppc_pit(idata_i,
                 plot_kwargs={"ylabel":False},
                 pc_kwargs={"plot_grid_kws":{"sharey":True}},        
                 );
Figure 5.10: Posterior predictive check with PIT-ECDFs showing four alternative scenarios.

Alternatively, we can visualize the coverage of the central posterior credible intervals by setting coverage=True. This allows us to assess whether the credible intervals include the observed values.

  • If the difference is positive, the model is under-confident: the predictions have a wider spread than the data – they are too uncertain.
  • If the difference is negative, the model is over-confident: the predictions have a narrower spread than the data – they are too certain.
azp.plot_ppc_pit(idata_i,
                 coverage=True,
                 plot_kwargs={"ylabel":False},
                 pc_kwargs={"plot_grid_kws":{"sharey":True}},        
                 );
Figure 5.11: Coverage check showing four alternative scenarios.

5.2.3 Avoiding double-dipping

So far we have been using the data twice, first to fit the model and then to evaluate it. This is a common practice in Bayesian data analysis and it is not a problem as long as we are aware of it. The main goal is to understand how well the model is doing at predicting the data, detecting potential problems, and, if possible or desirable, improving the model.

Still, we may want to avoid double-dipping. So instead of computing:

\[ p(\tilde y_i \leq y_i \mid y) \]

We may want to compute:

\[ p(\tilde y_i \leq y_i \mid y_{-i}) \]

where \(y_{-i}\) is the observed data without the \(i\)-th observation.

This is a more stringent test, as we are not using the \(i\)-th observation to compute the posterior predictive distribution. This is known as leave-one-out cross-validation (LOO-CV), and it is a very popular method to assess the predictive performance of a model.

In principle, computing this would be too costly, as we would need to refit the model and compute the posterior predictive distribution \(n\) times, where \(n\) is the number of observations. However, we can use a method called Pareto-smoothed importance sampling (PSIS) to approximate LOO-CV from a single posterior computation. This is a topic we will discuss in more detail in Chapter 7. ArviZ offers many functions based on this method; one of them is plot_loo_pit, which we use below.

azp.plot_loo_pit(idata);
Figure 5.12: Posterior predictive check with LOO-PIT-ECDF.

5.2.4 Hypothetical Outcome Plots

Another strategy that can be useful for posterior predictive plots is to use animations. Rather than showing a continuous probability distribution, Hypothetical Outcome Plots (HOPs) visualize a set of draws from a distribution, where each draw is shown as a new plot, either as small multiples or in animated form. HOPs enable a user to experience uncertainty in terms of countable events, just like we experience probability in our day-to-day lives.

az.plot_ppc supports animations via the option animated=True.
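
For instance (assuming a Matplotlib backend with animation support):

az.plot_ppc(idata, animated=True, num_pp_samples=100);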

You can read more about HOPs here.

5.3 Posterior predictive checks for discrete data

So far we have shown examples with continuous data. Many of the tools can still be used for discrete data; for instance, az.plot_ppc will automatically use histograms instead of KDEs when the data is discrete, with the bins of the histograms centred at integers. Cumulative plots can also be used for discrete data. Still, there are some tools that are more specific to discrete data. In the next sections we discuss posterior predictive checks for count data and binary data.

5.3.1 Posterior predictive checks for count data

Count data is a type of discrete data that is very common in many fields: for instance, the number of iguanas per square meter in a rainforest, the number of bikes in a bike-sharing station, the number of calls to a call center, or the number of emails received. When assessing the fit of a model to count data we need to consider the discreteness of the data and the fact that we usually care about the amount of (over/under-)dispersion.

Rootograms are a graphical tool for assessing the fit of count data models (Tukey 1977; Kleiber and Zeileis 2016). There are a few variations of rootograms, but traditionally they draw the observed counts as bars and the expected (predicted) counts as a line with markers, and instead of plotting the raw counts they show their square roots. Often the uncertainty in the predictions is omitted. The reason for taking square roots is to make it easier to compare observed and expected frequencies, even at low frequencies.

Here we are going to discuss the rootograms presented by Säilynoja et al. (2025). These rootograms emphasise the discreteness of the data and predictions by using points and point-wise credible intervals. And instead of square-rooting the data, they set the y-axis on the square-root scale. This makes the plot easier to interpret, because we can directly read the frequencies from it (instead of reading their square roots) while keeping the advantage of being able to discriminate details at lower frequencies.

To illustrate rootograms we are going to use the horseshoe crab dataset (Brockmann 1996). Very briefly, horseshoe crabs arrive at the beach in pairs for their spawning ritual. Solitary males gather around the nesting couples and vie to fertilize the eggs. These individuals, known as satellite males, often congregate near certain nesting pairs while disregarding others. We used Bambi to create two models, a Poisson model and a hurdle negative binomial model, for the number of male satellites as a function of the carapace width and color of the female.

We are going to omit the modelling details and just load prefitted models.

crabs_poisson = azb.load_arviz_data('crabs_poisson')
crabs_hurdle_nb = azb.load_arviz_data('crabs_hurdle_nb')

Let’s first check the Poisson model. We can see that the overall fit is not that bad, but the zeros are underpredicted and counts 1 to 4 are overpredicted. Most counts from 6 onward are also underpredicted. This pattern is an indication of overdispersion in the data, and the huge difference for 0 indicates an excess of zeros.

pc = azp.plot_ppc_rootogram(crabs_poisson)
pc.viz["satellite"].plot.item().set_xlim(-0.5, 20)
Figure 5.13: Rootogram showing the uncertainty in the predictions for a Poisson model.

Now we will check the fit for the hurdle model. As expected for a hurdle model we get a perfect fit for the zeros. For the positive values, we still get some deviations, but the fit is better than with the Poisson model.

pc = azp.plot_ppc_rootogram(crabs_hurdle_nb)
pc.viz["satellite"].plot.item().set_xlim(-0.5, 20)
Figure 5.14: Rootogram showing the uncertainty in the predictions for a Hurdle Negative Binomial model.

Both models predict more values in the tail than observed, even if with low probability. For both plots, we restrict the x-range to (0, 20).

5.3.2 Posterior predictive checks for binary data

Binary data is a common form of discrete data, often used to represent outcomes like yes/no, success/failure, or 0/1. We may be tempted to assess the fit of a binary model using a bar plot, or a plot similar to the rootogram we showed in the previous section, but this is not a good idea. The reason is that even a very simple model, with one parameter corresponding to the proportion of one class, can perfectly model that proportion, so a bar plot will not show any deviation (Säilynoja et al. 2025).

One solution to this challenge is to use so-called calibration or reliability plots. To create this kind of plot we first bin the predicted probabilities (e.g., [0.0–0.1], [0.1–0.2], …, [0.9–1.0]) and then, for each bin, we compute the fraction of observed positive outcomes. In this way we can compare the predicted probabilities to the observed frequencies. The ideal calibration plot is a diagonal line, where the predicted probabilities are equal to the observed frequencies.
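
A minimal sketch of this binning approach, as a hypothetical helper (p_pred are predicted probabilities and y the observed binary outcomes):

import numpy as np

def binned_calibration(p_pred, y, n_bins=10):
    """Naive reliability curve: mean predicted probability vs. observed
    positive fraction in each bin (empty bins would yield NaNs)."""
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.digitize(p_pred, edges[1:-1])  # bin index (0 to n_bins - 1) per prediction
    mean_pred = np.array([p_pred[idx == i].mean() for i in range(n_bins)])
    frac_obs = np.array([y[idx == i].mean() for i in range(n_bins)])
    return mean_pred, frac_obs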

The problem with this approach is that in practice we don’t have good rules for selecting the bins, and different bins can result in plots that look drastically different (Dimitriadis, Gneiting, and Jordan 2021). An alternative is the method proposed by Dimitriadis, Gneiting, and Jordan (2021). This method uses conditional event probabilities (CEPs), that is, the probability that a certain event occurs given that the classifier has assigned a specific predicted probability. To compute the CEPs, the authors use the pool adjacent violators (PAV) algorithm (Ayer et al. 1955), which provides a way to assign CEPs that are monotonic (i.e. they increase or stay the same, but never decrease) with respect to the model predictions. This monotonicity assumption is reasonable for calibrated models, where higher predicted probabilities should correspond to higher actual event probabilities.
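
The PAV step is equivalent to an isotonic regression of the binary outcomes on the predicted probabilities. A rough sketch using scikit-learn’s IsotonicRegression (which implements PAV), with hypothetical p_pred and y arrays:

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(793)
p_pred = rng.uniform(size=200)  # hypothetical predicted probabilities
y = rng.binomial(1, p_pred)     # hypothetical binary outcomes

order = np.argsort(p_pred)      # sort by predicted probability
cep = IsotonicRegression(y_min=0, y_max=1).fit_transform(p_pred[order], y[order])
# cep are monotone conditional event probabilities; plotting them against
# p_pred[order] gives a point estimate of the PAV-adjusted calibration curve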

Figure 5.15 shows a calibration plot for a dummy logistic regression model. As previously mentioned, the ideal calibration plot is a diagonal line, where the predicted probabilities are equal to the observed frequencies. If the line is above the diagonal, the model is underestimating the probabilities, and if the line is below the diagonal, the model is overestimating the probabilities. The plot also includes the confidence bands for the CEPs. The confidence bands are computed using the method proposed by Dimitriadis, Gneiting, and Jordan (2021).

dt = azb.load_arviz_data('anes')

azp.plot_ppc_pava(dt)
Figure 5.15: PAV-adjusted Calibration plot for a logistic regression model.