arviz.compare

arviz.compare(dataset_dict: Mapping[str, arviz.data.inference_data.InferenceData], ic: Optional[Literal['loo', 'waic']] = None, method: Literal['stacking', 'BB-pseudo-BMA', 'pseudo-BMA'] = 'stacking', b_samples: int = 1000, alpha: float = 1, seed=None, scale: Optional[Literal['log', 'negative_log', 'deviance']] = None, var_name: Optional[str] = None)

Compare models based on PSIS-LOO or WAIC cross-validation.

LOO is leave-one-out (PSIS-LOO) cross-validation and WAIC is the widely applicable information criterion. For background on the theory of model selection, see dx.doi.org/10.1111/1467-9868.00353

Parameters
dataset_dict: dict[str] -> InferenceData

A dictionary of model names and arviz.InferenceData objects

ic: str, optional

Information criterion (PSIS-LOO or WAIC) used to compare models. Defaults to rcParams["stats.information_criterion"].

method: str, optional

Method used to estimate the weights for each model. Available options are:

  • ‘stacking’ : stacking of predictive distributions.

  • ‘BB-pseudo-BMA’ : pseudo-Bayesian Model averaging using Akaike-type weighting. The weights are stabilized using the Bayesian bootstrap.

  • ‘pseudo-BMA’: pseudo-Bayesian Model averaging using Akaike-type weighting, without Bootstrap stabilization (not recommended).

For more information read https://arxiv.org/abs/1704.02030
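The Akaike-type weighting behind both pseudo-BMA variants can be sketched in a few lines of NumPy. This is a simplified illustration of the weighting formula, not ArviZ's internal code, and the elpd values are invented:

```python
import numpy as np

def pseudo_bma_weights(elpds):
    """Akaike-type weights: w_k is proportional to exp(elpd_k).

    Subtracting the maximum before exponentiating keeps the values
    numerically stable; the shift cancels in the normalization.
    """
    elpds = np.asarray(elpds, dtype=float)
    shifted = elpds - elpds.max()
    w = np.exp(shifted)
    return w / w.sum()

# Models with a higher elpd (log-score) receive more weight.
weights = pseudo_bma_weights([-30.69, -30.81])
```

The weights always sum to 1, so they can be read as a normalized measure of relative predictive performance.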

b_samples: int, optional, default = 1000

Number of samples taken by the Bayesian bootstrap estimation. Only useful when method = ‘BB-pseudo-BMA’.

alpha: float, optional

The shape parameter in the Dirichlet distribution used for the Bayesian bootstrap. Only useful when method = ‘BB-pseudo-BMA’. When alpha=1 (default), the distribution is uniform on the simplex. A smaller alpha will keep the final weights further away from 0 and 1.

seed: int or np.random.RandomState instance, optional

If int or RandomState, use it for seeding the Bayesian bootstrap. Only useful when method = ‘BB-pseudo-BMA’. If None (default), the global numpy.random state is used.
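How the Bayesian bootstrap combines b_samples, alpha, and seed can be sketched as follows. This is an illustrative simplification, not ArviZ's implementation, and the pointwise elpd values are made up:

```python
import numpy as np

def bb_replicates(pointwise_elpd, b_samples=1000, alpha=1.0, seed=None):
    """Bayesian-bootstrap replicates of a model's total elpd.

    Each replicate reweights the n pointwise elpd values with a
    Dirichlet(alpha, ..., alpha) draw; alpha=1 is uniform on the
    simplex, smaller alpha gives more variable weights.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(pointwise_elpd, dtype=float)
    n = x.size
    w = rng.dirichlet(np.full(n, alpha), size=b_samples)  # shape (b_samples, n)
    return n * (w @ x)  # rescale by n so replicates are comparable to the plain sum

# Hypothetical pointwise elpd values for 8 observations.
reps = bb_replicates([-3.1, -4.0, -3.7, -4.4, -3.9, -3.5, -4.1, -3.8],
                     b_samples=500, seed=42)
```

The spread of the replicates is what stabilizes the ‘BB-pseudo-BMA’ weights relative to plain ‘pseudo-BMA’.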

scale: str, optional

Output scale for IC. Available options are:

  • log : (default) log-score (after Vehtari et al. (2017))

  • negative_log : -1 * (log-score)

  • deviance : -2 * (log-score)

A higher log-score (or a lower deviance) indicates a model with better predictive accuracy.
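The three scales are fixed linear transformations of the same log-score, so changing scale flips sign and magnitude but never the ranking. A minimal illustration:

```python
log_score = -30.69  # elpd on the log scale (higher is better)

negative_log = -1 * log_score  # negative_log scale, lower is better
deviance = -2 * log_score      # deviance scale, lower is better
```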

var_name: str, optional

If there is more than one observed variable in the InferenceData, the name of the variable that should be used as the basis for comparison.

Returns
A DataFrame, ordered from best to worst model (as measured by the information criterion).
The index reflects the keys with which the models are passed to this function. The columns are:

rank: The rank-order of the models. 0 is the best.

IC: Information Criteria (PSIS-LOO or WAIC).

Higher IC indicates higher out-of-sample predictive fit (“better” model). Defaults to LOO. If scale is deviance or negative_log, smaller IC indicates higher out-of-sample predictive fit (“better” model).

pIC: Estimated effective number of parameters.

dIC: Relative difference between each IC (PSIS-LOO or WAIC) and the lowest IC.

The top-ranked model always has a dIC of 0.

weight: Relative weight for each model.

This can be loosely interpreted as the probability of each model (among the compared models) given the data. By default, the uncertainty in the weight estimation is taken into account using the Bayesian bootstrap.

SE: Standard error of the IC estimate.

If method = BB-pseudo-BMA these values are estimated using the Bayesian bootstrap.

dSE: Standard error of the difference in IC between each model and the top-ranked model.

It is always 0 for the top-ranked model.

warning: A value of 1 indicates that the computation of the IC may not be reliable.

This could be an indication of WAIC/LOO starting to fail; see http://arxiv.org/abs/1507.04544 for details.

scale: Scale used for the IC.
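Because the return value is an ordinary pandas DataFrame, the usual pandas operations apply. The frame below is a mock with a subset of the columns from the LOO example output (values invented for illustration):

```python
import pandas as pd

# Mock of the comparison result (not produced by arviz.compare).
cmp = pd.DataFrame(
    {"rank": [0, 1], "loo": [-30.69, -30.81], "weight": [0.53, 0.47]},
    index=["non centered", "centered"],
)

# The name of the top-ranked model is the row with rank 0.
best = cmp.index[cmp["rank"] == 0][0]
```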

See also

loo

Compute Pareto-smoothed importance sampling leave-one-out cross-validation.

waic

Compute the widely applicable information criterion.

plot_compare

Summary plot for model comparison.

References

[1] Vehtari, A., Gelman, A. & Gabry, J. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat Comput 27, 1413–1432 (2017). https://doi.org/10.1007/s11222-016-9696-4

Examples

Compare the centered and non-centered models of the eight schools problem:

In [1]: import arviz as az
   ...: data1 = az.load_arviz_data("non_centered_eight")
   ...: data2 = az.load_arviz_data("centered_eight")
   ...: compare_dict = {"non centered": data1, "centered": data2}
   ...: az.compare(compare_dict)
   ...: 
Out[1]: 
              rank        loo     p_loo  ...       dse  warning  loo_scale
non centered     0 -30.687290  0.841888  ...  0.000000    False        log
centered         1 -30.810374  0.954053  ...  0.086046    False        log

[2 rows x 9 columns]

Compare the models using LOO-CV, returning the IC on the log scale and computing the weights with the stacking method.

In [2]: az.compare(compare_dict, ic="loo", method="stacking", scale="log")
Out[2]: 
              rank        loo     p_loo  ...       dse  warning  loo_scale
non centered     0 -30.687290  0.841888  ...  0.000000    False        log
centered         1 -30.810374  0.954053  ...  0.086046    False        log

[2 rows x 9 columns]