Inference Data Cookbook

InferenceData is the central data format for ArviZ. InferenceData itself is just a container that maintains references to one or more xarray.Dataset objects. Below are various ways to generate an InferenceData object. See the xarray documentation for more on xarray.
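
Since each group is itself an xarray.Dataset, groups can be read off an InferenceData object as plain attributes. A minimal sketch (assuming the usual interpretation of a 2d array as (chain, draw)):

import arviz as az
import numpy as np

idata = az.convert_to_inference_data(np.random.randn(4, 100))  # 4 chains, 100 draws
print(idata)                  # lists the groups the object carries
print(type(idata.posterior))  # each group is an xarray.Dataset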

[1]:
import arviz as az
import numpy as np

From 1d numpy array

[2]:
size = 100
dataset = az.convert_to_inference_data(np.random.randn(size))
print(dataset)
dataset.posterior
Inference data with groups:
        > posterior
[2]:
<xarray.Dataset>
Dimensions:  (chain: 1, draw: 100)
Coordinates:
  * chain    (chain) int64 0
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
Data variables:
    x        (chain, draw) float64 0.293 -0.5543 -0.6883 ... -0.1959 0.02409
Attributes:
    created_at:  2018-12-08T16:18:45.917214
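
A 1d array is treated as draws from a single chain, so chain and draw dimensions are added automatically and the lone variable picks up the default name x. Label-based selection then works as for any xarray.Dataset; a quick sketch using the dataset built above:

# xarray label slices are inclusive: draws 0 through 9 of the single chain
first_ten = dataset.posterior['x'].sel(draw=slice(0, 9))
print(first_ten.shape)  # (1, 10)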

From nd numpy array

[3]:
shape = (1, 2, 3, 4, 5)
dataset = az.convert_to_inference_data(np.random.randn(*shape))
print(dataset)
dataset.posterior
Inference data with groups:
        > posterior
[3]:
<xarray.Dataset>
Dimensions:  (chain: 1, draw: 2, x_dim_0: 3, x_dim_1: 4, x_dim_2: 5)
Coordinates:
  * chain    (chain) int64 0
  * draw     (draw) int64 0 1
  * x_dim_0  (x_dim_0) int64 0 1 2
  * x_dim_1  (x_dim_1) int64 0 1 2 3
  * x_dim_2  (x_dim_2) int64 0 1 2 3 4
Data variables:
    x        (chain, draw, x_dim_0, x_dim_1, x_dim_2) float64 0.007123 ... 0.2306
Attributes:
    created_at:  2018-12-08T16:18:45.933243
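
Here the first two axes are read as (chain, draw) and every remaining axis gets an auto-generated dimension name. A quick check, again on the dataset built above:

# trailing axes become x_dim_0, x_dim_1, x_dim_2 on the default variable x
print(dataset.posterior['x'].dims)
# ('chain', 'draw', 'x_dim_0', 'x_dim_1', 'x_dim_2')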

From a dictionary

[4]:
datadict = {
    'a': np.random.randn(100),
    'b': np.random.randn(1, 100, 10),
    'c': np.random.randn(1, 100, 3, 4),
}
dataset = az.convert_to_inference_data(datadict)
print(dataset)
dataset.posterior
Inference data with groups:
        > posterior
[4]:
<xarray.Dataset>
Dimensions:  (b_dim_0: 10, c_dim_0: 3, c_dim_1: 4, chain: 1, draw: 100)
Coordinates:
  * chain    (chain) int64 0
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
  * b_dim_0  (b_dim_0) int64 0 1 2 3 4 5 6 7 8 9
  * c_dim_0  (c_dim_0) int64 0 1 2
  * c_dim_1  (c_dim_1) int64 0 1 2 3
Data variables:
    a        (chain, draw) float64 -1.483 -1.244 -0.2315 ... -1.501 0.4877
    b        (chain, draw, b_dim_0) float64 2.497 1.039 1.755 ... -2.279 0.3407
    c        (chain, draw, c_dim_0, c_dim_1) float64 -0.4251 0.0699 ... 1.298
Attributes:
    created_at:  2018-12-08T16:18:45.954741
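
Each dictionary value is converted with the same rules as the array examples above: the 1d array a is promoted to a single chain, while b and c already carry an explicit leading chain axis of length 1. All variables therefore share the (chain, draw) dimensions; a sketch:

post = dataset.posterior
print(post['a'].dims)  # ('chain', 'draw')
print(post['c'].dims)  # ('chain', 'draw', 'c_dim_0', 'c_dim_1')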

From dictionary with coords and dims

[5]:
datadict = {
    'a': np.random.randn(100),
    'b': np.random.randn(1, 100, 10),
    'c': np.random.randn(1, 100, 3, 4),
}
coords = {'c1': np.arange(3), 'c2': np.arange(4), 'b1': np.arange(10)}
dims = {'b': ['b1'], 'c': ['c1', 'c2']}

dataset = az.convert_to_inference_data(datadict, coords=coords, dims=dims)
print(dataset)
dataset.posterior
Inference data with groups:
        > posterior
[5]:
<xarray.Dataset>
Dimensions:  (b1: 10, c1: 3, c2: 4, chain: 1, draw: 100)
Coordinates:
  * chain    (chain) int64 0
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
  * b1       (b1) int64 0 1 2 3 4 5 6 7 8 9
  * c1       (c1) int64 0 1 2
  * c2       (c2) int64 0 1 2 3
Data variables:
    a        (chain, draw) float64 -2.048 -0.3983 1.1 ... -0.2418 0.3686 -0.1518
    b        (chain, draw, b1) float64 -0.02054 -3.143 ... -0.9643 -0.4444
    c        (chain, draw, c1, c2) float64 2.43 -0.439 ... -0.3713 -0.5101
Attributes:
    created_at:  2018-12-08T16:18:45.992123
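
Supplying coords and dims replaces the auto-generated dimension names, which makes label-based indexing readable. A small sketch on the dataset above:

# select one slice of b along its named dimension
print(dataset.posterior['b'].sel(b1=3).dims)  # ('chain', 'draw')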

From pymc3

[6]:
import pymc3 as pm
draws = 500
chains = 2

eight_school_data = {'J': 8,
                     'y': np.array([28., 8., -3., 7., -1., 1., 18., 12.]),
                     'sigma': np.array([15., 10., 16., 11., 9., 11., 10., 18.])
                    }

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=5)
    tau = pm.HalfCauchy('tau', beta=5)
    theta_tilde = pm.Normal('theta_tilde', mu=0, sd=1, shape=eight_school_data['J'])
    theta = pm.Deterministic('theta', mu + tau * theta_tilde)
    pm.Normal('obs', mu=theta, sd=eight_school_data['sigma'], observed=eight_school_data['y'])

    trace = pm.sample(draws, chains=chains)
    prior = pm.sample_prior_predictive()
    posterior_predictive = pm.sample_posterior_predictive(trace, samples=500, model=model)

    data = az.from_pymc3(
        trace=trace,
        prior=prior,
        posterior_predictive=posterior_predictive,
        coords={'school': np.arange(eight_school_data['J'])},
        dims={'theta': ['school'], 'theta_tilde': ['school']},
    )
data
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 4 jobs)
NUTS: [theta_tilde, tau, mu]
Sampling 2 chains: 100%|██████████| 2000/2000 [00:01<00:00, 1407.23draws/s]
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
100%|██████████| 500/500 [00:00<00:00, 1951.70it/s]
[6]:
Inference data with groups:
    > posterior
    > sample_stats
    > posterior_predictive
    > prior
    > observed_data
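
With the posterior, prior, predictive and observed data on a single object, ArviZ functions can be called on it directly; for example, a trace plot of the hyperparameters (a sketch using the data object built above):

az.plot_trace(data, var_names=['mu', 'tau'])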

From pystan

[7]:
import pystan

schools_code = '''
        data {
            int<lower=0> J;
            real y[J];
            real<lower=0> sigma[J];
        }

        parameters {
            real mu;
            real<lower=0> tau;
            real theta_tilde[J];
        }

        transformed parameters {
            real theta[J];
            for (j in 1:J)
                theta[j] = mu + tau * theta_tilde[j];
        }

        model {
            mu ~ normal(0, 5);
            tau ~ cauchy(0, 5);
            theta_tilde ~ normal(0, 1);
            y ~ normal(theta, sigma);
        }

        generated quantities {
            vector[J] log_lik;
            vector[J] y_hat;
            for (j in 1:J) {
                log_lik[j] = normal_lpdf(y[j] | theta[j], sigma[j]);
                y_hat[j] = normal_rng(theta[j], sigma[j]);
            }
        }
    '''
stan_model = pystan.StanModel(model_code=schools_code)
fit = stan_model.sampling(data=eight_school_data,
                          iter=draws,
                          warmup=0,
                          chains=chains)

data = az.from_pystan(posterior=fit,
                      posterior_predictive='y_hat',
                      observed_data=['y'],
                      log_likelihood='log_lik',
                      coords={'school': np.arange(eight_school_data['J'])},
                      dims={'theta': ['school'],
                            'y': ['school'],
                            'log_lik': ['school'],
                            'y_hat': ['school'],
                            'theta_tilde': ['school']})

data
INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_e7aeb8b836685a923269e6171e7377cd NOW.
WARNING:pystan:Rhat for parameter mu is 1.1939833216977542!
WARNING:pystan:Rhat for parameter tau is 1.1755437966827746!
WARNING:pystan:Rhat for parameter theta[1] is 1.1312100863716745!
WARNING:pystan:Rhat for parameter theta[2] is 1.1005746998099943!
WARNING:pystan:Rhat for parameter log_lik[1] is 1.1646655752781883!
WARNING:pystan:Rhat for parameter log_lik[2] is 1.1564767449269702!
WARNING:pystan:Rhat above 1.1 or below 0.9 indicates that the chains very likely have not mixed
WARNING:pystan:72 of 1000 iterations ended with a divergence (7.2%).
WARNING:pystan:Try running with adapt_delta larger than 0.8 to remove the divergences.
[7]:
Inference data with groups:
    > posterior
    > sample_stats
    > posterior_predictive
    > observed_data
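
Because the log_lik variable was registered as log_likelihood, functions that need pointwise log likelihood values can use it directly (a sketch; see the ArviZ API for details):

# PSIS leave-one-out cross-validation from the stored log likelihood
az.loo(data)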

From pyro

[8]:
import torch

import pyro
import pyro.distributions as dist
import pyro.poutine as poutine
from pyro.infer.mcmc import MCMC, NUTS

pyro.enable_validation(True)
pyro.set_rng_seed(0)

draws = 1000
warmup_steps = 0
eight_school_data = {'J': 8,
                     'y': torch.tensor([28, 8, -3, 7, -1, 1, 18, 12]).type(torch.Tensor),
                     'sigma': torch.tensor([15, 10, 16, 11, 9, 11, 10, 18]).type(torch.Tensor)
                    }


def model(sigma):
    eta = pyro.sample('eta', dist.Normal(torch.zeros(eight_school_data['J']), torch.ones(eight_school_data['J'])))
    mu = pyro.sample('mu', dist.Normal(torch.zeros(1), 10 * torch.ones(1)))
    tau = pyro.sample('tau', dist.HalfCauchy(scale=25 * torch.ones(1)))

    theta = mu + tau * eta

    return pyro.sample("obs", dist.Normal(theta, sigma))


def conditioned_model(model, sigma, y):
    return poutine.condition(model, data={"obs": y})(sigma)



nuts_kernel = NUTS(conditioned_model, adapt_step_size=True)
posterior = MCMC(nuts_kernel,
                 num_samples=draws,
                 warmup_steps=warmup_steps).run(model, eight_school_data['sigma'], eight_school_data['y'])

pyro_data = az.from_pyro(posterior)
pyro_data
INFO:pyro.infer.mcmc.mcmc:Starting MCMC using kernel - NUTS ...
INFO:pyro.infer.mcmc.mcmc:Iteration: 50 [SAMPLE]
INFO:pyro.infer.mcmc.mcmc:Step size: 0.125390    Acceptance rate: 0.860000
...
INFO:pyro.infer.mcmc.mcmc:Iteration: 1000 [SAMPLE]
INFO:pyro.infer.mcmc.mcmc:Step size: 0.372683    Acceptance rate: 0.950000
[8]:
Inference data with groups:
    > posterior
    > observed_data
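
Only the posterior and observed data are extracted here, but that is already enough for the core diagnostics; a sketch:

# tabulate posterior means, credible intervals and convergence diagnostics
az.summary(pyro_data)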

From emcee

[9]:
import emcee

eight_school_data = {'J': 8,
                     'y': np.array([28., 8., -3., 7., -1., 1., 18., 12.]),
                     'sigma': np.array([15., 10., 16., 11., 9., 11., 10., 18.])
                    }

def log_prior_8school(theta, J):
    mu = theta[0]
    tau = theta[1]
    eta = theta[2:]
    # half-Cauchy prior on tau, half width at half maximum = 25
    if tau < 0:
        return -np.inf
    hwhm = 25
    prior_tau = -np.log(tau**2 + hwhm**2)
    prior_mu = -(mu / 10)**2  # normal prior, loc=0, scale=10
    prior_eta = -np.sum(eta**2)  # normal prior, loc=0, scale=1
    return prior_mu + prior_tau + prior_eta

def log_likelihood_8school(theta, y, sigma):
    mu = theta[0]
    tau = theta[1]
    eta = theta[2:]
    return -np.sum(((mu + tau * eta - y) / sigma)**2)

def lnprob_8school(theta, J, y, sigma):
    prior = log_prior_8school(theta, J)
    if not np.isfinite(prior):
        return -np.inf
    like = log_likelihood_8school(theta, y, sigma)
    return like + prior

nwalkers = 40
ndim = eight_school_data['J'] + 2
draws = 1500
pos = np.random.normal(size=(nwalkers, ndim))
pos[:, 1] = np.absolute(pos[:, 1])
sampler = emcee.EnsembleSampler(nwalkers,
                                ndim,
                                lnprob_8school,
                                args=(eight_school_data['J'],
                                      eight_school_data['y'],
                                      eight_school_data['sigma']))
sampler.run_mcmc(pos, draws)

# define variable names; they cannot be inferred from the emcee sampler
var_names = ['mu', 'tau'] + ['eta{}'.format(i) for i in range(eight_school_data['J'])]
emcee_data = az.from_emcee(sampler, var_names=var_names)
emcee_data
[9]:
Inference data with groups:
    > posterior
    > observed_data
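
Since var_names maps emcee's flat parameter vector onto named variables, the usual plotting functions work on the result (a sketch):

az.plot_posterior(emcee_data, var_names=['mu', 'tau'])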

From cmdstan

See from_cmdstan for details. Cookbook documentation coming soon.
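
In the meantime, a minimal sketch of the call (the CSV paths are hypothetical stand-ins for files written by CmdStan's sample method):

# combine per-chain CmdStan CSV output into one InferenceData object
data = az.from_cmdstan(posterior=['output_1.csv', 'output_2.csv'])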