While ArviZ supports plotting from familiar datatypes, such as dictionaries and NumPy arrays, there are a couple of data structures central to ArviZ that are useful to know when using the library.

They are:

* xarray.Dataset
* arviz.InferenceData
* NetCDF

Bayesian inference generates numerous datasets that represent different aspects of the model. For example, in a single analysis a Bayesian practitioner could end up with any of the following data:

* Prior distribution for N number of variables
* Posterior distribution for N number of variables
* Prior predictive distribution
* Posterior predictive distribution
* Trace data for each of the above
* Sample statistics for each inference run
* Whatever else

Data from probabilistic programming is naturally high dimensional. To
add to the complexity ArviZ must handle the data generated from multiple
Bayesian Modeling libraries, such as pymc3 and pystan. This is an
application that the *xarray* package handles quite well. The xarray
package lets users manage high dimensional data with human readable
dimensions and coordinates quite easily.
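As a sketch of what this looks like, a small `xarray.Dataset` with human readable dimensions and coordinates can be built directly from NumPy arrays. The variable name, dimension sizes, and labels below are hypothetical, chosen only to mirror the shape of MCMC output:

```python
import numpy as np
import xarray as xr

# Hypothetical MCMC output: 2 chains, 100 draws, 3 parameters
rng = np.random.default_rng(0)
draws = rng.normal(size=(2, 100, 3))

ds = xr.Dataset(
    # data variable "theta" indexed by named dimensions
    {"theta": (("chain", "draw", "param"), draws)},
    # coordinates label points along each dimension
    coords={
        "chain": [0, 1],
        "draw": np.arange(100),
        "param": ["a", "b", "c"],
    },
)

# Named dimensions make slicing readable:
# all draws of parameter "b" from chain 0
subset = ds["theta"].sel(chain=0, param="b")
```

Selecting by label (`sel(param="b")`) rather than by integer position is what makes high dimensional posterior data manageable.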

Although seemingly more complex at first glance, the ArviZ developers believe that using *xarray*, *InferenceData*, and *NetCDF* will simplify the handling, referencing, and serialization of data generated by MCMC runs.

To help get familiar with each, ArviZ includes some toy datasets. To start, an `az.InferenceData` sample can be loaded from disk.

```
In [1]:
```

```
# Load the centered eight schools model
import arviz as az
data = az.load_arviz_data('centered_eight')
data
```

```
Out[1]:
```

```
Inference data with groups:
> posterior
> sample_stats
> posterior_predictive
> prior
> observed_data
```

In this case the `az.InferenceData` object contains both a posterior predictive distribution and the observed data, among other datasets. Each group in `InferenceData` is both an attribute on the `InferenceData` object and itself an `xarray.Dataset` object.

```
In [2]:
```

```
# Get the posterior Dataset
posterior = data.posterior
posterior
```

```
Out[2]:
```

```
<xarray.Dataset>
Dimensions: (chain: 4, draw: 500, school: 8)
Coordinates:
* chain (chain) int64 0 1 2 3
* draw (draw) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
* school (school) object 'Choate' 'Deerfield' 'Phillips Andover' ...
Data variables:
mu (chain, draw) float64 ...
theta (chain, draw, school) float64 ...
tau (chain, draw) float64 ...
```

In our eight schools example the posterior trace consists of 3 variables, estimated over 4 chains. In addition, this is a hierarchical model in which values of the variable `theta` are associated with a particular school.

In xarray’s terminology, data variables are the actual values generated from the MCMC draws, dimensions are the axes on which the data variables are indexed, and coordinates are labels that point to specific slices or points in the `xarray.Dataset`.
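To make these terms concrete, the coordinate labels shown above can be used to index the posterior. This is a minimal sketch, assuming ArviZ and its example data are available:

```python
import arviz as az

data = az.load_arviz_data("centered_eight")
posterior = data.posterior

# "theta" is a data variable; "chain", "draw", and "school"
# are the dimensions on which it is indexed
theta = posterior["theta"]

# Coordinates let us select by label rather than integer position
choate = theta.sel(school="Choate")   # dims: (chain, draw)

# Integer-position indexing along a dimension is also available
first_draws = theta.isel(draw=0)      # dims: (chain, school)
```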

Observed data from the eight schools model can be accessed in the same way.

```
In [3]:
```

```
# Get the observed xarray
observed_data = data.observed_data
observed_data
```

```
Out[3]:
```

```
<xarray.Dataset>
Dimensions: (school: 8)
Coordinates:
* school (school) object 'Choate' 'Deerfield' 'Phillips Andover' ...
Data variables:
obs (school) float64 ...
```

It should be noted that the observed dataset contains only 8 data points and, unlike the posterior, has no chain and draw dimensions or coordinates. This difference in sizes is the motivating reason behind *InferenceData*: rather than force multiple different-sized arrays into one array, or force users to manage multiple objects corresponding to different datasets, it is easier to hold references to each *xarray.Dataset* in an *InferenceData* object.
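For data generated outside a supported library, `az.from_dict` assembles an *InferenceData* with the appropriate groups from plain NumPy arrays. The model below is hypothetical, with made-up array shapes, purely to show how differently sized groups coexist in one object:

```python
import numpy as np
import arviz as az

rng = np.random.default_rng(0)

# Hypothetical posterior draws (2 chains x 50 draws) and 8 observations
idata = az.from_dict(
    posterior={"mu": rng.normal(size=(2, 50))},
    observed_data={"obs": rng.normal(size=8)},
)

# Each group is its own xarray.Dataset: the posterior gets
# chain and draw dimensions, while observed_data does not
```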

NetCDF is a standard for storing array-oriented files. In other words, while *xarray.Dataset* objects, and by extension *InferenceData*, are convenient for accessing arrays in Python memory, *NetCDF* provides a convenient mechanism for persisting model data on disk. In fact, the NetCDF dataset was the inspiration behind *InferenceData*, as NetCDF4 supports the concept of groups. *InferenceData* merely wraps `xarray.Dataset` with the same functionality.

Most users will not have to concern themselves with the *NetCDF* standard, but for completeness it is good to make its usage transparent. It is also worth noting that the NetCDF4 file standard is interoperable with HDF5, which may be familiar from other contexts.

Earlier in this tutorial, *InferenceData* was loaded from a *NetCDF* file:

```
In [4]:
```

```
data = az.load_arviz_data('centered_eight')
```

Similarly, *InferenceData* objects can be persisted to disk in the NetCDF format:

```
In [5]:
```

```
data.to_netcdf("eight_schools_model.nc")
```

```
Out[5]:
```

```
'eight_schools_model.nc'
```
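A file written this way can be read back with `az.from_netcdf`, recovering the same groups. A sketch of the round trip, assuming write access to the working directory:

```python
import arviz as az

data = az.load_arviz_data("centered_eight")

# Persist all groups to a single NetCDF file on disk
data.to_netcdf("eight_schools_model.nc")

# Reload: groups such as posterior and observed_data are preserved
reloaded = az.from_netcdf("eight_schools_model.nc")
```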

Additional documentation and tutorials exist for both xarray and NetCDF4.