Data

Inference library converters

ArviZ.from_cmdstanFunction

Convert CmdStan data into an InferenceData object.

Note

This function is forwarded to Python's arviz.from_cmdstan. The docstring of that function is included below.


    For a usage example read the
    :ref:`Creating InferenceData section on from_cmdstan <creating_InferenceData>`

    Parameters
    ----------
    posterior : str or list of str, optional
        List of paths to output.csv files.
    posterior_predictive : str or list of str, optional
        Posterior predictive samples for the fit. If endswith ".csv" assumes file.
    predictions : str or list of str, optional
        Out of sample predictions samples for the fit. If endswith ".csv" assumes file.
    prior : str or list of str, optional
        List of paths to output.csv files
    prior_predictive : str or list of str, optional
        Prior predictive samples for the fit. If endswith ".csv" assumes file.
    observed_data : str, optional
        Observed data used in the sampling. Path to data file in Rdump or JSON format.
    observed_data_var : str or list of str, optional
        Variable(s) used for slicing observed_data. If not defined, all
        data variables are imported.
    constant_data : str, optional
        Constant data used in the sampling. Path to data file in Rdump or JSON format.
    constant_data_var : str or list of str, optional
        Variable(s) used for slicing constant_data. If not defined, all
        data variables are imported.
    predictions_constant_data : str, optional
        Constant data for predictions used in the sampling.
        Path to data file in Rdump or JSON format.
    predictions_constant_data_var : str or list of str, optional
        Variable(s) used for slicing predictions_constant_data.
        If not defined, all data variables are imported.
    log_likelihood : dict of {str: str}, list of str or str, optional
        Pointwise log_likelihood for the data. log_likelihood is extracted from the
        posterior. It is recommended to use this argument as a dictionary whose keys
        are observed variable names and its values are the variables storing log
        likelihood arrays in the Stan code. In other cases, a dictionary with keys
        equal to its values is used. By default, if a variable ``log_lik`` is
        present in the Stan model, it will be retrieved as pointwise log
        likelihood values. Use ``False`` to avoid this behaviour.
    index_origin : int, optional
        Starting value of integer coordinate values. Defaults to the value in rcParam
        ``data.index_origin``.
    coords : dict of {str: array_like}, optional
        A dictionary containing the values that are used as index. The key
        is the name of the dimension, the values are the index values.
    dims : dict of {str: list of str}, optional
        A mapping from variables to a list of coordinate names for the variable.
    disable_glob : bool
        Don't use glob for string input. This means that all string input is
        assumed to be variable names (samples) or a path (data).
    save_warmup : bool
        Save warmup iterations into InferenceData object, if found in the input files.
        If not defined, use default defined by the rcParams.
    dtypes : dict or str
        A dictionary containing dtype information (int, float) for parameters.
        If input is a string, it is assumed to be a model code or path to model code file.

    Returns
    -------
    InferenceData object
    
source
ArviZ.from_mcmcchainsFunction
from_mcmcchains(posterior::MCMCChains.Chains; kwargs...) -> InferenceData
from_mcmcchains(; kwargs...) -> InferenceData
from_mcmcchains(
    posterior::MCMCChains.Chains,
    posterior_predictive,
    predictions,
    log_likelihood;
    kwargs...
) -> InferenceData

Convert data in an MCMCChains.Chains format into an InferenceData.

Any keyword argument below without an an explicitly annotated type above is allowed, so long as it can be passed to convert_to_inference_data.

Arguments

  • posterior::MCMCChains.Chains: Draws from the posterior

Keywords

  • posterior_predictive::Any=nothing: Draws from the posterior predictive distribution or name(s) of predictive variables in posterior
  • predictions: Out-of-sample predictions for the posterior.
  • prior: Draws from the prior
  • prior_predictive: Draws from the prior predictive distribution or name(s) of predictive variables in prior
  • observed_data: Observed data on which the posterior is conditional. It should only contain data which is modeled as a random variable. Keys are parameter names and values.
  • constant_data: Model constants, data included in the model that are not modeled as random variables. Keys are parameter names.
  • predictions_constant_data: Constants relevant to the model predictions (i.e. new x values in a linear regression).
  • log_likelihood: Pointwise log-likelihood for the data. It is recommended to use this argument as a named tuple whose keys are observed variable names and whose values are log likelihood arrays. Alternatively, provide the name of variable in posterior containing log likelihoods.
  • library=MCMCChains: Name of library that generated the chains
  • coords: Map from named dimension to named indices
  • dims: Map from variable name to names of its dimensions
  • eltypes: Map from variable names to eltypes. This is primarily used to assign discrete eltypes to discrete variables that were stored in Chains as floats.

Returns

  • InferenceData: The data with groups corresponding to the provided data
source
ArviZ.from_samplechainsFunction
from_samplechains(
    posterior=nothing;
    prior=nothing,
    library=SampleChains,
    kwargs...,
) -> InferenceData

Convert SampleChains samples to an InferenceData.

Either posterior or prior may be a SampleChains.AbstractChain or SampleChains.MultiChain object.

For descriptions of remaining kwargs, see from_namedtuple.

source

IO / Conversion

ArviZ.from_jsonFunction

Initialize object from a json file.

Note

This function is forwarded to Python's arviz.from_json. The docstring of that function is included below.


    Will use the faster `ujson` (https://github.com/ultrajson/ultrajson) if it is available.

    Parameters
    ----------
    filename : str
        location of json file

    Returns
    -------
    InferenceData object
    
source
InferenceObjectsNetCDF.from_netcdfFunction
from_netcdf(path::AbstractString; kwargs...) -> InferenceData

Load an InferenceData from an unopened NetCDF file.

Remaining kwargs are passed to NCDatasets.NCDataset. This method loads data eagerly. To instead load data lazily, pass an opened NCDataset to from_netcdf.

Examples

julia> idata = from_netcdf("centered_eight.nc")
InferenceData with groups:
  > posterior
  > posterior_predictive
  > sample_stats
  > prior
  > observed_data
from_netcdf(ds::NCDatasets.NCDataset; load_mode) -> InferenceData

Load an InferenceData from an opened NetCDF file.

load_mode defaults to :lazy, which avoids reading variables into memory. Operations on these arrays will be slow. load_mode can also be :eager, which copies all variables into memory. It is then safe to close ds. If load_mode is :lazy and ds is closed after constructing InferenceData, using the variable arrays will have undefined behavior.

Examples

Here is how we might load an InferenceData from an InferenceData lazily from a web-hosted NetCDF file.

julia> using HTTP, NCDatasets

julia> resp = HTTP.get("https://github.com/arviz-devs/arviz_example_data/blob/main/data/centered_eight.nc?raw=true");

julia> ds = NCDataset("centered_eight", "r"; memory = resp.body);

julia> idata = from_netcdf(ds)
InferenceData with groups:
  > posterior
  > posterior_predictive
  > sample_stats
  > prior
  > observed_data

julia> idata_copy = copy(idata); # disconnect from the loaded dataset

julia> close(ds);
InferenceObjectsNetCDF.to_netcdfFunction
to_netcdf(data, dest::AbstractString; group::Symbol=:posterior, kwargs...)
to_netcdf(data, dest::NCDatasets.NCDataset; group::Symbol=:posterior)

Write data to a NetCDF file.

data is any type that can be converted to an InferenceData using convert_to_inference_data. If not an InferenceData, then group specifies which group the data represents.

dest specifies either the path to the NetCDF file or an opened NetCDF file. If dest is a path, remaining kwargs are passed to NCDatasets.NCDataset.

Examples

julia> using NCDatasets

julia> idata = from_namedtuple((; x = randn(4, 100, 3), z = randn(4, 100)))
InferenceData with groups:
  > posterior

julia> to_netcdf(idata, "data.nc")
"data.nc"

General functions

ArviZ.concatFunction

Concatenate InferenceData objects.

Note

This function is forwarded to Python's arviz.concat. The docstring of that function is included below.


    Concatenates over `group`, `chain` or `draw`.
    By default concatenates over unique groups.
    To concatenate over `chain` or `draw` function
    needs identical groups and variables.

    The `variables` in the `data` -group are merged if `dim` are not found.


    Parameters
    ----------
    *args : InferenceData
        Variable length InferenceData list or
        Sequence of InferenceData.
    dim : str, optional
        If defined, concatenated over the defined dimension.
        Dimension which is concatenated. If None, concatenates over
        unique groups.
    copy : bool
        If True, groups are copied to the new InferenceData object.
        Used only if `dim` is None.
    inplace : bool
        If True, merge args to first object.
    reset_dim : bool
        Valid only if dim is not None.

    Returns
    -------
    InferenceData
        A new InferenceData object by default.
        When `inplace==True` merge args to first arg and return `None`

    See Also
    --------
    add_groups : Add new groups to InferenceData object.
    extend : Extend InferenceData with groups from another InferenceData.

    Examples
    --------
    Use ``concat`` method to concatenate InferenceData objects. This will concatenates over
    unique groups by default. We first create an ``InferenceData`` object:

    .. ipython::

        In [1]: import arviz as az
           ...: import numpy as np
           ...: data = {
           ...:     "a": np.random.normal(size=(4, 100, 3)),
           ...:     "b": np.random.normal(size=(4, 100)),
           ...: }
           ...: coords = {"a_dim": ["x", "y", "z"]}
           ...: dataA = az.from_dict(data, coords=coords, dims={"a": ["a_dim"]})
           ...: dataA

    We have created an ``InferenceData`` object with default group 'posterior'. Now, we will
    create another ``InferenceData`` object:

    .. ipython::

        In [1]: dataB = az.from_dict(prior=data, coords=coords, dims={"a": ["a_dim"]})
           ...: dataB

    We have created another ``InferenceData`` object with group 'prior'. Now, we will concatenate
    these two ``InferenceData`` objects:

    .. ipython::

        In [1]: az.concat(dataA, dataB)

    Now, we will concatenate over chain (or draw). It requires identical groups and variables.
    Here we are concatenating two identical ``InferenceData`` objects over dimension chain:

    .. ipython::

        In [1]: az.concat(dataA, dataA, dim="chain")

    It will create an ``InferenceData`` with the original group 'posterior'. In similar way,
    we can also concatenate over draws.

    
source
ArviZ.extractFunction

Extract an InferenceData group or subset of it.

Note

This function is forwarded to Python's arviz.extract. The docstring of that function is included below.


    Parameters
    ----------
    idata : InferenceData or InferenceData_like
        InferenceData from which to extract the data.
    group : str, optional
        Which InferenceData data group to extract data from.
    combined : bool, optional
        Combine ``chain`` and ``draw`` dimensions into ``sample``. Won't work if
        a dimension named ``sample`` already exists.
    var_names : str or list of str, optional
        Variables to be extracted. Prefix the variables by `~` when you want to exclude them.
    filter_vars: {None, "like", "regex"}, optional
        If `None` (default), interpret var_names as the real variables names. If "like",
        interpret var_names as substrings of the real variables names. If "regex",
        interpret var_names as regular expressions on the real variables names. A la
        `pandas.filter`.
        Like with plotting, sometimes it's easier to subset saying what to exclude
        instead of what to include
    num_samples : int, optional
        Extract only a subset of the samples. Only valid if ``combined=True``
    keep_dataset : bool, optional
        If true, always return a DataSet. If false (default) return a DataArray
        when there is a single variable.
    rng : bool, int, numpy.Generator, optional
        Shuffle the samples, only valid if ``combined=True``. By default,
        samples are shuffled if ``num_samples`` is not ``None``, and are left
        in the same order otherwise. This ensures that subsetting the samples doesn't return
        only samples from a single chain and consecutive draws.

    Returns
    -------
    xarray.DataArray or xarray.Dataset

    Examples
    --------
    The default behaviour is to return the posterior group after stacking the chain and
    draw dimensions.

    .. jupyter-execute::

        import arviz as az
        idata = az.load_arviz_data("centered_eight")
        az.extract(idata)

    You can also indicate a subset to be returned, but in variables and in samples:

    .. jupyter-execute::

        az.extract(idata, var_names="theta", num_samples=100)

    To keep the chain and draw dimensions, use ``combined=False``.

    .. jupyter-execute::

        az.extract(idata, group="prior", combined=False)

    
source

Example data

ArviZExampleData.describe_example_dataFunction
describe_example_data() -> String

Return a string containing descriptions of all available datasets.

Examples

julia> describe_example_data("radon") |> println
radon
=====

Radon is a radioactive gas that enters homes through contact points with the ground. It is a carcinogen that is the primary cause of lung cancer in non-smokers. Radon levels vary greatly from household to household.

This example uses an EPA study of radon levels in houses in Minnesota to construct a model with a hierarchy over households within a county. The model includes estimates (gamma) for contextual effects of the uranium per household.

See Gelman and Hill (2006) for details on the example, or https://docs.pymc.io/notebooks/multilevel_modeling.html by Chris Fonnesbeck for details on this implementation.

remote: http://ndownloader.figshare.com/files/24067472
ArviZExampleData.load_example_dataFunction
load_example_data(name; kwargs...) -> InferenceObjects.InferenceData
load_example_data() -> Dict{String,AbstractFileMetadata}

Load a local or remote pre-made dataset.

kwargs are forwarded to InferenceObjects.from_netcdf.

Pass no parameters to get a Dict listing all available datasets.

Data files are handled by DataDeps.jl. A file is downloaded only when it is requested and then cached for future use.

Examples

julia> keys(load_example_data())
KeySet for a OrderedCollections.OrderedDict{String, ArviZExampleData.AbstractFileMetadata} with 9 entries. Keys:
  "centered_eight"
  "non_centered_eight"
  "radon"
  "rugby"
  "regression1d"
  "regression10d"
  "classification1d"
  "classification10d"
  "glycan_torsion_angles"

julia> load_example_data("centered_eight")
InferenceData with groups:
  > posterior
  > posterior_predictive
  > log_likelihood
  > sample_stats
  > prior
  > prior_predictive
  > observed_data
  > constant_data