ArviZ.jl Quickstart

Note

This tutorial is adapted from ArviZ's quickstart.

using ArviZ
using PyPlot

# ArviZ ships with style sheets!
ArviZ.use_style("arviz-darkgrid")

Get started with plotting

ArviZ.jl is designed to be used with libraries like CmdStan, Turing.jl, and Soss.jl, but it also works fine with raw arrays.

using Random

rng = Random.MersenneTwister(37772)
plot_posterior(randn(rng, 100_000));
gcf()

When plotting a dictionary of arrays, ArviZ.jl interprets each key as the name of a different random variable. Each row of an array is treated as an independent series of draws from the variable, called a chain. Below, we have 10 chains of 50 draws each for four different distributions.

using Distributions

s = (10, 50)
plot_forest(
    Dict(
        "normal" => randn(rng, s),
        "gumbel" => rand(rng, Gumbel(), s),
        "student t" => rand(rng, TDist(6), s),
        "exponential" => rand(rng, Exponential(), s),
    ),
);
gcf()
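As a minimal sketch of the row-per-chain layout described above, assuming only Base Julia and the Random standard library: if draws are stored as one vector per chain (the name `per_chain` is hypothetical), they can be stacked so that each row holds one chain.

```julia
using Random

rng = Random.MersenneTwister(0)
nchains, ndraws = 10, 50

# Hypothetical input: one vector of draws per chain.
per_chain = [randn(rng, ndraws) for _ in 1:nchains]

# `reduce(hcat, ...)` puts each chain in a column, giving size (ndraws, nchains);
# `permutedims` then swaps the axes so that each row is one chain.
draws = permutedims(reduce(hcat, per_chain))

size(draws)  # (10, 50)
```

A matrix shaped this way matches the `(10, 50)` arrays used as dictionary values in the forest plot above.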

Plotting with MCMCChains.jl's `Chains` objects produced by Turing.jl

ArviZ is designed to work well with high-dimensional, labelled data. Consider the eight schools model, which roughly tries to measure the effectiveness of SAT classes at eight different schools. To show off ArviZ's labelling, the schools here are given the names of eight different real schools.

This model is small enough to write down, is hierarchical, and uses labelling. Additionally, a centered parameterization causes divergences (which are interesting for illustration).
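Written down, the centered model implemented in the Turing code below reads as follows, where $j$ indexes the $J = 8$ schools. The second argument of each distribution is a scale (standard deviation), as in the code, and the half-Cauchy prior is the Cauchy(0, 5) distribution truncated to positive values:

```latex
\begin{aligned}
\mu &\sim \mathrm{Normal}(0, 5) \\
\tau &\sim \mathrm{HalfCauchy}(0, 5) \\
\theta_j &\sim \mathrm{Normal}(\mu, \tau), \quad j = 1, \dots, J \\
y_j &\sim \mathrm{Normal}(\theta_j, \sigma_j), \quad j = 1, \dots, J
\end{aligned}
```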

First we create our data and set some sampling parameters.

J = 8
y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
σ = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]
schools = [
    "Choate",
    "Deerfield",
    "Phillips Andover",
    "Phillips Exeter",
    "Hotchkiss",
    "Lawrenceville",
    "St. Paul's",
    "Mt. Hermon",
];

nwarmup, nsamples, nchains = 1000, 1000, 4;

Now we write and run the model using Turing:

using Turing

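Turing.@model function turing_model(J, y, σ, ::Type{TV}=Vector{Float64}) where {TV}
    # Population-level mean and scale of the school effects
    μ ~ Normal(0, 5)
    τ ~ truncated(Cauchy(0, 5), 0, Inf)
    # School-level effects (centered parameterization)
    θ = TV(undef, J)
    θ .~ Normal(μ, τ)
    # Observed effects with known standard errors σ
    for i in eachindex(y)
        y[i] ~ Normal(θ[i], σ[i])
    end
    return y
end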

param_mod = turing_model(J, y, σ)
sampler = NUTS(nwarmup, 0.8)

rng = Random.MersenneTwister(16653)
turing_chns = sample(
    rng, param_mod, sampler, MCMCThreads(), nwarmup + nsamples, nchains; progress=false
);

┌ Info: Found initial step size
└   ϵ = 1.6

Most ArviZ functions work fine with Chains objects from Turing:

plot_autocorr(turing_chns; var_names=["μ", "τ"]);
gcf()

Convert to `InferenceData`

For much more powerful querying, analysis, and plotting, we can use built-in ArviZ utilities to convert Chains objects to xarray datasets. Note that we also supply labelling information through coords and dims.

ArviZ is built to work with InferenceData (a NetCDF datastore that loads data into xarray datasets), and the more groups it has access to, the more powerful analyses it can perform.

idata = from_mcmcchains(
    turing_chns;
    coords=Dict("school" => schools),
    dims=Dict("y" => ["school"], "σ" => ["school"], "θ" => ["school"]),
    library="Turing",
)

InferenceData

Each group is an ArviZ.Dataset (a thinly wrapped xarray.Dataset). We can view a summary of the dataset.

idata.posterior

Dataset (xarray.Dataset)
Dimensions:  (chain: 4, draw: 1000, school: 8)
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
  * school   (school) <U16 'Choate' 'Deerfield' ... "St. Paul's" 'Mt. Hermon'
Data variables:
    μ        (chain, draw) float64 2.242 7.977 5.731 5.574 ... 4.085 2.973 7.224
    τ        (chain, draw) float64 5.526 6.236 4.726 6.596 ... 2.249 1.253 2.106
    θ        (chain, draw, school) float64 4.309 9.479 10.76 ... 8.703 8.909
Attributes:
    created_at:         2021-01-20T07:27:15.683692
    arviz_version:      0.11.0
    inference_library:  Turing
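To make the role of coords and dims concrete, here is a sketch in plain Julia (deliberately not the ArviZ or xarray API) of what a labelled dimension amounts to: a list of labels that maps names to positions along one axis of an array. The array `θ` below is random stand-in data, not the posterior shown above.

```julia
using Random

schools = ["Choate", "Deerfield", "Phillips Andover", "Phillips Exeter",
           "Hotchkiss", "Lawrenceville", "St. Paul's", "Mt. Hermon"]
nchains, ndraws = 4, 1000

# Stand-in for draws of θ with dimensions (chain, draw, school).
θ = randn(Random.MersenneTwister(1), nchains, ndraws, length(schools))

# Position of a label along the "school" dimension.
idx = findfirst(==("Deerfield"), schools)

# All draws of θ for that school: size (nchains, ndraws).
θ_deerfield = θ[:, :, idx]
```

The labelled dataset performs this bookkeeping itself, which is why plots and summaries can refer to schools by name.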

Here is a plot of the trace. Note the intelligent labels.

plot_trace(idata);
gcf()

We can also generate summary statistics:

summarystats(idata)

10 rows × 12 columns

	variable	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_mean	ess_sd	ess_bulk	ess_tail	r_hat
	String	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	μ	4.516	3.432	-1.244	11.264	0.191	0.154	323.0	248.0	345.0	174.0	1.01
2	τ	3.761	3.245	0.315	9.546	0.266	0.189	148.0	148.0	66.0	27.0	1.05
3	θ[1]	6.354	5.878	-4.06	16.889	0.261	0.185	507.0	507.0	449.0	1436.0	1.01
4	θ[2]	5.227	4.875	-3.001	14.861	0.216	0.17	511.0	410.0	492.0	1054.0	1.01
5	θ[3]	4.03	5.326	-5.698	13.892	0.196	0.156	741.0	584.0	643.0	1421.0	1.0
6	θ[4]	4.846	4.855	-4.4	13.445	0.21	0.149	534.0	534.0	495.0	1570.0	1.01
7	θ[5]	3.615	4.831	-5.744	11.948	0.201	0.16	576.0	455.0	514.0	1079.0	1.0
8	θ[6]	4.157	5.059	-5.959	13.44	0.191	0.157	698.0	521.0	596.0	1135.0	1.0
9	θ[7]	6.559	5.07	-2.274	15.76	0.271	0.192	350.0	350.0	344.0	1419.0	1.01
10	θ[8]	4.913	5.657	-4.391	16.094	0.21	0.149	726.0	726.0	590.0	1497.0	1.01

We can also examine the energy distribution of the Hamiltonian sampler:

plot_energy(idata);
gcf()

Additional information in Turing.jl

With a few more steps, we can use Turing to compute additional useful groups to add to the InferenceData.

To sample from the prior, call sample with the Prior sampler:

prior = sample(param_mod, Prior(), nsamples; progress=false)

Chains MCMC chain (1000×11×1 Array{Float64,3}):

Iterations        = 1:1000
Thinning interval = 1
Chains            = 1
Samples per chain = 1000
parameters        = θ[1], θ[2], θ[3], θ[4], θ[5], θ[6], θ[7], θ[8], μ, τ
internals         = lp

Summary Statistics
  parameters      mean       std   naive_se      mcse         ess      rhat
      Symbol   Float64   Float64    Float64   Float64     Float64   Float64

        θ[1]    0.9437   67.9184     2.1478    3.0807    556.5721    1.0001
        θ[2]   -0.6113   48.9282     1.5472    1.1096   1023.9913    1.0000
        θ[3]   -1.1874   60.8370     1.9238    1.4449    999.5571    0.9991
        θ[4]    0.1588   52.6023     1.6634    2.0985    812.8250    1.0025
        θ[5]    1.7428   71.2371     2.2527    2.2148    918.9025    0.9993
        θ[6]    2.0271   74.2805     2.3490    2.8746   1004.2599    1.0024
        θ[7]    1.1439   53.3499     1.6871    0.9483   1044.3923    0.9991
        θ[8]    0.7388   58.6210     1.8538    1.8743    964.9287    1.0012
           μ    0.0436    4.8547     0.1535    0.1428   1027.7394    1.0001
           τ   18.2513   68.3494     2.1614    3.0856    638.1239    0.9996

Quantiles
  parameters       2.5%     25.0%     50.0%     75.0%      97.5%
      Symbol    Float64   Float64   Float64   Float64    Float64

        θ[1]   -53.1021   -5.5494    0.0134    5.4924    42.0920
        θ[2]   -47.9095   -4.9937    0.2045    6.0771    38.2048
        θ[3]   -48.4412   -5.0144    0.2915    5.8051    39.0567
        θ[4]   -36.2510   -4.9584    0.4698    6.0296    46.7597
        θ[5]   -41.1661   -5.4494   -0.3852    5.0671    53.3500
        θ[6]   -40.9711   -5.4135    0.2146    5.5908    47.2051
        θ[7]   -39.5905   -5.2035    0.3970    6.4169    50.1237
        θ[8]   -49.2875   -4.9863    0.2679    5.8763    55.9611
           μ    -9.7057   -3.3430   -0.0990    3.3103     9.5634
           τ     0.1994    2.0780    4.9683   11.2756   112.9419

To draw from the prior and posterior predictive distributions, we can instantiate a "predictive model", i.e. a Turing model with the observations set to missing, and then call predict on the predictive model and the previously drawn samples:

# Instantiate the predictive model
param_mod_predict = turing_model(J, similar(y, Missing), σ)
# and then sample!
prior_predictive = predict(param_mod_predict, prior)
posterior_predictive = predict(param_mod_predict, turing_chns)

Chains MCMC chain (1000×8×4 Array{Float64,3}):

Iterations        = 1:1000
Thinning interval = 1
Chains            = 1, 2, 3, 4
Samples per chain = 1000
parameters        = y[1], y[2], y[3], y[4], y[5], y[6], y[7], y[8]
internals         = 

Summary Statistics
  parameters      mean       std   naive_se      mcse         ess      rhat
      Symbol   Float64   Float64    Float64   Float64     Float64   Float64

        y[1]    4.5082   16.0150     0.2532    0.2930   3268.5480    1.0007
        y[2]    4.6493   11.9315     0.1887    0.3013   2426.6130    1.0007
        y[3]    4.5704   17.2858     0.2733    0.3852   3246.5857    1.0000
        y[4]    4.6143   12.6587     0.2002    0.2960   3443.9512    1.0005
        y[5]    4.4985   10.6522     0.1684    0.2672   1895.2847    1.0018
        y[6]    4.7272   12.4434     0.1967    0.2517   3152.6502    1.0012
        y[7]    4.3443   11.7323     0.1855    0.2427   3064.7075    1.0001
        y[8]    4.9657   18.9625     0.2998    0.3368   3958.3019    1.0001

Quantiles
  parameters       2.5%     25.0%     50.0%     75.0%     97.5%
      Symbol    Float64   Float64   Float64   Float64   Float64

        y[1]   -26.7190   -6.0134    4.6518   15.1441   35.9749
        y[2]   -18.8314   -3.2552    4.5512   12.5614   27.9407
        y[3]   -29.6524   -6.9376    4.6521   16.0361   38.4275
        y[4]   -20.5275   -3.5383    4.7111   12.8250   29.4160
        y[5]   -15.8879   -2.7019    4.7310   11.4978   25.5380
        y[6]   -19.2152   -3.5181    4.4492   13.2566   28.6721
        y[7]   -19.1155   -3.2005    4.6943   12.0731   26.8739
        y[8]   -31.5715   -8.0353    4.6702   17.8301   41.3367
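The "observations set to missing" step above relies on `similar(y, Missing)`. As a small check in Base Julia, this returns an array with the same shape as `y` whose element type is `Missing`, so every entry is `missing`:

```julia
y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]

# Same shape as `y`, but with element type `Missing`.
y_missing = similar(y, Missing)

eltype(y_missing)          # Missing
all(ismissing, y_missing)  # true
```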

We can also extract the pointwise log-likelihoods, which are useful for computing metrics such as loo:

loglikelihoods = Turing.pointwise_loglikelihoods(param_mod, turing_chns)

Dict{String,Array{Float64,2}} with 8 entries:
  "y[6]" => [-3.36285 -4.08918 -3.39046 -3.32844; -3.81335 -3.62125 -3.38203 -3…
  "y[2]" => [-3.23247 -3.22337 -3.23758 -3.65949; -3.28484 -3.26335 -3.23609 -3…
  "y[1]" => [-4.87426 -4.878 -4.71977 -5.22698; -4.08751 -4.70315 -4.60937 -4.7…
  "y[5]" => [-3.2681 -4.81419 -3.37238 -3.16239; -3.43789 -3.95455 -3.39972 -3.…
  "y[8]" => [-4.03523 -3.81134 -3.86919 -3.86369; -3.81057 -3.82789 -3.86432 -3…
  "y[7]" => [-3.72935 -4.31431 -3.92227 -4.56106; -3.4916 -3.48724 -3.94398 -4.…
  "y[3]" => [-4.06155 -3.84478 -3.85029 -3.69183; -3.73377 -3.69364 -3.85202 -3…
  "y[4]" => [-4.04857 -3.33035 -3.31907 -3.32388; -3.75523 -3.33412 -3.31756 -3…

This can then be included in the from_mcmcchains call from above:

using LinearAlgebra
# Ensure the ordering of the loglikelihoods matches the ordering of `posterior_predictive`
ynames = string.(keys(posterior_predictive))
loglikelihoods_vals = getindex.(Ref(loglikelihoods), ynames)
# Reshape into `(nchains, nsamples, size(y)...)`
loglikelihoods_arr = permutedims(cat(loglikelihoods_vals...; dims=3), (2, 1, 3))

idata = from_mcmcchains(
    turing_chns;
    posterior_predictive=posterior_predictive,
    log_likelihood=Dict("y" => loglikelihoods_arr),
    prior=prior,
    prior_predictive=prior_predictive,
    observed_data=Dict("y" => y),
    coords=Dict("school" => schools),
    dims=Dict("y" => ["school"], "σ" => ["school"], "θ" => ["school"]),
    library="Turing",
)

InferenceData
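The reshaping step above can be checked on dummy data. This sketch assumes that each value in `loglikelihoods` is an `(ndraws, nchains)` matrix, which is inferred from the printed dictionary (four columns per row) rather than from the Turing documentation; the name `vals` is a hypothetical stand-in for `loglikelihoods_vals`.

```julia
ndraws, nchains, nobs = 1000, 4, 8

# Stand-in values: one (ndraws, nchains) matrix per observation, filled with its index.
vals = [fill(Float64(i), ndraws, nchains) for i in 1:nobs]

# Stack along a new third dimension: size (ndraws, nchains, nobs).
stacked = cat(vals...; dims=3)

# Swap the first two dimensions: size (nchains, ndraws, nobs).
arr = permutedims(stacked, (2, 1, 3))

size(arr)  # (4, 1000, 8)
```

After the permutation, `arr[:, :, i]` holds the values for observation `i`, with chains along the first axis and draws along the second.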

Then we can, for example, compute the expected log pointwise predictive density using leave-one-out (LOO) cross-validation, which is an estimate of the out-of-sample predictive fit of the model:

loo(idata) # higher is better

1 rows × 7 columns

	loo	loo_se	p_loo	n_samples	n_data_points	warning	loo_scale
	Float64	Float64	Float64	Int64	Int64	Bool	String
1	-30.7109	1.34611	0.881344	4000	8	0	log
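For reference, the quantity that LOO cross-validation estimates can be written as a sum over the $n$ data points (here the 8 schools), where $y_{-i}$ denotes the data with the $i$-th observation left out. This is the standard definition, given as background rather than taken from the ArviZ source:

```latex
\mathrm{elpd}_{\mathrm{loo}} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i}),
\qquad
p(y_i \mid y_{-i}) = \int p(y_i \mid \theta)\, p(\theta \mid y_{-i})\, d\theta
```

On the log scale reported above, the loo column is an estimate of this sum.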

If the model is well-calibrated, i.e. it replicates the true generative process well, the LOO probability integral transform (PIT) values, i.e. the leave-one-out predictive CDF evaluated at each observation, should be approximately uniformly distributed. This can be inspected visually:

plot_loo_pit(idata; y="y", ecdf=true);
gcf()
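The idea behind this check can be sketched with an ordinary (not leave-one-out) probability integral transform, using only Base Julia and the Random standard library. The example is a simplification under stated assumptions: the data come from a standard exponential distribution and the "model" is that same distribution, so it is perfectly calibrated. The LOO version replaces the model CDF with the leave-one-out predictive CDF for each observation.

```julia
using Random

rng = Random.MersenneTwister(42)

# Observations from a standard exponential distribution via inverse-CDF sampling.
y = -log.(1 .- rand(rng, 100_000))

# Probability integral transform: the model's CDF evaluated at each observation.
# Here the model matches the data-generating process exactly.
pit = 1 .- exp.(-y)

# For a calibrated model these values behave like Uniform(0, 1) draws,
# so their average should be close to 0.5.
pit_mean = sum(pit) / length(pit)
```

A miscalibrated model would instead give PIT values that pile up near 0, near 1, or in the middle of the unit interval.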

Plotting with CmdStan.jl outputs

CmdStan.jl and StanSample.jl also default to producing Chains outputs, and we can easily plot these chains.

Here is the same centered eight schools model:

using CmdStan, MCMCChains

schools_code = """
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta[J];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}

generated quantities {
    vector[J] log_lik;
    vector[J] y_hat;
    for (j in 1:J) {
        log_lik[j] = normal_lpdf(y[j] | theta[j], sigma[j]);
        y_hat[j] = normal_rng(theta[j], sigma[j]);
    }
}
"""

schools_dat = Dict("J" => J, "y" => y, "sigma" => σ)
stan_model = Stanmodel(;
    model=schools_code,
    name="schools",
    nchains=nchains,
    num_warmup=nwarmup,
    num_samples=nsamples,
    output_format=:mcmcchains,
    random=CmdStan.Random(28983),
)
_, stan_chns, _ = stan(stan_model, schools_dat; summary=false);

File /home/runner/work/ArviZ.jl/ArviZ.jl/docs/build/tmp/schools.stan will be updated.

plot_density(stan_chns; var_names=["mu", "tau"]);
gcf()

Again, converting to InferenceData gives us much richer labelling and mixing of data. Note that we're using the same from_cmdstan function that ArviZ uses to process CmdStan output files. Through multiple dispatch in Julia, passing a Chains object instead selects ArviZ.jl's overloads, which forward to from_mcmcchains.

idata = from_cmdstan(
    stan_chns;
    posterior_predictive="y_hat",
    observed_data=Dict("y" => schools_dat["y"]),
    log_likelihood="log_lik",
    coords=Dict("school" => schools),
    dims=Dict(
        "y" => ["school"],
        "sigma" => ["school"],
        "theta" => ["school"],
        "log_lik" => ["school"],
        "y_hat" => ["school"],
    ),
)

InferenceData
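The dispatch behaviour described above can be illustrated with a toy sketch. All names here (`ToyChains`, `from_toychains`, `from_toystan`) are hypothetical and are not part of ArviZ.jl; the sketch only shows how one function name can handle file paths and in-memory chains through separate methods.

```julia
# Hypothetical stand-in for an in-memory chains object.
struct ToyChains
    values::Vector{Float64}
end

# Hypothetical converter for in-memory chains.
from_toychains(chns::ToyChains) = (source=:chains, ndraws=length(chns.values))

# Method for file paths: a real implementation would parse output files on disk.
from_toystan(path::AbstractString) = (source=:file, path=path)

# Method for chains objects: forwards to the chains converter.
from_toystan(chns::ToyChains) = from_toychains(chns)

from_chains = from_toystan(ToyChains([0.1, 0.2, 0.3]))  # (source = :chains, ndraws = 3)
from_file = from_toystan("output.csv")                  # (source = :file, path = "output.csv")
```

Julia picks the method from the argument type at the call site, so the caller uses a single entry point for both kinds of input.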

Here is a plot showing where the Hamiltonian sampler had divergences:

plot_pair(
    idata;
    coords=Dict("school" => ["Choate", "Deerfield", "Phillips Andover"]),
    divergences=true,
);
gcf()

Plotting with Soss.jl outputs

With Soss, we can define our model for the posterior and easily use it to draw samples from the prior, prior predictive, posterior, and posterior predictive distributions.

First we define our model:

using Soss, NamedTupleTools

mod = Soss.@model (J, σ) begin
    μ ~ Normal(0, 5)
    τ ~ HalfCauchy(5)
    θ ~ iid(J)(Normal(μ, τ))
    y ~ For(1:J) do j
        Normal(θ[j], σ[j])
    end
end

constant_data = (J=J, σ=σ)
param_mod = mod(; constant_data...)

Joint Distribution
    Bound arguments: [J, σ]
    Variables: [τ, μ, θ, y]

@model (J, σ) begin
        τ ~ HalfCauchy(5)
        μ ~ Normal(0, 5)
        θ ~ (iid(J))(Normal(μ, τ))
        y ~ For(1:J) do j
                Normal(θ[j], σ[j])
            end
    end

Then we draw from the prior and prior predictive distributions.

Random.seed!(5298)
prior_priorpred = [
    map(1:(nchains * nsamples)) do _
        draw = rand(param_mod)
        return delete(draw, keys(constant_data))
    end,
];

Next, we draw from the posterior using DynamicHMC.jl.

post = map(1:nchains) do _
    dynamicHMC(param_mod, (y=y,), nsamples)
end;

Finally, we update the posterior samples with draws from the posterior predictive distribution.

pred = predictive(mod, :μ, :τ, :θ)
post_postpred = map(post) do post_draws
    map(post_draws) do post_draw
        pred_draw = rand(pred(post_draw))
        pred_draw = delete(pred_draw, keys(constant_data))
        return merge(pred_draw, post_draw)
    end
end;
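The per-draw bookkeeping above (dropping the constant data, then merging predictive and posterior draws) can be sketched with plain NamedTuples in Base Julia. The draws below are hypothetical values, and the field filtering stands in for the `delete` function from NamedTupleTools used above.

```julia
constant_data = (J=2, σ=[15.0, 10.0])

# Hypothetical draws: a posterior draw, and a predictive draw that also carries the constants.
post_draw = (μ=4.2, τ=3.1, θ=[5.0, 3.9])
pred_draw = (J=2, σ=[15.0, 10.0], y=[7.3, -1.2])

# Drop the constant-data fields, keeping the remaining fields in order.
keep = filter(k -> !(k in keys(constant_data)), keys(pred_draw))
pred_only = NamedTuple{keep}(pred_draw)

# Combine both draws into one NamedTuple; fields of the first argument come first.
combined = merge(pred_only, post_draw)

keys(combined)  # (:y, :μ, :τ, :θ)
```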

Each Soss draw is a NamedTuple. We can plot the rank order statistics of the posterior to identify poor convergence:

plot_rank(post; var_names=["μ", "τ"]);
gcf()

Now we combine all of the samples into an InferenceData:

idata = from_namedtuple(
    post_postpred;
    posterior_predictive=[:y],
    prior=prior_priorpred,
    prior_predictive=[:y],
    observed_data=(y=y,),
    constant_data=constant_data,
    coords=Dict("school" => schools),
    dims=Dict("y" => ["school"], "σ" => ["school"], "θ" => ["school"]),
    library=Soss,
)

InferenceData

We can compare the prior and posterior predictive distributions:

plot_density(
    [idata.posterior_predictive, idata.prior_predictive];
    data_labels=["Post-pred", "Prior-pred"],
    var_names=["y"],
);
gcf()

Environment

using Pkg
Pkg.status()

Status `~/work/ArviZ.jl/ArviZ.jl/docs/Project.toml`
  [131c737c] ArviZ v0.4.11 `~/work/ArviZ.jl/ArviZ.jl`
  [593b3428] CmdStan v6.1.6
  [31c24e10] Distributions v0.23.8
  [e30172f5] Documenter v0.26.1
  [c7f686f2] MCMCChains v4.4.0
  [d9ec5142] NamedTupleTools v0.13.7
  [438e738f] PyCall v1.92.2
  [d330b81b] PyPlot v2.9.0
  [8ce77f84] Soss v0.14.4
  [fce5fe82] Turing v0.14.12
  [37e2e46d] LinearAlgebra

using InteractiveUtils
versioninfo()

Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake-avx512)
Environment:
  JULIA_CMDSTAN_HOME = /home/runner/work/ArviZ.jl/ArviZ.jl/.cmdstan//cmdstan-2.25.0/
  JULIA_NUM_THREADS = 2