Mathsy Coding

Exploring Population Differences Using Bayesian Analysis In R with RStan

This post is concerned with the eventual goal of exploring population differences (e.g. differences in means) using Bayesian analysing using Stan and RStan, the R interface to Stan.

Continuing on with the theme of modelling population differences, this post will examine that process using the popular Bayesian language / platform, Stan.

The first step will be to install RStan. This is an interface between Stan and R, allowing you to run Stan models from within R. This process is notoriously fiddly and difficult, but the instructions on the RStan page are very good and as long as you follow the steps you should be fine.

First, let’s load the package.

library(rstan)

# recommended options
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
rstan_options(threads_per_chain = 1)

packageVersion("rstan")
[1] ‘2.32.3’

The first thing that we’ll do is run a very simple model: recovering the parameters for a Gaussian from a random selection from the distribution.

wNormal(μ,σ)μNormal(0,1)σHalfNormal(1)\begin{align*} w &\sim \text{Normal}(\mu, \sigma) \\ \mu &\sim \text{Normal}(0, 1) \\ \sigma &\sim \text{HalfNormal}(1) \\ \end{align*}
NUM_SAMPLES <- 1e3
mu <- 0
sigma <- 1

w <- rnorm(NUM_SAMPLES, mean=mu, sd=sigma)

Now that we have the data, we need to write the model in Stan. The format is very similar to the way that we wrote the model above. The code for the above model looks like the below in Stan.

// simple_model.stan

// this section is the data that we will add to the model
data {
    int<lower=0> n; // number of samples
    vector[n] w; // the actual samples
}

// these are the parameters to be used in the model
parameters {
    real mu;
    real<lower=0> sigma;
}

// the actual model
model {
    mu ~ normal(0, 1);
    sigma ~ normal(0, 1);

    w ~ normal(mu, sigma); // this describes how the data are distributed
}

Note that in order to avoid an error saying something about an incomplete line, just add a blank line to the end of the Stan file. Apparently it’s a known issue.

Now let’s fit the model using the stan, get a summary, and then extract the parameters (as samples). We need to specify the file which contains the Stan models, along with the data (formatted as a list).

data <- list(n=length(w), w=w)
fit <- stan(file='simple_model.stan', data=data)
summary(fit)
$summary
A matrix: 3 × 10 of type dbl
meanse_meansd2.5%25%50%75%97.5%n_effRhat
mu 0.025972360.00058278450.03262796 -0.04121622 4.538632e-03 0.0260687 0.04786533 0.090498423134.4681.000547
sigma 1.014450240.00042672070.02287461 0.97043331 9.984133e-01 1.0143704 1.03004736 1.059247372873.5571.001364
lp__-515.249687150.02581275571.02424096-518.00629800-5.156437e+02-514.9284938-514.51819964-514.249478921574.4741.001698
$c_summary

Looking at the mean for our parameters, it looks like our model did a good job of recovering the values. Let’s take a mode detailed look by extracting some samples.

params <- extract(fit)
summary(params)
      Length Class  Mode
mu    4000   -none- numeric
sigma 4000   -none- numeric
lp__  4000   -none- numeric

This contains the different samples for the parameters. With the benefit of knowing the underlying distribution, we can plot some of these to see how well we did.

library(ggplot2)
options(repr.plot.width=15, repr.plot.height=8)
x <- seq(-4, 4, by=0.1);
true <- dnorm(x, 0, 1)

plt <- ggplot(data.frame(x=x, y=true), aes(x, y)) +
    geom_line(colour='black', linewidth=2) +
    labs(x='w', y='Density', title="Simulated data for the mean")

for (index in sample(1:nrow(params$mu), 20, replace = TRUE)) {
    df <- data.frame(x=x, y=dnorm(x, mean=params$mu[index]), sd=params$sigma[index])
    plt <- plt +
        geom_line(data=df, mapping=aes(x, y), colour='blue', alpha=0.2)
}

print(plt)

Plot of our predicted and actual data, showing a good match

So this looks pretty good!

Of course, we could do better by doing a posterior predictive check - that is, using our final model to generate some data to ensure that it looks like the data that was used to fit the model. To do this, we can actually add some predicted data into the model. We just need to make a small modification to the Stan model.

// simple_model_ppc.stan

// this section is the data that we will add to the model
data {
    int<lower=0> n; // number of samples
    vector[n] w; // the actual samples
}

// these are the parameters to be used in the model
parameters {
    real mu;
    real<lower=0> sigma;
}

// the actual model
model {
    mu ~ normal(0, 1);
    sigma ~ normal(0, 1);

    w ~ normal(mu, sigma); // this describes how the data are distributed
}

// this is the section that we change - we want to add a posterior predictive check
generated quantities {
    vector[n] w_sim;

    for (i in 1:n) {
        w_sim[i] = normal_rng(mu, sigma);
    }
}
ppc_fit <- stan(file="simple_model_ppc.stan", data=data)
summary(ppc_fit)
ppc_params <- extract(ppc_fit)
$summary
A matrix: 1003 × 10 of type dbl
meanse_meansd2.5%25%50%75%97.5%n_effRhat
mu 0.0258104500.00053882020.03148880-0.03338427 0.004332818 0.0253221610.046922520.087768913415.2661.0005321
sigma 1.0159132010.00037546660.02276205 0.97354499 0.999955002 1.0150595561.031077221.061835583675.1920.9999481
w_sim[1] 0.0125611310.01611723911.02058732-2.01894175-0.657810006 0.0072094960.692604421.971016914009.7660.9996054
w_sim[2] 0.0402530270.01613148941.01840079-1.91445011-0.656381401 0.0305973710.735120732.012089623985.5520.9995883
w_sim[3] 0.0227061460.01605278651.01124083-1.95816825-0.651363176 0.0449163550.711214701.961462123968.3351.0006904
w_sim[28] 0.0338968620.01728993481.01121499-2.00957414-0.629203502 0.0353720240.707140911.990013653420.5841.0002338
w_sim[972] 5.552801e-020.016230831.0229120 -1.983298 -0.6182711 5.770454e-02 0.7213428 2.0550633971.8720.9997000
w_sim[1000] 4.159035e-020.017201121.0235276 -2.003147 -0.6280401 6.721470e-02 0.7225033 2.0640013540.6700.9998228
lp__-5.152061e+020.022516790.9588736-517.805703-515.5911481-5.149248e+02-514.5125150-514.2510231813.4671.0002974
$c_summary

Now we can compare our data to the simulated results by choosing 10 random bunches of simulated data and graphing it, along with the actual data.

df <- data.frame(w=w)

plt <- ggplot(df, aes(w)) +
    geom_density(aes(y=after_stat(scaled)), linewidth=2, colour='black') +
    labs(x='w', y="Density", title="Comparison of actual vs. simulated data")


for(index in sample(1:nrow(ppc_params$w_sim), 10)) {
    plt <- plt +
        geom_density(data=data.frame(x=ppc_params$w_sim[index,]), mapping=aes(x, y=after_stat(scaled)), colour=alpha('blue', 0.2))
}
print(plt)

Another plot of the density of the data with some sample densities, showing a good match between the data

Using Summary Statistics

In addition to using the actual data, we can also use the summary statistics (sample mean and sample standard deviation) to generate a posterior. I’ve found this to be useful in the case where the raw data is inaccessible (due to age or other restrictions) but the summary statistics have been published. To do this, we rely on two facts:

XˉNormal(μ,σ/n)(n1)s2σ2χ2(n1)\begin{align*} \bar{X} &\sim \text{Normal}(\mu, \sigma / \sqrt{n}) \\ \frac{(n-1)s^2}{\sigma^2} &\sim \chi^2(n - 1) \\ \end{align*}

Both of these facts can be encode in a Stan model.

// simple_model_ppc_summary.stan

// this section is the data that we will add to the model
data {
    int<lower=0> n; // number of samples
    real x_bar; // sample mean
    real<lower=0> s; // sample standard deviation
}

// these are the parameters to be used in the model
parameters {
    real mu;
    real<lower=0> sigma;
}

// the actual model
model {
    mu ~ normal(0, 1);
    sigma ~ normal(0, 1);

    real modified_s = (n - 1) * s^2 / sigma^2;

    x_bar ~ normal(mu, sigma / sqrt(n)); // this describes how the data are distributed
    modified_s ~ chi_square(n - 1);
}

// data for the posterior predictive check
generated quantities {
    vector[n] w_sim;

    for (i in 1:n) {
        w_sim[i] = normal_rng(mu, sigma);
    }
}
summary_fit <- stan(file='simple_model_ppc_summary.stan', data=list(n=length(w), x_bar=mean(w), s=sd(w)))
summary(summary_fit)
$summary
A matrix: 1003 × 10 of type dbl
meanse_meansd2.5%25%50%75%97.5%n_effRhat
mu 0.0248342350.00059160980.03339289-0.03943811 0.002238256 0.0246394150.047064070.091415253185.9371.0013604
sigma 1.0154341820.00041439650.02306986 0.97260845 0.999503487 1.0149249751.030535491.061602833099.2581.0000964
w_sim[1] 0.0144550450.01590131041.01510507-1.91725272-0.690326722 0.0181458230.695725282.004110454075.2681.0010727
w_sim[28] 0.0423646980.01573517430.99763643-1.93211158-0.630438791 0.0468434070.725745471.979588254019.7731.0014029
w_sim[972] 1.024258e-020.016530871.0225007 -2.018235 -0.6773512 2.487396e-02 0.7090567 1.9802033825.9181.0001160
w_sim[1000]-5.485209e-030.017076011.0254360 -2.008587 -0.6838811-2.336391e-04 0.6810644 2.0158773606.1531.0010102
lp__ 2.945402e+030.026359601.06397372942.6021442944.9820042 2.945726e+032946.14662812946.4302911629.2371.0020386
$c_summary
summary_params <- extract(summary_fit)

Once again, we can visually check against the original data.

df <- data.frame(w=w)

plt <- ggplot(df, aes(w)) +
    geom_density(aes(y=after_stat(scaled)), linewidth=2, colour='black') +
    labs(x='w', y="Density", title="Comparison of actual vs. simulated data (from summary data)")


for(index in sample(1:nrow(summary_params$w_sim), 10)) {
    plt <- plt +
        geom_density(data=data.frame(x=summary_params$w_sim[index,]), mapping=aes(x, y=after_stat(scaled)), colour=alpha('blue', 0.2))
}
print(plt)

A plot of a good match between the original data and some data simulated from the posterior

Difference of Means

Now that we have the basics out of the way, we can look at the distribution of the differences in means between two populations! The first method we’ll use is to model each population separately, and then get the distribution of the difference of means from that.

data {
    int<lower=0> N;
    vector[N] w_1;
    vector[N] w_2;
}

parameters {
    real mu_1;
    real mu_2;
    real<lower=0> sigma;
}

model {
    mu_1 ~ normal(0, 2);
    mu_2 ~ normal(0, 2);
    sigma ~ normal(0, 1);

    w_1 ~ normal(mu_1, sigma);
    w_2 ~ normal(mu_2, sigma);
}
mu <-c(-1, 1)
sigma <- 1

w_1 <- rnorm(NUM_SAMPLES, mean=mu[1], sd=sigma)
w_2 <- rnorm(NUM_SAMPLES, mean=mu[2], sd=sigma)

difference_fit <- stan(file='difference_means.stan', data=list(N=NUM_SAMPLES, w_1=w_1, w_2=w_2))
summary(difference_fit)
$summary
A matrix: 4 × 10 of type dbl
meanse_meansd2.5%25%50%75%97.5%n_effRhat
mu_1 -1.01064970.00047848630.03100593 -1.0720868 -1.0305666 -1.0105328 -0.9895498 -0.95092284199.0430.9995809
mu_2 0.99767960.00049091180.03090297 0.9376764 0.9767728 0.9977505 1.0185254 1.05871813962.7210.9996719
sigma 0.98201120.00022071970.01584171 0.9512097 0.9714469 0.9818158 0.9922650 1.01314785151.3621.0000500
lp__-964.00377070.02756469361.26107968-967.1965946-964.5648992-963.6988244-963.0975006-962.58873012093.0461.0043513
$c_summary
difference_params <- extract(difference_fit)
differences <- difference_params$mu_1 - difference_params$mu_2

ggplot(data.frame(x=differences), aes(x)) +
    geom_density(aes(y=after_stat(scaled))) +
    labs(x="Difference", y="Density", title="Distribution of the Difference")

A plot of the empirical distribution of the difference of means, showing it centred around -2 as we expectd.

This is basically what we expected to see - a difference of 2. If we wanted, we could also extract a credible interval from the data.

diff_ci <- quantile(differences, probs=c(0.025, 0.975))
diff_ci
2.5%
-2.09156770868759
97.5%
-1.92164095109226

So our 95% credible interval is roughly -2.1 - -1.9.

Of course, if all we wanted was the difference, we could also just model the difference itself as a normally distributed random variable. Since we’ve done that quite a bit already in this post, I’ll leave that one as an exercise for the reader.

Conclusion

In this post, I’ve gone through a few simple examples of using Stan through the RStan package in order to fit Bayesian models, with the the goal of looking at population differences (e.g. a difference in the means). Personally, I quite like using Stan; everything seems to work in a reasonable way, the documentation is good, and creating and extracting information from the models is straightforward.

Further Reading