Exploring Population Differences Using Bayesian Analysis In R with RStan

Apr 1, 2024

This post is concerned with the eventual goal of exploring population differences (e.g. differences in means) using Bayesian analysing using Stan and RStan, the R interface to Stan.

Continuing on with the theme of modelling population differences, this post will examine that process using the popular Bayesian language / platform, Stan.

The first step will be to install RStan. This is an interface between Stan and R, allowing you to run Stan models from within R. This process is notoriously fiddly and difficult, but the instructions on the RStan page are very good and as long as you follow the steps you should be fine.

First, let’s load the package.

library(rstan)

# recommended options
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
rstan_options(threads_per_chain = 1)

packageVersion("rstan")

[1] ‘2.32.3’

The first thing that we’ll do is run a very simple model: recovering the parameters for a Gaussian from a random selection from the distribution.

\begin{align*} w &\sim \text{Normal}(\mu, \sigma) \\ \mu &\sim \text{Normal}(0, 1) \\ \sigma &\sim \text{HalfNormal}(1) \\ \end{align*}

NUM_SAMPLES <- 1e3
mu <- 0
sigma <- 1

w <- rnorm(NUM_SAMPLES, mean=mu, sd=sigma)

Now that we have the data, we need to write the model in Stan. The format is very similar to the way that we wrote the model above. The code for the above model looks like the below in Stan.

// simple_model.stan

// this section is the data that we will add to the model
data {
    int<lower=0> n; // number of samples
    vector[n] w; // the actual samples
}

// these are the parameters to be used in the model
parameters {
    real mu;
    real<lower=0> sigma;
}

// the actual model
model {
    mu ~ normal(0, 1);
    sigma ~ normal(0, 1);

    w ~ normal(mu, sigma); // this describes how the data are distributed
}

Note that in order to avoid an error saying something about an incomplete line, just add a blank line to the end of the Stan file. Apparently it’s a known issue.

Now let’s fit the model using the stan, get a summary, and then extract the parameters (as samples). We need to specify the file which contains the Stan models, along with the data (formatted as a list).

data <- list(n=length(w), w=w)
fit <- stan(file='simple_model.stan', data=data)
summary(fit)

$summary

A matrix: 3 × 10 of type dbl
	mean	se_mean	sd	2.5%	25%	50%	75%	97.5%	n_eff	Rhat
mu	0.02597236	0.0005827845	0.03262796	-0.04121622	4.538632e-03	0.0260687	0.04786533	0.09049842	3134.468	1.000547
sigma	1.01445024	0.0004267207	0.02287461	0.97043331	9.984133e-01	1.0143704	1.03004736	1.05924737	2873.557	1.001364
lp__	-515.24968715	0.0258127557	1.02424096	-518.00629800	-5.156437e+02	-514.9284938	-514.51819964	-514.24947892	1574.474	1.001698

$c_summary

Looking at the mean for our parameters, it looks like our model did a good job of recovering the values. Let’s take a mode detailed look by extracting some samples.

params <- extract(fit)
summary(params)

      Length Class  Mode
mu    4000   -none- numeric
sigma 4000   -none- numeric
lp__  4000   -none- numeric

This contains the different samples for the parameters. With the benefit of knowing the underlying distribution, we can plot some of these to see how well we did.

library(ggplot2)
options(repr.plot.width=15, repr.plot.height=8)

x <- seq(-4, 4, by=0.1);
true <- dnorm(x, 0, 1)

plt <- ggplot(data.frame(x=x, y=true), aes(x, y)) +
    geom_line(colour='black', linewidth=2) +
    labs(x='w', y='Density', title="Simulated data for the mean")

for (index in sample(1:nrow(params$mu), 20, replace = TRUE)) {
    df <- data.frame(x=x, y=dnorm(x, mean=params$mu[index]), sd=params$sigma[index])
    plt <- plt +
        geom_line(data=df, mapping=aes(x, y), colour='blue', alpha=0.2)
}

print(plt)

Plot of our predicted and actual data, showing a good match

So this looks pretty good!

Of course, we could do better by doing a posterior predictive check - that is, using our final model to generate some data to ensure that it looks like the data that was used to fit the model. To do this, we can actually add some predicted data into the model. We just need to make a small modification to the Stan model.

// simple_model_ppc.stan

// this section is the data that we will add to the model
data {
    int<lower=0> n; // number of samples
    vector[n] w; // the actual samples
}

// these are the parameters to be used in the model
parameters {
    real mu;
    real<lower=0> sigma;
}

// the actual model
model {
    mu ~ normal(0, 1);
    sigma ~ normal(0, 1);

    w ~ normal(mu, sigma); // this describes how the data are distributed
}

// this is the section that we change - we want to add a posterior predictive check
generated quantities {
    vector[n] w_sim;

    for (i in 1:n) {
        w_sim[i] = normal_rng(mu, sigma);
    }
}

ppc_fit <- stan(file="simple_model_ppc.stan", data=data)
summary(ppc_fit)
ppc_params <- extract(ppc_fit)

$summary

A matrix: 1003 × 10 of type dbl
	mean	se_mean	sd	2.5%	25%	50%	75%	97.5%	n_eff	Rhat
mu	0.025810450	0.0005388202	0.03148880	-0.03338427	0.004332818	0.025322161	0.04692252	0.08776891	3415.266	1.0005321
sigma	1.015913201	0.0003754666	0.02276205	0.97354499	0.999955002	1.015059556	1.03107722	1.06183558	3675.192	0.9999481
w_sim[1]	0.012561131	0.0161172391	1.02058732	-2.01894175	-0.657810006	0.007209496	0.69260442	1.97101691	4009.766	0.9996054
w_sim[2]	0.040253027	0.0161314894	1.01840079	-1.91445011	-0.656381401	0.030597371	0.73512073	2.01208962	3985.552	0.9995883
w_sim[3]	0.022706146	0.0160527865	1.01124083	-1.95816825	-0.651363176	0.044916355	0.71121470	1.96146212	3968.335	1.0006904
w_sim[28]	0.033896862	0.0172899348	1.01121499	-2.00957414	-0.629203502	0.035372024	0.70714091	1.99001365	3420.584	1.0002338
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
w_sim[972]	5.552801e-02	0.01623083	1.0229120	-1.983298	-0.6182711	5.770454e-02	0.7213428	2.055063	3971.872	0.9997000
w_sim[1000]	4.159035e-02	0.01720112	1.0235276	-2.003147	-0.6280401	6.721470e-02	0.7225033	2.064001	3540.670	0.9998228
lp__	-5.152061e+02	0.02251679	0.9588736	-517.805703	-515.5911481	-5.149248e+02	-514.5125150	-514.251023	1813.467	1.0002974

$c_summary

Now we can compare our data to the simulated results by choosing 10 random bunches of simulated data and graphing it, along with the actual data.

df <- data.frame(w=w)

plt <- ggplot(df, aes(w)) +
    geom_density(aes(y=after_stat(scaled)), linewidth=2, colour='black') +
    labs(x='w', y="Density", title="Comparison of actual vs. simulated data")


for(index in sample(1:nrow(ppc_params$w_sim), 10)) {
    plt <- plt +
        geom_density(data=data.frame(x=ppc_params$w_sim[index,]), mapping=aes(x, y=after_stat(scaled)), colour=alpha('blue', 0.2))
}
print(plt)

Another plot of the density of the data with some sample densities, showing a good match between the data

Using Summary Statistics

In addition to using the actual data, we can also use the summary statistics (sample mean and sample standard deviation) to generate a posterior. I’ve found this to be useful in the case where the raw data is inaccessible (due to age or other restrictions) but the summary statistics have been published. To do this, we rely on two facts:

\begin{align*} \bar{X} &\sim \text{Normal}(\mu, \sigma / \sqrt{n}) \\ \frac{(n-1)s^2}{\sigma^2} &\sim \chi^2(n - 1) \\ \end{align*}

Both of these facts can be encode in a Stan model.

// simple_model_ppc_summary.stan

// this section is the data that we will add to the model
data {
    int<lower=0> n; // number of samples
    real x_bar; // sample mean
    real<lower=0> s; // sample standard deviation
}

// these are the parameters to be used in the model
parameters {
    real mu;
    real<lower=0> sigma;
}

// the actual model
model {
    mu ~ normal(0, 1);
    sigma ~ normal(0, 1);

    real modified_s = (n - 1) * s^2 / sigma^2;

    x_bar ~ normal(mu, sigma / sqrt(n)); // this describes how the data are distributed
    modified_s ~ chi_square(n - 1);
}

// data for the posterior predictive check
generated quantities {
    vector[n] w_sim;

    for (i in 1:n) {
        w_sim[i] = normal_rng(mu, sigma);
    }
}

summary_fit <- stan(file='simple_model_ppc_summary.stan', data=list(n=length(w), x_bar=mean(w), s=sd(w)))
summary(summary_fit)

$summary

A matrix: 1003 × 10 of type dbl
	mean	se_mean	sd	2.5%	25%	50%	75%	97.5%	n_eff	Rhat
mu	0.024834235	0.0005916098	0.03339289	-0.03943811	0.002238256	0.024639415	0.04706407	0.09141525	3185.937	1.0013604
sigma	1.015434182	0.0004143965	0.02306986	0.97260845	0.999503487	1.014924975	1.03053549	1.06160283	3099.258	1.0000964
w_sim[1]	0.014455045	0.0159013104	1.01510507	-1.91725272	-0.690326722	0.018145823	0.69572528	2.00411045	4075.268	1.0010727
w_sim[28]	0.042364698	0.0157351743	0.99763643	-1.93211158	-0.630438791	0.046843407	0.72574547	1.97958825	4019.773	1.0014029
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
w_sim[972]	1.024258e-02	0.01653087	1.0225007	-2.018235	-0.6773512	2.487396e-02	0.7090567	1.980203	3825.918	1.0001160
w_sim[1000]	-5.485209e-03	0.01707601	1.0254360	-2.008587	-0.6838811	-2.336391e-04	0.6810644	2.015877	3606.153	1.0010102
lp__	2.945402e+03	0.02635960	1.0639737	2942.602144	2944.9820042	2.945726e+03	2946.1466281	2946.430291	1629.237	1.0020386

$c_summary

summary_params <- extract(summary_fit)

Once again, we can visually check against the original data.

df <- data.frame(w=w)

plt <- ggplot(df, aes(w)) +
    geom_density(aes(y=after_stat(scaled)), linewidth=2, colour='black') +
    labs(x='w', y="Density", title="Comparison of actual vs. simulated data (from summary data)")


for(index in sample(1:nrow(summary_params$w_sim), 10)) {
    plt <- plt +
        geom_density(data=data.frame(x=summary_params$w_sim[index,]), mapping=aes(x, y=after_stat(scaled)), colour=alpha('blue', 0.2))
}
print(plt)

A plot of a good match between the original data and some data simulated from the posterior

Difference of Means

Now that we have the basics out of the way, we can look at the distribution of the differences in means between two populations! The first method we’ll use is to model each population separately, and then get the distribution of the difference of means from that.

data {
    int<lower=0> N;
    vector[N] w_1;
    vector[N] w_2;
}

parameters {
    real mu_1;
    real mu_2;
    real<lower=0> sigma;
}

model {
    mu_1 ~ normal(0, 2);
    mu_2 ~ normal(0, 2);
    sigma ~ normal(0, 1);

    w_1 ~ normal(mu_1, sigma);
    w_2 ~ normal(mu_2, sigma);
}

mu <-c(-1, 1)
sigma <- 1

w_1 <- rnorm(NUM_SAMPLES, mean=mu[1], sd=sigma)
w_2 <- rnorm(NUM_SAMPLES, mean=mu[2], sd=sigma)

difference_fit <- stan(file='difference_means.stan', data=list(N=NUM_SAMPLES, w_1=w_1, w_2=w_2))
summary(difference_fit)

$summary

A matrix: 4 × 10 of type dbl
	mean	se_mean	sd	2.5%	25%	50%	75%	97.5%	n_eff	Rhat
mu_1	-1.0106497	0.0004784863	0.03100593	-1.0720868	-1.0305666	-1.0105328	-0.9895498	-0.9509228	4199.043	0.9995809
mu_2	0.9976796	0.0004909118	0.03090297	0.9376764	0.9767728	0.9977505	1.0185254	1.0587181	3962.721	0.9996719
sigma	0.9820112	0.0002207197	0.01584171	0.9512097	0.9714469	0.9818158	0.9922650	1.0131478	5151.362	1.0000500
lp__	-964.0037707	0.0275646936	1.26107968	-967.1965946	-964.5648992	-963.6988244	-963.0975006	-962.5887301	2093.046	1.0043513

$c_summary

difference_params <- extract(difference_fit)
differences <- difference_params$mu_1 - difference_params$mu_2

ggplot(data.frame(x=differences), aes(x)) +
    geom_density(aes(y=after_stat(scaled))) +
    labs(x="Difference", y="Density", title="Distribution of the Difference")

A plot of the empirical distribution of the difference of means, showing it centred around -2 as we expectd.

This is basically what we expected to see - a difference of 2. If we wanted, we could also extract a credible interval from the data.

diff_ci <- quantile(differences, probs=c(0.025, 0.975))
diff_ci

2.5%: -2.09156770868759
97.5%: -1.92164095109226

So our 95% credible interval is roughly -2.1 - -1.9.

Of course, if all we wanted was the difference, we could also just model the difference itself as a normally distributed random variable. Since we’ve done that quite a bit already in this post, I’ll leave that one as an exercise for the reader.

Conclusion

In this post, I’ve gone through a few simple examples of using Stan through the RStan package in order to fit Bayesian models, with the the goal of looking at population differences (e.g. a difference in the means). Personally, I quite like using Stan; everything seems to work in a reasonable way, the documentation is good, and creating and extracting information from the models is straightforward.

Mathsy Coding

Exploring Population Differences Using Bayesian Analysis In R with RStan

Using Summary Statistics

Difference of Means

Conclusion

Further Reading