Learning Bayes from books and online classes

In 2012 I wrote a couple of posts on how to learn statistics without going to grad school. Re-reading it now, it still seems like pretty good advice, although it’s a bit too machine learning and Coursera heavy for my current tastes. One annoying gap at the time was the lack of online resources for learning Bayesian statistics. This is no longer the case, and so here are my top three resources for learning Bayes.

Richard McElreath from the Max Planck Institute for Evolutionary Anthropology recently published the second edition of Statistical Rethinking. In the book, he builds up to inference from probability and first principles and assumes only a basic background in math. I don’t love the obscure chapter names (they make it hard to figure out what’s inside), but this is the kind of book I wish I had when I was learning statistics. The example code has been ported to lots of languages and frameworks, including Stan, PyMC3, Julia, and more. Richard is currently teaching a class called “Statistical Rethinking: A Bayesian Course” with all the materials, including lecture videos, available on GitHub. For updated videos, check out his YouTube channel.

Aki Vehtari from Aalto University in Finland released his popular Bayesian Data Analysis course online — you can now take it at your own pace. This course uses the 3rd edition of the Bayesian Data Analysis book, available for free in PDF form. This is probably the most comprehensive Bayesian course on the Internet today — his demos in R and Python, lecture notes, and videos are all excellent. I highly recommend it.

For those of us who learned statistics the wrong way or who want to see the comparison to frequentist methods, see Ben Lambert’s “A Student’s Guide to Bayesian Statistics.” His corresponding YouTube lectures are excellent and I refer to them often.

Although not explicitly focused on Bayesian inference, Regression and Other Stories by Andrew Gelman, Jennifer Hill, and Aki Vehtari is a great book on how to build up and evaluate common regression models using Bayesian software (the rstanarm package). The book covers causal inference, which is an unusual and welcome addition to an applied regression book. It does not cover hierarchical models, which are deferred to the not-yet-released “Applied Regression and Multilevel Models.” All the code examples are available on Aki’s website. Aki also has a list of his favorite statistics books.

Finally, I would be remiss not to mention my favorite probability book called “Introduction to Probability” by Joe Blitzstein. The book is available for free in PDF form. Joe has a corresponding class on the EdX platform and his lecture series on YouTube kept me on my spin bike for many morning hours. Another great contribution from team Joe (compiled by William Chen and Joe Blitzstein, with contributions from Sebastian Chiu, Yuan Jiang, Yuqi Hou, and Jessy Hwang) is the probability cheat sheet, currently in its second edition.

What are your favorite Bayesian resources on the Internet? Let us know in the comments.

Estimating uncertainty in the Pfizer vaccine effectiveness

On Nov 20, 2020, the New York Times published an article titled “New Pfizer Results: Coronavirus Vaccine Is Safe and 95% Effective.” Unfortunately, the article does not report the uncertainty around this estimate, so we will try to estimate it from the data.

Assumptions

n <- 4.4e4  # number of volunteers
r_c <- 162  # number of events in control
r_t <- 8    # number of events in vaccine group

NYT reports a 44-thousand-person trial with half of the volunteers assigned to treatment and half to control. They further report that 162 people in the control group developed COVID, versus 8 in the vaccine group.

The Pfizer protocol defines vaccine effectiveness as follows:

\[
\text{VE} = 1 - \frac{p_{t}}{p_{c}}
\]
Here \(p_{t}\) is the infection rate in the vaccinated group and \(p_{c}\) is the rate in the control group.
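Since both groups are the same size, the group sizes cancel in the ratio and a naive point estimate of VE follows directly from the event counts. A quick Python check (the variable names are mine, matching the assumptions above):

```python
# Naive point estimate of vaccine effectiveness from the raw counts.
# With equal group sizes (n/2 in each arm), the denominators cancel:
# VE = 1 - p_t / p_c = 1 - (r_t / (n/2)) / (r_c / (n/2)) = 1 - r_t / r_c
r_c = 162  # events in the control group
r_t = 8    # events in the vaccine group

ve_hat = 1 - r_t / r_c
print(round(ve_hat, 3))  # 0.951
```

This matches the headline 95% figure, but it is only a point estimate; the model below gives us the uncertainty around it.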

Model

Let’s also assume that we have no prior beliefs about the effectiveness rate, so our model is as follows: \[
\begin{align}
p_{c} \sim \textsf{beta}(1, 1) \\
p_{t} \sim \textsf{beta}(1, 1) \\
y_{c} \sim \textsf{binomial}(n_{c},p_{c}) \\
y_{t} \sim \textsf{binomial}(n_{t},p_{t}) \\
\end{align}
\]
The treatment effect and \(VE\) can be computed directly from this model.

\[
\begin{align}
\text{effect} = p_{t} - p_{c} \\
\text{VE} = 1 - \frac{p_{t}}{p_{c}}
\end{align}
\]

The effect has a distribution, and to get the probability that there is an effect (this is different from VE), we sum up the negative area under the effect distribution. For this problem we do not need Stan, but I am including it here to show how easy it is to specify this model once we have written down the math above.

data {
  int<lower=1> r_c; // num events, control
  int<lower=1> r_t; // num events, treatment
  int<lower=1> n_c; // num cases, control
  int<lower=1> n_t; // num cases, treatment
  int<lower=1> a;   // prior a for beta(a, b)
  int<lower=1> b;   // prior b for beta(a, b)
}
parameters {
  real<lower=0, upper=1> p_c; // binomial p for control
  real<lower=0, upper=1> p_t; // binomial p for treatment 
}
model {
  p_c ~ beta(a, b); // prior for control
  p_t ~ beta(a, b); // prior for treatment
  r_c ~ binomial(n_c, p_c); // likelihood for control
  r_t ~ binomial(n_t, p_t); // likelihood for treatment
}
generated quantities {
  real effect   = p_t - p_c;      // treatment effect
  real VE       = 1 - p_t / p_c;  // vaccine effectiveness
  real log_odds = log(p_t / (1 - p_t)) -
                  log(p_c / (1 - p_c));
}
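As a sanity check on the Stan model: because the beta prior is conjugate to the binomial likelihood, the posteriors here are available in closed form. With a beta(1, 1) prior, the posterior for each rate is \(\textsf{beta}(1 + r,\, 1 + n - r)\), so we can sample them directly. Here is a minimal Python sketch (mine, not from the original analysis) using only the standard library, with the counts from the assumptions above:

```python
import random
import statistics

random.seed(123)

n_c = n_t = 22_000  # half of the 44k volunteers in each arm
r_c, r_t = 162, 8   # observed events

# With a beta(1, 1) prior, the posterior for each infection rate is
# beta(1 + events, 1 + non-events); draw from both and compute VE.
draws = 10_000
ve = []
for _ in range(draws):
    p_c = random.betavariate(1 + r_c, 1 + n_c - r_c)
    p_t = random.betavariate(1 + r_t, 1 + n_t - r_t)
    ve.append(1 - p_t / p_c)

print(round(statistics.median(ve), 2))  # ~0.95
```

The posterior median should land near the reported 95% effectiveness, in agreement with the Stan fit below.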

Running the model and plotting the results

Let’s run this model from R and make a few plots.

library(cmdstanr)
library(posterior)
library(ggplot2)

# first we get the data ready for Stan
d <- list(r_c = r_c, r_t = r_t, n_c = n/2, 
          n_t = n/2, a = 1, b = 1) # beta(1,1) -> uniform prior

# compile the model
mod <- cmdstan_model("vaccine.stan")
# fit the model with MCMC
fit <- mod$sample(
  data = d,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  refresh = 0
)
## Running MCMC with 4 parallel chains...
## 
## Chain 1 finished in 0.2 seconds.
## Chain 2 finished in 0.2 seconds.
## Chain 3 finished in 0.2 seconds.
## Chain 4 finished in 0.2 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.2 seconds.
## Total execution time: 0.3 seconds.
# extract the draws
draws <- fit$draws()

# Convert to data frame
draws <- posterior::as_draws_df(draws)
head(draws)
## # A draws_df: 6 iterations, 1 chains, and 6 variables
##    lp__    p_c     p_t  effect   VE log_odds
## 1 -1041 0.0072 0.00045 -0.0068 0.94     -2.8
## 2 -1041 0.0075 0.00034 -0.0071 0.95     -3.1
## 3 -1044 0.0084 0.00072 -0.0077 0.91     -2.5
## 4 -1043 0.0065 0.00027 -0.0063 0.96     -3.2
## 5 -1043 0.0068 0.00026 -0.0065 0.96     -3.2
## 6 -1042 0.0078 0.00029 -0.0075 0.96     -3.3
## # ... hidden meta-columns {'.chain', '.iteration', '.draw'}
# look at the distribution of the effect
ggplot(draws, aes(x = effect*1e4)) +
  geom_density(fill = "blue", alpha = .2) +
  expand_limits(y = 0) + theme_minimal() +
  xlab("Size of the effect") +
    ggtitle("Reduction in infections on treatment per 10,000 people")

# Probability that there is an effect is the negative mass 
# of the effect distribution; more negative favors treatment
# -- there is no question that there is an effect
round(mean(draws$effect < 0), 2)
## [1] 1
ggplot(draws, aes(x = log_odds)) +
  geom_density(fill = "blue", alpha = .2) +
  expand_limits(y = 0) + theme_minimal() +
  xlab("Log odds") + 
  ggtitle("Log odds of the treatment effect.
More negative, less likely to get infected on treatment")

label_txt <- paste("median =", round(median(draws$VE), 2))
ggplot(draws, aes(x = VE)) +
  geom_density(fill = "blue", alpha = .2) +
  expand_limits(y = 0) + theme_minimal() +
  geom_vline(xintercept = median(draws$VE), size = 0.2) +
  annotate("text", x = 0.958, y = 10, label = label_txt, size = 3) +
  xlab("Vaccine effectiveness") +
  ggtitle("Pfizer study protocol defines VE = 1 - Pt/Pc")

quant <- round(quantile(draws$VE, probs = c(0.05, 0.5, 0.95)), 2)

So if we observe 162 infections in the placebo group and 8 in the vaccinated group, the vaccine can be considered to be about 0.95 effective, with the likely effectiveness anywhere between 0.91 and 0.97 (the posterior median and the 90% quantile interval). We should insist that reporters produce uncertainty estimates, and the reporters, in turn, should insist that companies provide them.

Diving into Bayes with Stan

In 2012, I wrote a post about how to learn applied statistics without going to grad school. I still think that one does not have to spend a large amount of money to acquire the skills necessary for data analysis. What has changed for me personally is that I am finding traditional statistical methods (call them classical, frequentist, or evolved classical in the style of the Stanford Statistical Learning school) somewhat unsatisfying.

These methods generally rely on maximum likelihood estimation (MLE) to generate point estimates and on asymptotic properties of estimators to come up with confidence intervals. One of my main issues with this approach has nothing to do with MLE versus the Bayesian full posterior per se; it is that the likelihood function is largely hidden from my view, although there are lots of other issues, some of which I hope to discuss when my understanding sufficiently progresses. I am getting too comfortable just running glm() (ok, not glm(), since there is no regularization there, but say glmnet, Random Forest, or even bayesglm in R). The latter is of course Bayesian, but still a black box.

I am not sure at this point if I am ready to abandon all the mathematical and algorithmic machinery of Lasso, Random Forests, Gradient Boosting Machines, and so on, but I would like to spend more time thinking and expressing models directly rather than running and tuning abstract algorithms. I am also quite certain I don’t want to write my own model fitting, sampling, and optimization procedures.

Since I would like to approach this problem in a Bayesian way, my goal is to get to the distribution of the parameter vector \(\theta\) given data \(y\), \(p(\theta | y)\), the posterior. In the Bayesian framework, we still work with the likelihood function \(p(y | \theta)\), but we are not trying to find some unique set of parameter values for which it is maximized (i.e., under which \(y\) is most likely). Instead, we want a complete picture of the uncertainty in our parameters that is supported by the data \(y\), our choice of the model (i.e., the likelihood, which as Andrew Gelman likes to point out is a form of prior knowledge), and our knowledge about the parameters (the prior distribution), without relying on asymptotic properties of estimators. In short:

\[
p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)
\]

Getting from prior to posterior is hard work unless they happen to be in the same family, which is rarely the case in the wild. The natural question then is where to start. Short of coding everything from scratch, which would be a very long project even if I knew how to do it, two types of tools are in order: a probabilistic language capable of expressing models, parameters, priors, and their relationships, and an MCMC sampler that can get us to the posterior distributions numerically. For a while, the best bet was some flavor of the BUGS language, which uses Gibbs sampling. But the state of the art has moved away from Gibbs sampling. All the cool kids these days are playing with Stan, which uses a more efficient Hamiltonian Monte Carlo with the NUTS sampler and supports a broader set of models.
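The beta-binomial pair is the canonical example of such a conjugate family: if the prior on a proportion \(p\) is \(\textsf{beta}(a, b)\) and we observe \(y\) successes in \(n\) trials, the posterior is again a beta distribution, updated by simple counting,

\[
p \sim \textsf{beta}(a, b), \quad y \sim \textsf{binomial}(n, p) \;\Rightarrow\; p \mid y \sim \textsf{beta}(a + y,\, b + n - y).
\]

Outside of a handful of such pairs, the posterior has no closed form, which is where samplers come in.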

To get a jump start on Stan programming, I recently attended a class on Bayesian Inference with Stan taught by Andrew Gelman, Bob Carpenter, and Daniel Lee (thanks to Jared Lander for organizing the class.) I learned a lot and I hope to continue my exploration into Stan and Bayes.

* Thanks to Bob Carpenter for looking over the draft of this post and providing helpful comments.