Estimating uncertainty in the Pfizer vaccine effectiveness

On Nov 20, 2020, the New York Times published an article titled: “New Pfizer Results: Coronavirus Vaccine Is Safe and 95% Effective.” Unfortunately, the article does not report the uncertainty in this probability and so we will try to estimate it from data.
 

Assumptions

n <- 4.4e4  # number of volunteers
r_c <- 162  # number of events in control
r_t <- 8    # number of events in vaccine group

NYT reports a 44 thousand person trial with half of the people going to treatment and half to control. They further report that 162 people developed COVID in the control group and 8 were in the vaccine group.

The Pfizer protocol defines vaccine effectiveness as follows:

\[
\text{VE} = 1 – \frac{p_{t}}{p_{c}}
\]
Here \(p_{t}\) is infection rate in vaccinated group and \(p_{c}\) is the rate in the control group.

Model

Also, let’s assume that we have no prior beliefs about the effectiveness rate and so our model is as follows: \[
\begin{align}
p_{c} \sim \textsf{beta}(1, 1) \\
p_{t} \sim \textsf{beta}(1, 1) \\
y_{c} \sim \textsf{binomial}(n_{c},p_{c}) \\
y_{t} \sim \textsf{binomial}(n_{t},p_{t}) \\
\end{align}
\]
The treatment effect and \(VE\) can be computed directly from this model.

\[
\begin{align}
\text{effect} = p_{t} – p_{c} \\
\text{VE} = 1 – \frac{p_{t}}{p_{c}}
\end{align}
\]

The effect will have a distribution and to get the probability of the effect (this is different from VE), we sum up the negative area under the effect distribution. For this problem, we do not need Stan, but I am including it here to show how easy it is to specify this model, once we write down the math above.

data {
  int<lower=1> r_c; // num events, control
  int<lower=1> r_t; // num events, treatment
  int<lower=1> n_c; // num cases, control
  int<lower=1> n_t; // num cases, treatment
  int<lower=1> a;   // prior a for beta(a, b)
  int<lower=1> b;   // prior b for beta(a, b)
}
parameters {
  real<lower=0, upper=1> p_c; // binomial p for control
  real<lower=0, upper=1> p_t; // binomial p for treatment 
}
model {
  p_c ~ beta(a, b); // prior for control
  p_t ~ beta(a, b); // prior for treatment
  r_c ~ binomial(n_c, p_c); // likelihood for control
  r_t ~ binomial(n_t, p_t); // likelihood for treatment
}
generated quantities {
  real effect   = p_t - p_c;      // treatment effect
  real VE       = 1 - p_t / p_c;  // vaccine effectiveness
  real log_odds = log(p_t / (1 - p_t)) -
                  log(p_c / (1 - p_c));
}

Running the model and plotting the results

Let’s run this model from R and make a few plots.

library(cmdstanr)
library(posterior)
library(ggplot2)

# first we get the data ready for Stan
d <- list(r_c = r_c, r_t = r_t, n_c = n/2, 
          n_t = n/2, a = 1, b = 1) # beta(1,1) -> uniform prior

# compile the model
mod <- cmdstan_model("vaccine.stan")
# fit the model with MCMC
fit <- mod$sample(
  data = d,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  refresh = 0
)
## Running MCMC with 4 parallel chains...
## 
## Chain 1 finished in 0.2 seconds.
## Chain 2 finished in 0.2 seconds.
## Chain 3 finished in 0.2 seconds.
## Chain 4 finished in 0.2 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.2 seconds.
## Total execution time: 0.3 seconds.
# extract the draws
draws <- fit$draws()

# Convert to data frame
draws <- posterior::as_draws_df(draws)
head(draws)
## # A draws_df: 6 iterations, 1 chains, and 6 variables
##    lp__    p_c     p_t  effect   VE log_odds
## 1 -1041 0.0072 0.00045 -0.0068 0.94     -2.8
## 2 -1041 0.0075 0.00034 -0.0071 0.95     -3.1
## 3 -1044 0.0084 0.00072 -0.0077 0.91     -2.5
## 4 -1043 0.0065 0.00027 -0.0063 0.96     -3.2
## 5 -1043 0.0068 0.00026 -0.0065 0.96     -3.2
## 6 -1042 0.0078 0.00029 -0.0075 0.96     -3.3
## # ... hidden meta-columns {'.chain', '.iteration', '.draw'}
# look at the distribution of the effect
ggplot(draws, aes(x = effect*1e4)) +
  geom_density(fill = "blue", alpha = .2) +
  expand_limits(y = 0) + theme_minimal() +
  xlab("Size of the effect") +
    ggtitle("Reduciton in infections on treatment per 10,000
people")

# Probability that there is an effect is the negative mass 
# of the effect distribution; more negative favors treatment
# -- there is no question that there is an effect
round(mean(draws$effect < 0), 2)
## [1] 1
ggplot(draws, aes(x = log_odds)) +
  geom_density(fill = "blue", alpha = .2) +
  expand_limits(y = 0) + theme_minimal() +
  xlab("Log odds") + 
  ggtitle("Log odds of the treatment effect.
More negative, less likely to get infected on treatment")

label_txt <- paste("median =", round(median(draws$VE), 2))
ggplot(draws, aes(x = VE)) +
  geom_density(fill = "blue", alpha = .2) +
  expand_limits(y = 0) + theme_minimal() +
  geom_vline(xintercept = median(draws$VE), size = 0.2) +
  annotate("text", x = 0.958, y = 10, label = label_txt, size = 3) +
  xlab("Vaccine effectiveness") +
  ggtitle("Pfizer study protocol defines VE = 1 - Pt/Pc")

quant <- round(quantile(draws$VE, probs = c(0.05, 0.5, 0.95)), 2)

So if we observe 162 infections in the placebo group and 8 in the vaccinated group, the vaccine could be considered to be about 0.95 effective with the likely effectiveness anywhere between 0.91 and 0.97 which represents the median and the 90% quantile interval. We should insist that reporters produce uncertainty estimates and the reporters, in turn, should insist that companies provide them.