Estimating uncertainty in the Pfizer vaccine effectiveness

On Nov 20, 2020, the New York Times published an article titled: “New Pfizer Results: Coronavirus Vaccine Is Safe and 95% Effective.” Unfortunately, the article does not report the uncertainty around this number, so we will try to estimate it from the reported data.
 

Assumptions

n <- 4.4e4  # number of volunteers
r_c <- 162  # number of events in control
r_t <- 8    # number of events in vaccine group

NYT reports a 44-thousand-person trial with half of the volunteers assigned to treatment and half to control. They further report that 162 people in the control group developed COVID-19, versus 8 in the vaccine group.

The Pfizer protocol defines vaccine effectiveness as follows:

\[
\text{VE} = 1 - \frac{p_{t}}{p_{c}}
\]
Here \(p_{t}\) is the infection rate in the vaccinated group and \(p_{c}\) is the rate in the control group.
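As a quick sanity check (my own back-of-the-envelope calculation, not something the article reports), the headline 95% figure is presumably just the point estimate implied by the counts above; because the two groups are the same size, the n/2 denominators cancel:

# naive point estimate of VE; the equal group sizes cancel
1 - (r_t / (n/2)) / (r_c / (n/2))  # = 1 - 8/162, about 0.95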

Model

Let’s also assume that we have no prior beliefs about the effectiveness, so our model is as follows: \[
\begin{align}
p_{c} \sim \textsf{beta}(1, 1) \\
p_{t} \sim \textsf{beta}(1, 1) \\
y_{c} \sim \textsf{binomial}(n_{c},p_{c}) \\
y_{t} \sim \textsf{binomial}(n_{t},p_{t}) \\
\end{align}
\]
The treatment effect and \(VE\) can be computed directly from this model.

\[
\begin{align}
\text{effect} = p_{t} - p_{c} \\
\text{VE} = 1 - \frac{p_{t}}{p_{c}}
\end{align}
\]

The effect has a posterior distribution, and to get the probability that there is an effect (this is different from VE), we sum up the negative area under the effect distribution. For this problem we do not strictly need Stan, but I am including it here to show how easy it is to specify this model once we have written down the math above.
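Since the beta prior is conjugate to the binomial likelihood, the posteriors are in fact available in closed form; a minimal sketch of that shortcut in R, using the counts defined above (the seed and number of draws are arbitrary choices), might look like this:

# conjugate shortcut: posterior is beta(1 + events, 1 + non-events)
set.seed(123)
p_c_post <- rbeta(4000, 1 + r_c, 1 + n/2 - r_c)  # control group posterior
p_t_post <- rbeta(4000, 1 + r_t, 1 + n/2 - r_t)  # vaccine group posterior
VE_post  <- 1 - p_t_post / p_c_post
quantile(VE_post, probs = c(0.05, 0.5, 0.95))    # should roughly match the Stan fit below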

data {
  int<lower=1> r_c; // num events, control
  int<lower=1> r_t; // num events, treatment
  int<lower=1> n_c; // num cases, control
  int<lower=1> n_t; // num cases, treatment
  int<lower=1> a;   // prior a for beta(a, b)
  int<lower=1> b;   // prior b for beta(a, b)
}
parameters {
  real<lower=0, upper=1> p_c; // binomial p for control
  real<lower=0, upper=1> p_t; // binomial p for treatment 
}
model {
  p_c ~ beta(a, b); // prior for control
  p_t ~ beta(a, b); // prior for treatment
  r_c ~ binomial(n_c, p_c); // likelihood for control
  r_t ~ binomial(n_t, p_t); // likelihood for treatment
}
generated quantities {
  real effect   = p_t - p_c;      // treatment effect
  real VE       = 1 - p_t / p_c;  // vaccine effectiveness
  real log_odds = log(p_t / (1 - p_t)) -
                  log(p_c / (1 - p_c));
}

Running the model and plotting the results

Let’s run this model from R and make a few plots.

library(cmdstanr)
library(posterior)
library(ggplot2)

# first we get the data ready for Stan
d <- list(r_c = r_c, r_t = r_t, n_c = n/2, 
          n_t = n/2, a = 1, b = 1) # beta(1,1) -> uniform prior

# compile the model
mod <- cmdstan_model("vaccine.stan")
# fit the model with MCMC
fit <- mod$sample(
  data = d,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  refresh = 0
)
## Running MCMC with 4 parallel chains...
## 
## Chain 1 finished in 0.2 seconds.
## Chain 2 finished in 0.2 seconds.
## Chain 3 finished in 0.2 seconds.
## Chain 4 finished in 0.2 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.2 seconds.
## Total execution time: 0.3 seconds.
# extract the draws
draws <- fit$draws()

# Convert to data frame
draws <- posterior::as_draws_df(draws)
head(draws)
## # A draws_df: 6 iterations, 1 chains, and 6 variables
##    lp__    p_c     p_t  effect   VE log_odds
## 1 -1041 0.0072 0.00045 -0.0068 0.94     -2.8
## 2 -1041 0.0075 0.00034 -0.0071 0.95     -3.1
## 3 -1044 0.0084 0.00072 -0.0077 0.91     -2.5
## 4 -1043 0.0065 0.00027 -0.0063 0.96     -3.2
## 5 -1043 0.0068 0.00026 -0.0065 0.96     -3.2
## 6 -1042 0.0078 0.00029 -0.0075 0.96     -3.3
## # ... hidden meta-columns {'.chain', '.iteration', '.draw'}
# look at the distribution of the effect
ggplot(draws, aes(x = effect*1e4)) +
  geom_density(fill = "blue", alpha = .2) +
  expand_limits(y = 0) + theme_minimal() +
  xlab("Size of the effect") +
    ggtitle("Reduction in infections on treatment per 10,000
people")

# Probability that there is an effect is the negative mass 
# of the effect distribution; more negative favors treatment
# -- there is no question that there is an effect
round(mean(draws$effect < 0), 2)
## [1] 1
ggplot(draws, aes(x = log_odds)) +
  geom_density(fill = "blue", alpha = .2) +
  expand_limits(y = 0) + theme_minimal() +
  xlab("Log odds") + 
  ggtitle("Log odds of the treatment effect.
More negative, less likely to get infected on treatment")

label_txt <- paste("median =", round(median(draws$VE), 2))
ggplot(draws, aes(x = VE)) +
  geom_density(fill = "blue", alpha = .2) +
  expand_limits(y = 0) + theme_minimal() +
  geom_vline(xintercept = median(draws$VE), size = 0.2) +
  annotate("text", x = 0.958, y = 10, label = label_txt, size = 3) +
  xlab("Vaccine effectiveness") +
  ggtitle("Pfizer study protocol defines VE = 1 - Pt/Pc")

quant <- round(quantile(draws$VE, probs = c(0.05, 0.5, 0.95)), 2)

So if we observe 162 infections in the placebo group and 8 in the vaccinated group, the vaccine can be considered to be about 0.95 effective, with the likely effectiveness anywhere between 0.91 and 0.97; these are the posterior median and the 90% quantile interval. We should insist that reporters produce uncertainty estimates, and reporters, in turn, should insist that companies provide them.

Diving into Bayes with Stan

In 2012, I wrote a post about how to learn applied statistics without going to grad school. I still think that one does not have to spend a large amount of money to acquire the skills necessary for data analysis. What has changed for me personally is that I am finding traditional statistical methods, call them classical or frequentist, or evolved classical based on the Stanford Statistical Learning school or whatever, somewhat unsatisfying.

These generally rely on maximum likelihood estimation (MLE) to generate point estimates and on asymptotic properties of estimators to come up with confidence intervals. One of the main issues I have with this approach has nothing to do with MLE versus Bayesian full posterior per se. It has to do with the fact that the likelihood function is largely hidden from my view, although there are lots of other issues, some of which I hope to discuss when my understanding sufficiently progresses. I am getting too comfortable just running glm(), ok, not glm() since there is no regularization there, but say glmnet or Random Forest or even bayesglm in R. The latter is of course Bayesian, but still a black box.

I am not sure at this point if I am ready to abandon all the mathematical and algorithmic machinery of Lasso, Random Forests, Gradient Boosting Machines, and so on, but I would like to spend more time thinking and expressing models directly rather than running and tuning abstract algorithms. I am also quite certain I don’t want to write my own model fitting, sampling, and optimization procedures.

Since I would like to approach this problem in a Bayesian way, my goal is to get to the distribution of the parameter vector \(\theta\) given data \(y\), \(p(\theta | y)\), the posterior. In the Bayesian framework, we still work with the likelihood function \(p(y | \theta)\), but we are not trying to find some unique set of parameter values for which it is maximum (i.e., under which \(y\) is most likely). Instead, we want a complete picture of the uncertainty in our parameters that is supported by the data \(y\), our choice of the model (i.e. likelihood, which as Andrew Gelman likes to point out is a form of prior knowledge), and our knowledge about the parameters (the prior distribution), without relying on asymptotic properties of estimators. In short:

\[
p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta)
\]

Getting from prior to posterior is hard work unless they happen to be in the same family, which is rarely the case in the wild. The natural question then is where to start. Short of coding everything from scratch, which would be a very long project even if I knew how to do it, two types of tools are in order: a probabilistic language capable of expressing models, parameters, priors, and their relationships, and an MCMC sampler that can get us to the posterior distributions numerically. For a while, the best bet was some flavor of the BUGS language, which uses Gibbs sampling. But the state of the art has moved away from Gibbs sampling. All the cool kids these days are playing with Stan, which uses the more efficient Hamiltonian Monte Carlo with the NUTS sampler and supports a broader set of models.

To get a jump start on Stan programming, I recently attended a class on Bayesian Inference with Stan taught by Andrew Gelman, Bob Carpenter, and Daniel Lee (thanks to Jared Lander for organizing the class.) I learned a lot and I hope to continue my exploration into Stan and Bayes.

* Thanks to Bob Carpenter for looking over the draft of this post and providing helpful comments.

First Two Weeks of Writing

Jacki and I just submitted the first two chapters to our publisher, so I would like to summarize early lessons learned (actually, we submitted one chapter, but the editor decided to break it in half, a decision that we fully support). The chapters include material on programming style (from R’s point of view), an introduction to functions and functional programming, some information on S4 classes mostly from the user’s perspective, vectorizing code, debugging, and various methods of data access including web scraping and the Twitter API.

First, the obvious. We underestimated the amount of time required to produce the content. No surprises there.

We spent too much time wrestling with the outline.  Outlining seems to work well when I know my own writing style, but not so well otherwise.  At some (earlier) point we should have just started writing and figured out the detailed chapter structure as we went along.  I suspect this will change as we get deeper into the material, but only time will tell.

What does need to be planned is the immediate section.  For me it helps to have all the code written and all the visuals produced prior to starting writing.  When I tried writing code on the fly, I struggled to make any meaningful progress.

Lastly, it would have really helped if we read each other’s sections more carefully both in terms of synchronizing content and writing style.  I hope that the final product does not read like the book was written by two people.

Onto Chapter 2.

 

Getting Ready to Write a Book


My co-author, Jacki Buros, and I have just signed a contract with Apress to write a book tentatively entitled “Predictive Analytics with R”, which will cover programming best practices, data munging, data exploration, and single and multi-level models with case studies in social media, healthcare, politics, marketing, and the stock market.

Why does the world need another R book?  We think there is a shortage of books that deal with the complete and programmer-centric analysis of real, dirty, and sometimes unstructured data.  Our target audience is people who have some familiarity with statistics but do not have much experience with programming.  Why did we not call the book Data Science blah, blah, blah…?  Because Rachel and the Mathbabe already grabbed that title! (ok, kidding)

The book is projected to be about 300 pages across 8 chapters. This is my first experience with writing a book and everything I heard about the process tells me that this is going to be a long and arduous endeavor lasting anywhere from 6 to 8 months.  While undertaking a project of this size, I am sure there will be times when I will feel discouraged, overwhelmed, and emotionally and physically exhausted.  What better vehicle for coping with these feelings than writing about them! (this is the last exclamation point in this post, promise.)

So this is my first post of what I hope will become my personal diary detailing the writing process.  Here is the summary of the events thus far.

  • A publisher contacted me on LinkedIn and asked if I wanted to write a book.
  • Jacki and I wrote a proposal describing our target market, competition, and sales estimates based on comparables.  We developed an outline and detailed description of each section.
  • We submitted our proposal (to the original publisher and two other publishers) and received an approval to publish the book from Apress’ editorial board. (Apress was not the original publisher.  More on that process after the book is complete.)

We set up a tracking project on Trello (thanks Joel and the Trello team), created a task for every chapter, and included a detailed checklist for each task.

We have not completed all of the data analysis required for the book, so this is going to be an exercise in model building as well as in writing.  If you have any advice about how to make the writing process better or if you think we are batshit crazy, please, post in the comments.

I hope to write a book that we can be proud of.  We have a great editorial team and a technical reviewer who is kind of a legend in the R/S world.  They will remain anonymous for now, but their identities will be revealed as soon as they give me permission to do so.

I am looking forward to learning about the writing process, about statistics, and about myself.  Let the journey begin.

To plot or to ggplot, that is not the question

Producing informative and aesthetically pleasing quantitative visualizations is hard work.  Any tool or library that helps me with this task is worth considering.  Since I do most of my work in R, I have a choice between plot, the default base graphics function, the more powerful lattice package, and ggplot, which is based on the Grammar of Graphics.

There is usually a tradeoff between the expressiveness of the grammar and the learning curve necessary to master it. I have recently invested 3 days of my life learning the ins and outs of ggplot, and I have to say that it has been most rewarding.

The fundamental difference between plot and ggplot is that in plot you manipulate graphical elements directly using predefined functions, whereas in ggplot you build the plot one layer at a time and can supply your own functions. That said, you can do quite a bit (but not everything) with a function called qplot, which abstracts the layering away from the user and works similarly to plot.  And therefore qplot is exactly where you want to start when upgrading from plot.

To demonstrate, the following R code uses the built-in plot function to partly visualize the famous iris dataset, which contains Sepal and Petal measurements for three species of iris.

par (mar=c(3,3,2,1), mgp=c(2,.7,0), tck=-.012, las=1)
with(iris, plot(Sepal.Length, Sepal.Width, col=as.numeric(Species)+1, pch=20))
lbs = levels(iris$Species)
legend('topright', legend=lbs, 
       col=2:4, cex=0.7, pch=20, box.lwd=0.5, pt.cex=0.6)

One of the problems with plot is that the default plotting options are poorly chosen, so the first line of code fixes the margins, the tick marks, and the orientation of the y-axis tick labels.  The argument col=as.numeric(Species)+1 offsets the colors so they start at red rather than the default black.  Type palette() at the R prompt to see the default color vector.

The last complication is that plot does not draw the legend for you; it must be specified by hand.  And so, if you run the above code in R, you should get the following output.

It took a little bit of work, but the output looks pretty good.  Following is the equivalent task using ggplot’s qplot function.

qplot(Sepal.Length, Sepal.Width, data = iris, colour = Species, xlim=c(4,8))

As you can see, ggplot chooses a lot more sensible defaults and in this particular case, the interface for specifying the intent of the user is very simple and intuitive.

A final word of caution.  Just as a skier who sticks to blue and green slopes is in danger of never making it out of intermediate hell, the qplot user who never ventures further will never truly master the grammar of graphics.  For those who dare to use the much more expressive ggplot(…) function, the rewards are well worth the effort.
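To give a flavor of the full grammar, here is a sketch of the same iris scatterplot built layer by layer with ggplot(); the particular layers and limits are just my choices, not the only way to express it:

library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point() +   # points layer, coloured by species
  xlim(4, 8)       # same x limits as the qplot version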


A Better Way to Learn Applied Statistics, Got Zat? (Part 2)

In the second semester of grad school, I remember sitting in a Statistical Inference class watching a very Russian-sounding instructor fast-forward through an overhead-projected PDF document filled with numbered equations and occasionally making comments like: “Vell, ve take zis eqazion on ze top and ve substitude it on ze butom, and zen it verk out.  Do you see zat?”  I did not see zat.  I don’t think many people saw zat.

In case I come off as an intolerant immigrant-hater, let me assure you that as an immigrant from the former Soviet bloc, I have all due respect for the very bright Russian and non-Russian scientists who came to the United States to seek intellectual and other freedoms.  But this post is not about immigration, which incidentally is in need of serious reform.  This is about an important subject which, on average, is not being taught very well.

This is hardly news, but many courses in statistics are taught by very talented statisticians who have no aptitude for, or interest in, teaching. But poor instructors are not the only problem.  These courses are part of an institution, an institution that is no longer in the business of providing education.  Universities predominantly sell accreditation to students, and research to (mostly) the federal government.  While I believe that government-sponsored research should be a foundation of modern society, it does not have to be delivered within the confines of a teaching institution.  And a university diploma, even from a top school (i.e. accreditation), is at best a proxy for your knowledge and capabilities.  For example, if you are a software engineer, Stack Overflow and GitHub provide much more direct evidence of your abilities.

With the cost of higher education skyrocketing, it is reasonable to ask whether a traditional university education is still relevant.  I am not sure about medicine, but in statistics the answer is a resounding ‘No.’  Unless you want to be a professor.  But chances are you will not be a professor, even if you get your coveted Ph.D.

So for all of you aspiring Data Geeks, I put together the list below of online classes, books, and community/Q&A sites that completely bypass the traditional channels. And if you really want to go to school, most universities will allow you to audit classes, so that is always an option. Got Zat?

Programming
  • Online classes: Computer Science courses at Udacity, currently Introduction to Computer Science, Logic and Discrete Mathematics (great preparation for Probability), Programming Languages, Design of Computer Programs, and Algorithms. For a highly interactive experience, try Codecademy.
  • Books: How to Think Like a Computer Scientist (Allen B. Downey); Code Complete (Steve McConnell)
  • Community / Q&A: Stack Overflow

Foundational Math
  • Online classes: Single Variable Calculus course on Coursera (they are adding others; check that site often); Khan Academy Linear Algebra series; Khan Academy Calculus series (including multivariate); Gilbert Strang’s Linear Algebra course
  • Books: Intro to Linear Algebra (Gilbert Strang); Calculus, an Intuitive and Physical Approach (Morris Kline)
  • Community / Q&A: Math Overflow

Intro to Probability and Statistics
  • Online classes: Statistics One from Coursera, which includes an introduction to the R language; Introduction to Statistics from Udacity
  • Books: Stats: Data and Models (Richard De Veaux)
  • Community / Q&A: Cross Validated, which tends to be more advanced

Probability and Statistical Theory
  • Online classes: It is very lonely here…
  • Books: Introduction to Probability Models (Sheldon Ross); Statistical Inference (Casella and Berger)
  • Community / Q&A: Cross Validated

Applied and Computational Statistics
  • Online classes: Machine Learning from Coursera; the Statistics and Data Analysis curriculum from Coursera
  • Books: Statistical Sleuth (Ramsey and Schafer); Data Analysis Using Regression and Multilevel Models (Gelman); Pattern Recognition and Machine Learning (Chris Bishop); Elements of Statistical Learning (Hastie, Tibshirani, Friedman)
  • Community / Q&A: Stack Overflow, especially under the R tag; the New York Open Statistical Programming Meetup (try searching Meetups in your city)

Bayesian Statistics
  • Online classes: Not to my knowledge, but check the above-mentioned sites
  • Books: Bayesian Data Analysis (Gelman); Doing Bayesian Data Analysis (Kruschke)
  • Community / Q&A: I don’t know of any specialized sites for this