Diving into Bayes with Stan

primateIn 2012, I wrote a post about how to learn applied statistics without going to grad school. I still think that one does not have to spend a large amount of money to acquire the skills necessary for data analysis. What has changed for me personally is that am finding traditional statistical methods, call them classical or frequentist, or evolved classical based on the Stanford Statistical Learning school or whatever, somewhat unsatisfying.

These generally rely on maximum likelihood estimation (MLE) to generate point estimates and asymptotic properties of estimators to come up with confidence intervals. One of the main issues I have with this approach has nothing to do with MLE versus Bayesian full posterior per se. It has something to do with the fact that the Likelihood function is largely hidden from my view, although there are lots of others issues, some of which I hope to discuss when my understanding sufficiently progresses.  I am getting too comfortable just running glm(), ok, not glm() since there is no regularization there, but say glmnet or Random Forest or even bayesglm in R.  The latter is of course Bayesian, but still a black box.

I am not sure at this point if I am ready to abandon all the mathematical and algorithmic machinery of Lasso, Random Forests, Gradient Boosting Machines, and so on, but I would like to spend more time thinking and expressing models directly rather than running and tuning abstract algorithms. I am also quite certain I don’t want to write my own model fitting, sampling, and optimization procedures.

Since I would like to approach this problem in a Bayesian way, it also means that my goal is to get to the distribution of the parameter vector \(\theta\) given data \(y\), \(p(\theta | y)\), the posterior. In the Bayesian framework, we still work with the likelihood function \(p(y | \theta)\), but we are not trying to find some unique set of parameter values for which it is maximum (i.e. under which y are most likely.) Instead we want a complete picture of the uncertainty in our parameters that is supported by the data \(y\), our choice of the model (i.e. likelihood, which as Andrew Gelman likes to point out is a form of prior knowledge) and knowledge about the parameters (prior distribution) without relying on asymptotic properties of estimators. In short:

bayes

Getting from prior to posterior is hard work unless they happen to be in the same family, which is rarely the case in the wild. The natural question then is where to start. Short of coding everything from scratch, which would be a very long project, even if I knew how to do it, two types of tools are in order: a probabilistic language capable of expressing models, parameters, priors, and their relationships and an MCMC sampler that can get us to the posterior distributions numerically. For a while, the best bet was some flavor of the BUGS language which uses Gibbs. But the state of the art has moved away from Gibbs sampling.  All the cool kids these days are playing with Stan which uses a more efficient, Hamiltonian MCMC with NUTS sampler and supports a broader set of models.

To get a jump start on Stan programming, I recently attended a class on Bayesian Inference with Stan taught by Andrew Gelman, Bob Carpenter, and Daniel Lee (thanks to Jared Lander for organizing the class.) I learned a lot and I hope to continue my exploration into Stan and Bayes.

* Thanks to Bob Carpenter for looking over the draft of this post and providing helpful comments.