A lot of things do not give you normality even when n is big. Simple examples are logistic regression with rare events or dynamic models.

Also, would you like to enumerate several frequentist methods that are “significantly” more advanced than the “partition and combine” Bayesian algorithms?

First, frequentists are taking the same “partition and combine” approach for big data, for example using an “average” or “median” scheme to combine estimates from subsets.
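
The “partition and combine” idea can be sketched in a few lines. This is a minimal toy (estimating a mean; the data, subset count, and names are all illustrative), but it shows the mechanics: fit each subset independently, then combine by averaging or by the more robust median:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=100_000)

# Partition the data into disjoint subsets (e.g. one per machine).
subsets = np.array_split(data, 10)

# Compute an estimate on each subset independently (embarrassingly parallel).
subset_estimates = np.array([s.mean() for s in subsets])

# Combine: the "average" scheme and the more robust "median" scheme.
combined_avg = subset_estimates.mean()
combined_med = np.median(subset_estimates)

print(combined_avg, combined_med)
```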

Second, within-iteration parallel optimization does not always work well. For example, ADMM can need many more iterations to converge, which is likely to kill the savings from parallelization.
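
For concreteness, here is a toy consensus-ADMM sketch for distributed least squares (all dimensions, the penalty parameter rho, and the iteration count are illustrative choices, not a tuned implementation). The x-updates parallelize across data blocks; the z-update is the communication step; whether the iteration count kills the parallel savings is exactly the point debated above:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_blocks, n_per = 5, 4, 200
x_true = rng.normal(size=d)
A = [rng.normal(size=(n_per, d)) for _ in range(n_blocks)]
b = [Ai @ x_true + 0.01 * rng.normal(size=n_per) for Ai in A]

rho = 1.0
x = [np.zeros(d) for _ in range(n_blocks)]   # local solutions
u = [np.zeros(d) for _ in range(n_blocks)]   # scaled dual variables
z = np.zeros(d)                              # consensus variable

for _ in range(50):
    # x-update: each block solves its own regularized least-squares
    # problem (this is the step that parallelizes across machines).
    for i in range(n_blocks):
        x[i] = np.linalg.solve(A[i].T @ A[i] + rho * np.eye(d),
                               A[i].T @ b[i] + rho * (z - u[i]))
    # z-update: average the local solutions (requires communication).
    z = np.mean([x[i] + u[i] for i in range(n_blocks)], axis=0)
    # dual update.
    for i in range(n_blocks):
        u[i] += x[i] - z

print(np.max(np.abs(z - x_true)))
```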

Finally, stochastic gradient descent algorithms… well, people now use them just like “partition and combine” MCMC. You run the algorithm for hours and pick the point in the trajectory that gives the smallest (or largest) objective value as your estimator. The performance heavily relies on how you tune the learning rate, and, in most cases, you don’t know whether the algorithm will eventually converge or not…
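
That usage pattern looks like this toy sketch (a least-squares problem with an illustrative, hand-picked learning rate): run SGD for a fixed budget and keep whichever iterate achieved the smallest full-data loss along the trajectory, with no convergence guarantee attached:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

def full_loss(w):
    return np.mean((X @ w - y) ** 2)

w = np.zeros(d)
best_w, best_loss = w.copy(), full_loss(w)
lr = 0.01                                      # performance hinges on tuning this

for step in range(5000):
    i = rng.integers(n)                        # one random data point
    grad = 2 * (X[i] @ w - y[i]) * X[i]        # stochastic gradient
    w -= lr * grad
    # Monitor the trajectory and keep whichever iterate achieved the
    # smallest full-data loss so far (the "pick one point" estimator).
    loss = full_loss(w)
    if loss < best_loss:
        best_w, best_loss = w.copy(), loss

print(best_loss)
```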

Oh, I forgot to mention that MCMC also gives you uncertainty quantification and confidence statements.

A larger sample size does not necessarily imply higher accuracy. In fact, the shrinking of the confidence interval somewhat increases the risk of losing robustness, because you know your models are wrong…
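
A toy numerical illustration of that point, under a deliberately misspecified analysis (skewed Exponential data summarized by its mean when the quantity of interest is the median; all numbers are illustrative): the interval shrinks like 1/sqrt(n) around the wrong target, so more data makes the wrong answer look more certain:

```python
import numpy as np

rng = np.random.default_rng(4)
true_median = np.log(2)          # median of an Exponential(1) distribution

results = []
for n in (100, 10_000, 1_000_000):
    x = rng.exponential(1.0, size=n)
    # Misspecified analysis: report a 95% CI for the mean, which
    # shrinks like 1/sqrt(n), as if it targeted the median.
    half_width = 1.96 * x.std() / np.sqrt(n)
    lo, hi = x.mean() - half_width, x.mean() + half_width
    results.append((n, lo, hi))
    print(n, round(lo, 3), round(hi, 3), lo <= true_median <= hi)
```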

Thanks. Since this solution forces one to use the full data set at the final accept/reject stage, it may not be palatable in really big data problems.

I’m assuming that we want an “exact” MCMC scheme that converges asymptotically to the correct distribution. So I assume that when HMC is used with gradients from subsamples, we accept or reject the result of computing the trajectory using the full data set. Since one may well want to simulate a trajectory that’s thousands of steps long, the final accept/reject decision may not dominate even if it uses all the data. Of course, this won’t be true for really, really big data sets.
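
The cost structure being described can be sketched on a toy target (a Gaussian mean with a vague normal prior; the subsample size, step size, and trajectory length are illustrative guesses, not anyone’s published settings): every leapfrog step uses a cheap subsampled gradient, and only the single accept/reject at the end of the trajectory touches the full data set:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
y = rng.normal(1.0, 1.0, size=n)         # data; target is the posterior of the mean

def U(theta):                            # full-data potential (-log posterior)
    return theta**2 / 200 + 0.5 * np.sum((y - theta) ** 2)

def grad_U_subsample(theta, m=100):      # noisy gradient from a random subsample
    batch = y[rng.integers(n, size=m)]
    return theta / 100 - (n / m) * np.sum(batch - theta)

theta, eps, L = y.mean(), 5e-5, 50       # start near the mode to keep the toy short
samples = []
for _ in range(2000):
    p0 = rng.normal()
    th, p = theta, p0
    # Leapfrog trajectory driven entirely by cheap subsampled gradients.
    p -= 0.5 * eps * grad_U_subsample(th)
    for _ in range(L - 1):
        th += eps * p
        p -= eps * grad_U_subsample(th)
    th += eps * p
    p -= 0.5 * eps * grad_U_subsample(th)
    # The single accept/reject decision uses the FULL data set once per trajectory.
    if np.log(rng.random()) < U(theta) + 0.5 * p0**2 - U(th) - 0.5 * p**2:
        theta = th
    samples.append(theta)

print(np.mean(samples[500:]))
```

With L leapfrog steps of subsample size m, the per-trajectory cost is roughly L*m + n likelihood evaluations, so the full-data correction stops dominating once L*m is comparable to n — which is exactly why very long trajectories help, and why truly enormous n eventually breaks this accounting.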

Thanks, Radford! The question with these subsampling schemes is rather “how much does it work?”, i.e., what are the losses in terms of bias and variance, and how can they be quantified from the outcome, at little cost? Another comment, on the recent “Unbiased Bayes for Big Data” by Heiko Strathmann, Dino Sejdinovic and Mark Girolami, is scheduled for this Friday: stay posted!

This has nothing to do with parallelizing MCMC, by the way. If anything, it argues against parallelizing, since parallelization is more obviously applicable when you are computing a gradient from many independent data points rather than a few. For parallelizing MCMC in general, you could try my “circular coupling” method.

I kind of hope this paper kills off the field, to be honest. My interpretation of it is that in the situations where these “big Bayes” data-splitting algorithms work, there is so much information in the data that it’s unclear to me why there is any benefit to using Bayes at all. Why not whisper “Bernstein–von Mises” in front of a mirror three times and be done with it? The frequentist technology for solving big-data problems is significantly more advanced than “let’s just subsample and combine” (the bag of little bootstraps excepted).

Does Bayesian inference have any place with exchangeable data? (It’s not clear to me that it does, except in the extremely-small-data, subjective-Bayes limit.)

(“Ever” meaning, of course, unless your model is so phenomenally complicated that the parallel proposal hides the latency, yet so convenient that the acceptance ratio can be distributed. [i.e. never])
