An alternative to MCMC?

I think of Markov chain Monte Carlo (MCMC) as a kind of directed staggering about, a random walk with a goal. (Sort of like driving in Boston.) It is conceptually simple to grasp as a way to explore the posterior probability distribution of the parameters of interest by sampling only where it is worth sampling from. That makes for a major savings over brute-force Monte Carlo, and it is far more robust than downhill fitting programs. It also gives you the error bar on the parameter for free. What could be better?
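
To make that picture concrete, here is a minimal random-walk Metropolis sketch (a toy one-parameter Gaussian posterior, not tied to any particular astronomical model): the chain staggers from a deliberately poor starting point toward the high-posterior region and then wanders around it, and the spread of the retained samples is the error bar.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(theta):
    # toy posterior: a single parameter with a standard-Gaussian posterior
    return -0.5 * theta**2

nsteps, step_size = 20000, 1.0
theta = 10.0                      # deliberately poor starting guess
chain = np.empty(nsteps)

for i in range(nsteps):
    proposal = theta + step_size * rng.normal()
    # accept with probability min(1, post(proposal) / post(theta))
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal
    chain[i] = theta

samples = chain[5000:]            # throw away the burn-in stagger
print("mean +/- std:", samples.mean(), samples.std())   # the error bar, for free
```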

Feroz & Hobson (2007, arXiv:0704.3704) describe a technique called Nested Sampling (Skilling 2004), one that could give MCMC a run for its money. It takes the one inefficient part of MCMC — the burn-in phase — and turns that into a virtue. The way it seems to work is to keep track of how the parameter space is traversed as the model parameters {theta} reach the mode of the posterior, and to take the sequence of likelihoods thus obtained L(theta), and turn it around to get theta(L). Neat.

Two big (computational) problems that I see are (1) the calculation of theta(L), and (2) the sampling to discard the tail of L(theta). The former, it seems to me, becomes intractable exactly when the likelihood surface gets complicated. The latter, again, seems to require running through just as many iterations as in MCMC to get a decent sample size. Of course, if you have a good theta(L), it does seem to be an improvement over MCMC in that you won’t need to run the chains multiple times to make sure you catch all the modes.

I think the main advantage of MCMC is that it produces and keeps track of marginalized posteriors for each parameter, whereas in this case, you have to essentially keep a full list of samples from the joint posterior and then marginalize over it yourself. The larger the sample size, the harder this gets, and in fact it is a bit difficult to tell whether the nested sampling method is competing with MCMC or Monte Carlo integration.
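
For what it is worth, the marginalization itself is mechanically simple once the joint samples are stored as an array: the marginal posterior of any one parameter is just its column, with the other parameters ignored (an illustrative snippet with a made-up two-parameter joint posterior); the real cost is having to keep and manage the full sample list.

```python
import numpy as np

rng = np.random.default_rng(1)

# stand-in for a stored list of joint-posterior samples: 50,000 draws of
# two made-up correlated parameters
samples = rng.multivariate_normal(mean=[0.0, 2.0],
                                  cov=[[1.0, 0.8], [0.8, 2.0]],
                                  size=50_000)

theta1 = samples[:, 0]                                      # ignore the other column...
hist, edges = np.histogram(theta1, bins=50, density=True)   # ...and bin what is left
print("marginal mean +/- std:", theta1.mean(), theta1.std())
```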

Is there any reason why this should not be combined with MCMC? i.e., can we use nested sampling during the burn-in phase to figure out the proposal distributions for Metropolis or Metropolis-Hastings in situ, and get the best of both worlds?
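
To make the question concrete, here is an illustrative two-stage sketch (a toy correlated Gaussian posterior and the usual 2.38^2/d rule-of-thumb scaling; none of it comes from the Feroz & Hobson paper): take whatever preliminary samples you have, whether from a plain burn-in run or from a nested-sampling pass, estimate the posterior covariance from them, and feed that into the Metropolis proposal.

```python
import numpy as np

rng = np.random.default_rng(3)

# toy correlated 2-D Gaussian posterior (purely illustrative)
cov_true = np.array([[1.0, 0.9], [0.9, 1.0]])
cov_inv = np.linalg.inv(cov_true)

def log_post(theta):
    return -0.5 * theta @ cov_inv @ theta

def metropolis(log_p, start, prop_cov, nsteps):
    d = len(start)
    chol = np.linalg.cholesky(prop_cov)   # factor of the proposal covariance
    chain = np.empty((nsteps, d))
    theta, lp = np.asarray(start, dtype=float), log_p(start)
    for i in range(nsteps):
        prop = theta + chol @ rng.normal(size=d)
        lp_prop = log_p(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain[i] = theta
    return chain

# stage 1: a short pilot run with a naive isotropic proposal; this stands in
# for whatever preliminary samples are available (burn-in, or a nested-sampling run)
pilot = metropolis(log_post, start=np.array([5.0, -5.0]),
                   prop_cov=0.1 * np.eye(2), nsteps=2000)[500:]

# stage 2: tune the proposal to the estimated posterior covariance
# (2.38^2 / d is the usual rule-of-thumb scaling for a Gaussian target)
tuned_cov = (2.38**2 / 2) * np.cov(pilot, rowvar=False)
chain = metropolis(log_post, start=pilot[-1], prop_cov=tuned_cov, nsteps=20000)

print("posterior std estimates:", chain[2000:].std(axis=0))  # both should be near 1
```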

3 Comments
  1. David van Dyk:

    You can definitely use the burn-in to set the Metropolis-Hastings jumping
    rules used after burn-in. You can learn about the posterior correlations
    and variances to optimize the jumping rule variances and blocking schemes.
    The hitch is that you have to take it with a grain of salt because the
    burn in is just burn in—it is likely more dispersed than the actual
    posterior—at least if you pick good starting values!

    [Response: Yeah, that is what I had understood. But nested sampling seems to get around that by somehow weighting the likelihoods. Not quite sure how it works or how well it works in real situations.
    -vinay]

    11-18-2007, 12:51 pm
  2. TomLoredo:

    Vinay: First, thanks for posting about the Feroz & Hobson paper; I hadn’t seen it and I certainly
    will give it a look.

    Second, I’m not sure you have correctly characterized MCMC or nested sampling. MCMC does not
    hunt for the mode; in fact, the MCMC sample with the highest prior*likelihood may not be
    a very good estimator for the mode, especially as dimension grows (but it may be a good
    starting point for input into an optimizer). MCMC doesn’t give you error bars “for free”—that’s
    what it’s aiming for in the first place! MCMC correctly wanders around the posterior
    via an algorithm that does not require one to normalize prior*likelihood; an
    unfortunate consequence is that it can’t tell you what the normalization constant
    is (though there are tricky ways to fudge around this, with various degrees of
    complexity and success depending on the problem). Nested sampling targets that
    normalization constant; it gives you the posterior samples (the “error bars”) “for free.”

    A good way to understand nested sampling is to think of it as building an approximation
    to the *Lebesgue* integral of prior*likelihood, as opposed to its Riemann integral. For
    the familiar Riemann integral of f(x), you divide the x axis into “bins” and add up the
    areas of the resulting narrow vertical “bars” of height f(x). For the Lebesgue version,
    you divide the *ordinate* into bins, and for each bin, find the size (“measure”) of the x
    region corresponding to the intersection of the f-axis bins with f(x). This builds
    the integral out of a bunch of wide but vertically short *horizontal* bars. Nested
    sampling does this by using an algorithm for finding vertical bins (defined by likelihood
    values) that has a cute statistical property: the associated x measures are, statistically,
    distributed as order statistics. So even though you can’t expect to find those measures
    exactly (they would be the areas/volumes between neighboring likelihood contours, over the
    *whole* space), by construction, the algorithm lets you “guess” those measures reasonably
    accurately. It’s truly ingenious.
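
    As a concrete illustration of that bookkeeping, here is a minimal toy sketch (not Feroz & Hobson's implementation): a 2-D unit-Gaussian likelihood with a uniform prior on [-5, 5]^2, with the expected shrinkage exp(-i/nlive) standing in for the order-statistic draws. The constrained prior sampling is done by brute-force rejection, which is exactly the step that gets hard in real problems, as discussed below.

    ```python
    import numpy as np

    rng = np.random.default_rng(42)

    ndim, nlive, niter = 2, 200, 1200
    lo, hi = -5.0, 5.0                       # uniform prior on [-5, 5]^ndim

    def log_like(theta):
        # toy likelihood: unit-variance Gaussian centered at the origin
        return -0.5 * np.sum(theta**2) - 0.5 * ndim * np.log(2.0 * np.pi)

    # start with nlive "live" points drawn from the prior
    live = rng.uniform(lo, hi, size=(nlive, ndim))
    live_logL = np.array([log_like(t) for t in live])

    log_Z = -np.inf          # running log-evidence
    log_X_prev = 0.0         # log prior volume, starts at 1

    for i in range(1, niter + 1):
        worst = int(np.argmin(live_logL))
        logL_star = live_logL[worst]

        # expected shrinkage of the prior volume, X_i ~ exp(-i / nlive),
        # which is the order-statistic property described above
        log_X = -i / nlive
        log_w = np.log(np.exp(log_X_prev) - np.exp(log_X))   # width of the shell
        log_Z = np.logaddexp(log_Z, logL_star + log_w)
        log_X_prev = log_X

        # replace the worst point with a prior draw subject to L > L_star;
        # brute-force rejection here, which is exactly the step that becomes
        # expensive when the likelihood surface is complicated
        while True:
            trial = rng.uniform(lo, hi, size=ndim)
            trial_logL = log_like(trial)
            if trial_logL > logL_star:
                live[worst], live_logL[worst] = trial, trial_logL
                break

    # add the contribution of the remaining live points at termination
    log_Z = np.logaddexp(log_Z,
                         np.logaddexp.reduce(live_logL) - np.log(nlive) + log_X_prev)

    print("estimated log Z:", log_Z)
    print("analytic  log Z:", -ndim * np.log(hi - lo))  # likelihood integrates to ~1
    ```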

    There is no theta(L) step involved; indeed, there can’t be—there is no unique solution
    (it’s a contour or surface or hypersurface). The algorithm works in theta space directly,
    and you can use the resulting thetas as posterior samples.

    But you are right about this hard part: “sampling to discard the tail of L(theta).” There
    is no general way to do it (well, maybe the new paper has one!). The few problems that
    have had good success with NS have had a special structure that allowed this sampling to
    be done cleverly and accurately. The one previous cosmological example used an approximate
    method, and to my mind (I only quickly read it) did not make a compelling case that
    the results were accurate; and to the extent they were, it’s because the problem was
    not that challenging. We (my collaboration with some statisticians at Duke on
    exoplanet problems) had a grad student spend a semester trying to get NS to work on a
    simple exoplanet problem (with a complex, multimodal posterior), and it’s the sampling
    “above the tail” where we got hung up. He (Floyd Bullard) could
    get it to work, but it was not as efficient as more common, less clever methods
    (e.g., importance sampling).

    Maybe the new paper will force us to re-examine NS.

    -Tom

    12-06-2007, 2:57 pm
  3. vlk:

    Thanks for those very illuminating comments, Tom. You are of course spot on that what MCMC does is get error bars — in fact my favorite game with astronomers is to shock them by pointing out how you can’t ever really get a definitive “best-fit” with MCMC and yet you will not lack for realistic characterization of parameter uncertainty. Also thanks for clarifying the Lebesgue integral approach — that was what I was trying to say while flailing about with the theta(L) business.

    btw, during one of our CHASC meetings recently, Xiao-li described a fairly general method to get comparative goodness-of-fits out of MCMC. So even though MCMC doesn’t give you a normalization constant, it is possible to work around that to compare nested models. More on that later.

    12-07-2007, 4:08 pm