ab posteriori ad priori

vlk — Sat, 29 Sep 2007 22:03:57 +0000

A great advantage of Bayesian analysis, they say, is the ability to propagate the posterior. That is, if we derive a posterior probability distribution function for a parameter using one dataset, we can apply that as the prior when a new dataset comes along, and thereby improve our estimates of the parameter and shrink the error bars.

But how exactly does it work? I asked Tom Loredo about this in the context of some strange behavior that Ian Evans had noticed in sequential applications of BEHR: when the posterior from the preceding dataset was used as the prior for the next, the results seemed to depend on the order in which the datasets were considered. (As it happens, this arose from approximating the posterior distribution before passing it on as the prior distribution to the next stage, a feature that has now been corrected.) This is what he said:

Yes, this is a simple theorem. Suppose you have two data sets, D1 and D2, hypotheses H, and background info (model, etc.) I. Considering D2 to be the new piece of info, Bayes’s theorem is:

[1]
p(H|D1,D2) = p(H|D1) p(D2|H, D1)            ||  I
             -------------------
                    p(D2|D1)
where the “|| I” on the right is the “Skilling conditional” indicating that all the probabilities share an “I” on the right of the conditioning solidus (in fact, they also share a D1).

We can instead consider D1 to be the new piece of info; BT then reads:

[2]
p(H|D1,D2) = p(H|D2) p(D1|H, D2)            ||  I
             -------------------
                    p(D1|D2)
Now go back to [1], and use BT on the p(H|D1) factor:
p(H|D1,D2) = p(H) p(D1|H) p(D2|H, D1)            ||  I
             ------------------------
                    p(D1) p(D2|D1)

           = p(H, D1, D2)
             ------------      (by the product rule)
                p(D1,D2)
Do the same to [2]: use BT on the p(H|D2) factor:
p(H|D1,D2) = p(H) p(D2|H) p(D1|H, D2)            ||  I
             ------------------------
                    p(D2) p(D1|D2)

           = p(H, D1, D2)
             ------------      (by the product rule)
                p(D1,D2)
So the results from the two orderings are the same. In fact, in the Cox-Jaynes approach, the “axioms” of probability aren’t axioms, but get derived from desiderata that guarantee this kind of internal consistency of one’s calculations. So this is a very fundamental symmetry.

Note that you have to worry about possible dependence between the data (i.e., p(D2|H, D1) appears in [1], not just p(D2|H)). In practice, separate data are often independent (conditional on H), so p(D2|H, D1) = p(D2|H) (i.e., if you consider H as specified, then D1 tells you nothing about D2 that you don’t already know from H). This is the case, e.g., for basic iid normal data, or Poisson counts. But even in these cases dependences might arise, e.g., if there are nuisance parameters that are common for the two data sets (if you try to combine the info by multiplying *marginalized* posteriors, you may get into trouble; you may need to marginalize *after* multiplying if nuisance parameters are shared, or account for dependence some other way).
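The caveat about shared nuisance parameters can be seen in a toy sketch. Everything here is an assumption made for illustration and not part of the reply: a source intensity s, a background b common to both data sets, flat priors on a grid, and made-up counts. D1 is a count in a source region (mean s + b) and D2 is a count in a background region (mean b). By construction D2 alone says nothing about s, so multiplying the separately marginalized posteriors throws away everything D2 taught us about b.

```python
import numpy as np

# Assumed toy model (not from the text): s is of interest, b is a shared nuisance.
#   D1: n1 ~ Poisson(s + b)   counts in a source region
#   D2: n2 ~ Poisson(b)       counts in a background region
n1, n2 = 8, 3                      # made-up counts
s = np.linspace(0.01, 30.0, 600)
b = np.linspace(0.01, 30.0, 600)
ds, db = s[1] - s[0], b[1] - b[0]
S, B = np.meshgrid(s, b, indexing="ij")   # axis 0 is s, axis 1 is b

def poisson_like(n, mu):
    # Poisson likelihood up to the constant 1/n!
    return mu**n * np.exp(-mu)

def marginal_s(joint):
    # Integrate out b on the grid, then normalize over s
    m = joint.sum(axis=1) * db
    return m / (m.sum() * ds)

prior = np.ones_like(S)            # flat prior in (s, b), an assumption

# Multiply in the full (s, b) space first, marginalize afterwards
right = marginal_s(prior * poisson_like(n1, S + B) * poisson_like(n2, B))

# Shortcut: marginalize each data set separately, then multiply.
# The prior on s is flat, so dividing out the doubly counted prior is a no-op.
m1 = marginal_s(prior * poisson_like(n1, S + B))
m2 = marginal_s(prior * poisson_like(n2, B))
wrong = m1 * m2
wrong /= wrong.sum() * ds

print("max |right - wrong| :", np.abs(right - wrong).max())
print("posterior mean of s :", (s * right).sum() * ds, "vs", (s * wrong).sum() * ds)
```

The two curves disagree because, with b unspecified, D1 and D2 are no longer independent given s alone. Marginalizing after multiplying keeps that dependence; the shortcut drops it.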

What if you had 3, 4, ..., N observations? Does the order in which you apply BT affect the results?

No, as long as you use BT correctly and don’t ignore any dependences that might arise.

If not, is there a prescription for what is the Right Thing [TM] to do?

Always obey the laws of probability theory! 9-)
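The order-independence is easy to check numerically. The following is a minimal sketch, not BEHR itself: it assumes independent Poisson counts that share a single intensity s, a flat prior on a grid, and made-up counts. Each step applies Bayes' theorem exactly (no approximation of the intermediate posterior) and passes the result on as the next prior.

```python
import itertools
import numpy as np

counts = [3, 0, 7, 2]                 # made-up data sets, one count each
s = np.linspace(0.01, 20.0, 2000)     # grid over the intensity s
ds = s[1] - s[0]

def update(prior, n):
    # One application of Bayes' theorem on the grid:
    # posterior is proportional to prior times the Poisson likelihood.
    # The 1/n! factor is constant in s and drops out on normalization.
    post = prior * s**n * np.exp(-s)
    return post / (post.sum() * ds)

flat = np.ones_like(s)                # flat prior, an assumption

def sequential(order):
    # Feed each posterior in as the prior for the next data set
    post = flat
    for n in order:
        post = update(post, n)
    return post

reference = sequential(counts)
max_diff = max(
    np.abs(sequential(perm) - reference).max()
    for perm in itertools.permutations(counts)
)
print("largest difference over all orderings:", max_diff)
```

Any difference that shows up is at the level of floating-point rounding. Replacing the exact intermediate posterior with an approximation before passing it on is the kind of step that can break this symmetry.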

When you observed zero counts, you didn’t not observe any counts

vlk — Mon, 24 Sep 2007 00:28:15 +0000

Dong-Woo, who has been playing with BEHR, noticed that the confidence bounds quoted on the source intensities seem to be unchanged when the source counts are zero, regardless of what the background counts are set to. That is, p(s|N_S,N_B) is invariant when N_S=0, for any value of N_B. This seems a bit odd, because [naively] one expects that as N_B increases, it should/ought to get more and more likely that s gets closer to 0.

Suppose you compute the posterior probability distribution of the intensity of a source, s, when the data include counts in a source region (N_S) and counts in a background region (N_B). When N_S=0, i.e., no counts are observed in the source region,

p(s|N_S=0, N_B) = [(1+b)^a / Gamma(a)] * s^(a-1) * e^(-s(1+b)),

where a,b are the parameters of a gamma prior.

Why does N_B have no effect? Because when you have zero counts, the entire effect of the background goes towards evaluating how good the chosen model is (so it becomes a model comparison problem, not a parameter estimation one), and not into estimating the parameter of interest, the source intensity. That is, it goes into the normalization factor of the probability distribution, p(N_S,N_B). The parts that depend on N_B cancel out when the expression for p(s|N_S,N_B) is written out, because the shape is independent of N_B and the pdf must integrate to 1.
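Here is a small grid calculation that shows the cancellation. It is a sketch under a simplified model that I am assuming for illustration, not BEHR's actual code: N_S ~ Poisson(s + bkg), N_B ~ Poisson(bkg) with equal source and background areas, a gamma(a, b) prior on s, and a flat prior on the background intensity `bkg`. The values of a and b and the grids are made up.

```python
import math
import numpy as np

a, b = 2.0, 0.5                        # gamma prior on s (made-up values)
s = np.linspace(1e-3, 30.0, 600)       # source intensity grid
bkg = np.linspace(1e-3, 60.0, 600)     # background intensity grid, flat prior (assumption)
ds, dbkg = s[1] - s[0], bkg[1] - bkg[0]
S, BKG = np.meshgrid(s, bkg, indexing="ij")

def posterior_s(n_s, n_b):
    # Marginal posterior p(s | N_S, N_B) on the grid
    log_post = ((a - 1) * np.log(S) - b * S            # gamma prior on s
                + n_s * np.log(S + BKG) - (S + BKG)    # source-region likelihood
                + n_b * np.log(BKG) - BKG)             # background-region likelihood
    post = np.exp(log_post - log_post.max())
    marg = post.sum(axis=1) * dbkg                     # integrate out the background
    return marg / (marg.sum() * ds)

# N_S = 0: the posterior on s does not care what N_B is
p0, p5, p50 = posterior_s(0, 0), posterior_s(0, 5), posterior_s(0, 50)
print("N_S=0, spread across N_B:", max(np.abs(p0 - p5).max(), np.abs(p0 - p50).max()))

# ... and it matches the closed form quoted above, up to grid error
analytic = (1 + b)**a / math.gamma(a) * s**(a - 1) * np.exp(-s * (1 + b))
print("N_S=0, grid vs closed form:", np.abs(p0 - analytic).max())

# N_S = 1: now N_B does change the answer
q0, q50 = posterior_s(1, 0), posterior_s(1, 50)
print("N_S=1, spread across N_B:", np.abs(q0 - q50).max())
```

With N_S=0 the source-region likelihood is e^-(s+bkg), which factorizes into a piece in s and a piece in bkg, so the background integral is a constant that normalization removes. As soon as N_S > 0, the (s+bkg)^N_S term couples the two and N_B matters again.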

No doubt this is obvious, but I hadn’t noticed it before.

PS: Also shows why upper limits should not be identified with upper confidence bounds.

The AstroStat Slog » BEHR
