It bothers me.

The full description of “bayes” under sherpa/ciao is given at http://cxc.harvard.edu/ciao3.4/ahelp/bayes.html [1]. Some sentences kept bothering me, and here is my account of the reasons, given outside of the quotes.

SUBJECT(bayes) CONTEXT(sherpa)
SYNOPSIS
A Bayesian maximum likelihood function.

A maximum likelihood function is common to both Bayesian and frequentist methods. I don’t get the point of specifically attaching “Bayesian” to “maximum likelihood function.”

DESCRIPTION
(snip)
We can relate this likelihood to the Bayesian posterior density for S(i) and B(i)
using Bayes’ Theorem:

p[S(i),B(i) | N(i,S)] = p[S(i)|B(i)] * p[B(i)] * p[N(i,S) | S(i),B(i)] / p[D] .

The factor p[S(i)|B(i)] is the Bayesian prior probability for the source model
amplitude, which is assumed to be constant, and p[D] is an ignorable normalization
constant. The prior probability p[B(i)] is treated differently; we can specify it
using the posterior probability for B(i) off-source:

p[B(i)] = [ A (A B(i))^N(i,B) / N(i,B)! ] * exp[-A B(i)] ,

where A is an “area” factor that rescales the number of predicted background
counts B(i) to the off-source region.

IMPORTANT: this formula is derived assuming that the background is constant as a
function of spatial area, time, etc. If the background is not constant, the Bayes
function should not be used.

Why not? If I rephrase it, what it says is that B(i) is a constant. Then why bother to write p[B(i)], a probability density for a constant? The statement sounds self-contradictory to me. I take B(i) to be a constant parameter. If the above pdf is a correct model for the background, it would be more suitable to say that the background is homogeneous, that is, describable by a homogeneous Poisson process; a slight change of notation would also be needed. Assuming the Poisson process, we can estimate the background rate (a constant parameter) and its density p[B(i)], and this estimate is a constant, just as is stated for p[S(i)|B(i)], the prior probability for the constant source model amplitude.
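
For what it is worth, the quoted p[B(i)] is exactly what one obtains by treating B(i) as an uncertain constant rate: take the Poisson likelihood of the off-source counts N(i,B), whose mean is A B(i), put a flat prior on B(i), and normalize over B(i). A minimal sketch of that reading (mine, not the documentation’s), in LaTeX notation:

p\bigl[B(i) \mid N(i,B)\bigr]
  = \frac{ e^{-A B(i)} \, [A B(i)]^{N(i,B)} / N(i,B)! }
         { \int_0^{\infty} e^{-A b} \, (A b)^{N(i,B)} / N(i,B)! \, db }
  = \frac{ A \, [A B(i)]^{N(i,B)} }{ N(i,B)! } \, e^{-A B(i)} ,

since the denominator integrates to 1/A. This is a gamma density in B(i), which matches the quoted formula.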

I think the reason for “the Bayes function should not be used” is that the current sherpa is not capable of hierarchical modeling. Nevertheless, I believe one could script MCMC methods in S-Lang/Python, combined with the existing sherpa tools, to incorporate a possibly space dependent density p[B(i,x,y)]; a rough sketch of the idea follows below. I was told that, currently, assuming a constant background regardless of location, together with background subtraction, is the common practice.
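
Purely to illustrate what I mean, here is a minimal Python sketch (the toy on/off model, the numbers, and all names are mine, not sherpa’s): a random-walk Metropolis sampler that treats the per-bin background rate as a nuisance parameter alongside a source amplitude and marginalizes it by simply reporting the source samples, rather than subtracting a background estimate. A spatially varying p[B(i,x,y)] would replace the single rate b with a parametrized function of position.

import numpy as np

rng = np.random.default_rng(0)

# Toy on/off Poisson data for a single bin (made-up numbers):
#   on-source counts  N_on  ~ Poisson(s + b)
#   off-source counts N_off ~ Poisson(A * b),  A = area/exposure ratio
N_on, N_off, A = 25, 40, 4.0

def log_post(s, b):
    # flat priors; -inf outside the allowed region
    if s <= 0 or b <= 0:
        return -np.inf
    return (N_on * np.log(s + b) - (s + b)
            + N_off * np.log(A * b) - A * b)

s, b = 10.0, 10.0
source_samples = []
for _ in range(20000):
    s_new = s + rng.normal(scale=1.0)
    b_new = b + rng.normal(scale=1.0)
    if np.log(rng.uniform()) < log_post(s_new, b_new) - log_post(s, b):
        s, b = s_new, b_new
    source_samples.append(s)

# marginal posterior summary for the source rate (background integrated out)
print("posterior mean source rate:", np.mean(source_samples[5000:]))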

To take into account all possible values of B(i), we integrate, or marginalize,
the posterior density p[S(i),B(i) | N(i,S)] over all allowed values of B(i):

p[S(i) | N(i,S)] = (integral)_0^(infinity) p[S(i),B(i) | N(i,S)] dB(i) .

For the constant background case, this integral may be done analytically. We do
not show the final result here; see Loredo. The function -log p[S(i)|N(i,S)] is
minimized to find the best-fit value of S(i). The magnitude of this function
depends upon the number of bins included in the fit and the values of the data
themselves. Hence one cannot analytically assign a `goodness-of-fit’ measure to a
given value of this function. Such a measure can, in principle, be computed by
performing Monte Carlo simulations. One would repeatedly sample new datasets from
the best-fit model, and fit them, and note where the observed function minimum
lies within the derived distribution of minima. (The ability to perform Monte
Carlo simulations is a feature that will be included in a future version of
Sherpa.)
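
The Monte Carlo recipe in the last quoted paragraph can already be scripted outside sherpa. Below is a minimal Python sketch (a toy one-parameter Poisson model with made-up counts; none of the names come from sherpa) of the parametric bootstrap it describes: sample datasets from the best-fit model, refit each, and note where the observed minimum of the fit statistic falls within the distribution of simulated minima.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

shape = np.array([1.0, 2.0, 4.0, 2.0, 1.0])   # fixed model shape per bin
data = np.array([3, 9, 17, 7, 2])             # fake observed counts

def neg_loglike(amp, counts):
    mu = amp * shape
    return np.sum(mu - counts * np.log(mu))   # -log Poisson likelihood, up to a constant

def fit(counts):
    res = minimize_scalar(neg_loglike, bounds=(1e-3, 100.0),
                          args=(counts,), method="bounded")
    return res.x, res.fun

best_amp, observed_min = fit(data)

# sample new datasets from the best-fit model, refit each, and locate the
# observed minimum within the distribution of simulated minima
minima = np.array([fit(rng.poisson(best_amp * shape))[1] for _ in range(1000)])
print("fraction of simulated minima exceeding the observed one:",
      np.mean(minima >= observed_min))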

Note on Background Subtraction

Bayesian computation means, one way or another, that one can obtain posterior distributions for all parameters, regardless of their kind: source or background. I wonder why there is a discrimination such that the source parameters carry uncertainty while the background is treated as constant and subtracted (even if marginalization is emulated by subtracting different background counts with corresponding weights). It feels awkward to me. Background counts, like source counts, are Poisson random variables. I would like to know what justifies a constant background while one otherwise takes a probabilistic, Bayesian approach. I would also like to know why the mixture model approach – a mixture of a source model and a background model, with marginalization over the background by treating B(i) as a nuisance parameter – has not been tried; a sketch of what I have in mind appears below. By casting one’s sights broadly over Bayesian modeling methods and the basics of probability, estimating the source model and its parameters more robustly is tractable without subtracting the background prior to fitting the source model.
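
To make concrete what I mean by treating B(i) as a nuisance parameter rather than subtracting it, here is a minimal numerical sketch in Python (the same toy on/off counts and made-up names as in the earlier sketch; the quoted documentation says that for a constant background this integral can be done analytically, so the brute-force quadrature here is only for illustration). The background rate is integrated out of the per-bin likelihood, weighted by the gamma-shaped p[B(i)] built from the off-source counts, leaving a marginal likelihood for the source rate alone.

import numpy as np
from scipy.special import gammaln

# toy per-bin on/off counts and area factor (made-up numbers)
N_on, N_off, A = 25, 40, 4.0
b_grid = np.linspace(1e-3, 50.0, 4000)   # grid of background rates B(i)
db = b_grid[1] - b_grid[0]

def log_poisson(n, mu):
    return n * np.log(mu) - mu - gammaln(n + 1)

# p[B(i)] from the off-source data, as in the quoted gamma-shaped formula
log_p_b = np.log(A) + log_poisson(N_off, A * b_grid)

def marginal_loglike(s):
    # log of  integral_0^inf  p[N(i,S) | s, b] p[b] db,  by simple quadrature
    integrand = log_poisson(N_on, s + b_grid) + log_p_b
    m = integrand.max()
    return m + np.log(np.sum(np.exp(integrand - m)) * db)

s_grid = np.linspace(0.1, 40.0, 400)
loglikes = np.array([marginal_loglike(s) for s in s_grid])
print("source rate maximizing the marginal likelihood:", s_grid[np.argmax(loglikes)])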

The background should not be subtracted from the data when this function is used.
The background only needs to be specified, as in this example:
(snip)

EXAMPLES
EXAMPLE 1
Specify the fitting statistic and then confirm it has been set. The method is then
changed from “Levenberg-Marquardt” (the default), since this statistic does not
work with that algorithm.

sherpa> STATISTIC BAYES
sherpa> SHOW STATISTIC
Statistic: Bayes
sherpa> METHOD POWELL
(snip)

I would like to know why it does not work with Levenberg-Marquardt (LM) but does work with Powell. Are there any references that explain why LM does not work with “bayes”?

I look forward to your comments and references, particularly on the reasons for calling this a “Bayesian maximum likelihood function” and on the trouble with LM. I also look forward to seeing off-the-norm approaches, such as modeling fully in a Bayesian way (like van Dyk et al. 2001, whose application I nevertheless rarely see) or marginalizing over the background, without subtraction, while simultaneously fitting the source model. There is plenty of room for improvement in source model fitting under the contamination and distortion that incident x-ray photons suffer through space, the telescope, and signal transmission.

  1. Note that the current sherpa is in beta under ciao 4.0, not ciao 3.4, and a description of “bayes” from the most recent sherpa is not yet available, which means this post will need updating once a new release is out.
4 Comments
  1. brianISU:

    I would also like to point out the quote “We can relate this likelihood to the Bayesian posterior density for S(i) and B(i) using Bayes’ Theorem:”. I am not going to say that this quote is wrong, but the posterior can be derived without Bayes’ theorem. Posteriors can be derived just using the definitions of conditional probability. It seems that whenever Bayesian statistics is mentioned, it is automatically assumed that all posteriors come from Bayes’ theorem and that Bayes’ theorem is the backbone of Bayesian statistics (this was my belief as well when I was first introduced to the Bayesian paradigm). I do not believe that this is true; rather, I believe the backbone of Bayesian statistics is epistemic probability. I just thought I would add this for fun and to generate more discussion. In particular, I am really interested in other viewpoints on this.

    12-03-2008, 8:51 pm
  2. TomLoredo:

    Hyunsook: I don’t use Sherpa, but I believe the “bayes” option is based on an algorithm of mine that Peter Freeman helped put into Sherpa; if that’s true, perhaps these comments will be helpful. I think you are right to be bothered, because the documentation is misleading. “bayes” does not implement a “Bayesian maximum likelihood function.” First off, it doesn’t maximize anything; it defines the form of the fit statistic Sherpa uses to be a marginal likelihood function. It assumes there is a background rate that is constant in time and space, so that off-source data can be presumed to measure the background rate that is present for the on-source spectrum. Then, bin by bin, it marginalizes the background rate out of the likelihood function (using a flat prior for each bin). The result is a likelihood function for the source rates in each bin; when you use “bayes,” your fits will maximize this marginal likelihood function.

    Part of your confusion about this is the use of the word “constant.” To a frequentist statistician, it immediately implies a quantity that is not random (and thus cannot be legitimately assigned a frequency distribution). From the Bayesian point of view, what is distributed in a probability distribution p(x) is the probability, p (it’s distributed over the possible values of x, much like a matter density, rho(x)), not the argument, x (considered to be distributed over its possible values in many repeated observations, in the frequentist interpretation). So it is legitimate to talk about the probability distribution for something believed to be constant.

    Brian: There is nothing the least bit controversial in your statement that Bayesian inference is broader than Bayes’s theorem, the latter being just one of the arsenal of tools used in Bayesian calculations. For many years in my own tutorial lectures on this (e.g., as archived in the CASt summer school lectures), I have explicitly emphasized that Bayesian inference uses all of probability theory, the important distinction being that it calculates probabilities for hypotheses. In fact, in my lectures, after deriving Bayes’s theorem, I also derive the law of total probability (the marginalization rule), and state that, in my own applications, I wind up using it more often than Bayes’s theorem. Anyway, I think you’d be hard-pressed to find someone who really uses Bayesian methods who would disagree with your insight!

    12-06-2008, 3:21 pm
  3. TomLoredo:

    Hyunsook asks:

    I would like to know why it’s not working with Levenberg-Marquardt (LM)

    The LM algorithm uses the form of the chi**2 function to develop an approximation to derivatives of the fitting function, used to guide steps to improve the fit. Since “bayes” changes the fit function to something that is not in the chi**2 form (sum of weighted squared differences between data and model), standard LM can’t work with it (nor can it work with other non-Gaussian likelihoods). Put another way, LM is not a generic optimization algorithm; it is specifically tailored to chi**2 minimization. Powell is a generic algorithm, so it can work with the “bayes” marginal likelihood.

    Also, in case it wasn’t clear from my earlier comment, the “bayes” marginal likelihood does not subtract the background; it marginalizes over it (analytically). If you follow the same procedure for Gaussian noise, it just so happens that the result can be expressed in terms of subtracting a background estimate, but that is just a convenient “accident” that comes from the form of the Gaussian. From a Bayesian point of view, the right thing to do is always to marginalize an uncertain background, not to subtract off a background estimate.

    The only reference for the “bayes” algorithm is my paper in the first SCMA volume (ADS link), though I’m working on a more complete description of it (and some related algorithms). It’s also described in most of my CASt summer school lectures. I believe Harrison Prosper independently derived a similar algorithm (for particle physics applications) around the same time. Finally, the quadrature version of the CHASC Bayesian hardness ratio work I think uses a similar algorithm as a first step; I think it’s described in the Appendix to that paper.

    12-06-2008, 3:34 pm
  4. hlee:

    I’m very much obliged for your responses. As Brian said, I also expect more viewpoints to come, since “bayes” leaves many ambiguities for people who want to do their x-ray spectral analysis within sherpa using Bayesian methods. Apart from how to use it and the instruction to avoid LM, and with no references, it does not describe its objective clearly, from my viewpoint. I thought people might be turned away after reading it. (I began to learn python to get around these ambiguities and to implement Bayesian strategies on my own; Ciao 4.0 and sherpa accommodate python in addition to S-lang.)

    Thank you, Tom, for your explanation of what “bayes” in sherpa does and why the LM algorithm does not work with it. Regarding the latter in particular, it all comes down to the objective function: how it is defined determines which algorithms can be used. Depending on the shape of the objective function, the strategy must change. As a statistician, I would rather work on a robust one and make it work for spectral fitting. Instead of simply saying that LM does not work, I would like to give the reason why it does not work. However, not knowing what is inside – “bayes” neither explained nor pointed to references – I was left curious. I hope the documentation you are preparing will be finished soon, so that “bayes” can hold a better explanation of its function and shed more light on Bayesian statistics.

    12-08-2008, 12:26 am