Comments on: It bothers me.
http://hea-www.harvard.edu/AstroStat/slog/2008/it-bothers-me/
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

By: hlee (Mon, 08 Dec 2008 04:26:52 +0000)
http://hea-www.harvard.edu/AstroStat/slog/2008/it-bothers-me/comment-page-1/#comment-824

I'm very much obliged to have your responses. As Brian said, I also expect more viewpoints to come, since there are many ambiguities in "bayes" for people who want to do their X-ray spectral analysis within Sherpa with Bayesian methods. Apart from how to use it and an instruction to avoid LM, and with no references, it does not describe its objective clearly, from my viewpoint. I thought people might be turned away after reading it. (I began to learn Python to get around these ambiguities and to implement Bayesian strategies on my own; CIAO 4.0 and Sherpa accommodate Python in addition to S-Lang.)

Thank you, Tom, for your explanation of what "bayes" in Sherpa does and why the LM algorithm does not work with "bayes." Regarding the latter in particular, it all comes down to the objective function: how it is defined determines which algorithms apply. Depending on the shape of the objective function, the optimization strategy must change. As a statistician, I would rather work on a robust one and make it work for spectral fitting. Instead of simply saying that LM does not work, I would like to give a reason why it does not. However, not knowing what is inside ("bayes" neither explained this nor pointed to references), I was curious. I hope the documentation you are preparing will be finished soon, and that "bayes" will then carry a better explanation of its function and shed more light on Bayesian statistics.

By: TomLoredo (Sat, 06 Dec 2008 19:34:09 +0000)
http://hea-www.harvard.edu/AstroStat/slog/2008/it-bothers-me/comment-page-1/#comment-823

Hyunsook asks:

I would like to know why it’s not working with Levenberg-Marquardt (LM)

The LM algorithm uses the form of the chi**2 function to develop an approximation to derivatives of the fitting function, used to guide steps to improve the fit. Since “bayes” changes the fit function to something that is not in the chi**2 form (sum of weighted squared differences between data and model), standard LM can’t work with it (nor can it work with other non-Gaussian likelihoods). Put another way, LM is not a generic optimization algorithm; it is specifically tailored to chi**2 minimization. Powell is a generic algorithm, so it can work with the “bayes” marginal likelihood.
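To make the distinction concrete, here is a minimal sketch using SciPy's general-purpose optimizers rather than anything from Sherpa; the bin counts, error estimates, and starting values are made up for illustration. It shows that LM (scipy.optimize.least_squares with method='lm') needs the problem posed as a vector of weighted residuals, i.e. the chi**2 form, while Powell only needs a scalar objective and can therefore minimize a Poisson negative log-likelihood directly.

```python
import numpy as np
from scipy.optimize import least_squares, minimize
from scipy.stats import poisson

# Toy data: counts in a handful of bins from a constant-rate model
# (all numbers are made up for illustration).
counts = np.array([5, 8, 3, 7, 6])
errors = np.sqrt(np.clip(counts, 1, None))  # Gaussian sigma estimates for chi**2

# Levenberg-Marquardt requires the problem in chi**2 form: a vector of
# weighted residuals whose sum of squares is to be minimized.
def residuals(params):
    rate = params[0]
    return (counts - rate) / errors

lm_fit = least_squares(residuals, x0=[5.0], method='lm')

# A Poisson log-likelihood is not a sum of squared residuals, so LM does not
# apply; a generic optimizer such as Powell only needs a scalar objective.
def neg_loglike(params):
    rate = params[0]
    if rate <= 0:
        return 1e30  # keep the rate physical during the search
    return -np.sum(poisson.logpmf(counts, rate))

powell_fit = minimize(neg_loglike, x0=[5.0], method='Powell')

print(lm_fit.x, powell_fit.x)
```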

Also, in case it wasn’t clear from my earlier comment, the “bayes” marginal likelihood does not subtract the background; it marginalizes over it (analytically). If you follow the same procedure for Gaussian noise, it just so happens that the result can be expressed in terms of subtracting a background estimate, but that is just a convenient “accident” that comes from the form of the Gaussian. From a Bayesian point of view, the right thing to do is always to marginalize an uncertain background, not to subtract off a background estimate.
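As a rough numerical illustration of that difference (this is not Sherpa's implementation, and the counts and exposures below are invented), one can marginalize the background rate of a single on/off measurement over a grid with a flat prior and compare the result with naive background subtraction:

```python
import numpy as np
from scipy.stats import poisson

# Hypothetical on/off measurement (numbers invented for illustration).
n_on, n_off = 12, 30     # counts in the on-source and off-source regions
t_on, t_off = 1.0, 3.0   # exposures; the off-source exposure is 3x deeper

b = np.linspace(1e-3, 60.0, 4000)  # grid of background rates
db = b[1] - b[0]

def marginal_likelihood(s):
    """Likelihood for the source rate s after integrating the joint Poisson
    likelihood over the background rate b with a flat prior (crude quadrature,
    standing in for the analytic marginalization)."""
    joint = poisson.pmf(n_on, (s + b) * t_on) * poisson.pmf(n_off, b * t_off)
    return np.sum(joint) * db

s_grid = np.linspace(0.0, 30.0, 600)
marg = np.array([marginal_likelihood(s) for s in s_grid])

s_marginal = s_grid[np.argmax(marg)]          # peak of the marginal likelihood
s_subtracted = n_on / t_on - n_off / t_off    # naive background subtraction
print(s_marginal, s_subtracted)
```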

The only reference for the “bayes” algorithm is my paper in the first SCMA volume (ADS link: http://adsabs.harvard.edu/abs/1992scma.conf..275L), though I’m working on a more complete description of it (and some related algorithms). It’s also described in most of my CASt summer school lectures. I believe Harrison Prosper independently derived a similar algorithm (for particle physics applications) around the same time. Finally, the quadrature version of the CHASC Bayesian hardness-ratio work uses, I believe, a similar algorithm as a first step; I think it’s described in the appendix to that paper.

By: TomLoredo (Sat, 06 Dec 2008 19:21:41 +0000)
http://hea-www.harvard.edu/AstroStat/slog/2008/it-bothers-me/comment-page-1/#comment-822

Hyunsook: I don’t use Sherpa, but I believe the “bayes” option is based on an algorithm of mine that Peter Freeman helped put into Sherpa; if that’s true, perhaps these comments will be helpful. I think you are right to be bothered, because the documentation is misleading. “bayes” does not implement a “Bayesian maximum likelihood function.” First off, it doesn’t maximize anything; it defines the form of the fit statistic Sherpa uses to be a marginal likelihood function. It assumes there is a background rate that is constant in time and space, so that off-source data can be presumed to measure the background rate that is present for the on-source spectrum. Then, bin by bin, it marginalizes the background rate out of the likelihood function (using a flat prior for each bin).
The result is a likelihood function for the source rates in each bin; when you use “bayes,” your fits will maximize this marginal likelihood function.
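Putting those pieces together, here is a rough sketch of that recipe under the same assumptions as the earlier snippets (it is not Sherpa code, and the spectrum, exposures, and toy power-law model are made up): per-bin backgrounds are marginalized numerically with flat priors, and the resulting total marginal log-likelihood is maximized over the spectral parameters with Powell.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

# Made-up spectrum: on- and off-source counts per energy bin.
energy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # keV, bin centers
n_on   = np.array([40, 22, 14, 9, 6])
n_off  = np.array([30, 27, 33, 28, 31])
t_on, t_off = 1.0, 3.0                        # off-source exposure 3x deeper

b_grid = np.linspace(1e-3, 60.0, 2000)        # per-bin background-rate grid
db = b_grid[1] - b_grid[0]

def source_model(params):
    """Toy power law giving the expected source rate in each energy bin."""
    norm, index = params
    return norm * energy ** (-index)

def neg_log_marginal(params):
    s = source_model(params)
    if np.any(s <= 0) or not np.all(np.isfinite(s)):
        return 1e30                           # keep rates physical
    total = 0.0
    for i in range(len(energy)):              # marginalize the background bin by bin
        joint = (poisson.pmf(n_on[i], (s[i] + b_grid) * t_on) *
                 poisson.pmf(n_off[i], b_grid * t_off))
        total += np.log(np.sum(joint) * db + 1e-300)
    return -total

fit = minimize(neg_log_marginal, x0=[30.0, 1.5], method='Powell')
print(fit.x)  # best-fit normalization and index of the toy power law
```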

Part of your confusion about this is the use of the word “constant.” To a frequentist statistician, it immediately implies a quantity that is not random (and thus cannot legitimately be assigned a frequency distribution). From the Bayesian point of view, what is distributed in a probability distribution p(x) is the probability, p (it is distributed over the possible values of x, much like a matter density, rho(x)), not the argument, x (which, in the frequentist interpretation, is considered to be distributed over its possible values in many repeated observations). So it is legitimate to talk about the probability distribution for something believed to be constant.

Brian: There is nothing the least bit controversial in your statement that Bayesian inference is broader than Bayes’s theorem, the latter just being one of the arsenal of tools used in Bayesian calculations. For many years in my own tutorial lectures on this (e.g., as archived in the CASt summer school lectures), I explicitly emphasize that Bayesian inference uses all of probability theory, the important distinction being that it calculates probabilities for hypotheses. In fact, in my lectures, after deriving Bayes’s theorem, I also derive the law of total probability (the marginalization rule), and state that, in my own applications, I wind up using it more often than Bayes’s theorem. Anyway, I think you’d be hard-pressed to find someone who really uses Bayesian methods who would disagree with your insight!
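For readers who want the two rules side by side, here they are in standard notation (the notation is mine, not taken from the lectures): Bayes’s theorem updates the probability of a hypothesis H given data D and background information I, and the law of total probability (marginalization) removes a nuisance parameter such as a background rate b.

```latex
p(H \mid D, I) = \frac{p(H \mid I)\, p(D \mid H, I)}{p(D \mid I)}
\qquad \text{(Bayes's theorem)}

p(s \mid D, I) = \int p(s, b \mid D, I)\, \mathrm{d}b
\qquad \text{(law of total probability / marginalization)}
```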

By: brianISU (Thu, 04 Dec 2008 00:51:18 +0000)
http://hea-www.harvard.edu/AstroStat/slog/2008/it-bothers-me/comment-page-1/#comment-821

I would also like to point out the quote “We can relate this likelihood to the Bayesian posterior density for S(i) and B(i) using Bayes’ Theorem:”. I am not going to say that this quote is wrong, but the posterior can be derived without Bayes’ theorem: posteriors can be derived just from the definition of conditional probability. It seems that whenever Bayesian statistics is mentioned, it is automatically assumed that all posteriors come from Bayes’ theorem and that Bayes’ theorem is the backbone of Bayesian statistics (this was my belief as well when I was first introduced to the Bayesian paradigm). I do not believe this is true; rather, I believe the backbone of Bayesian statistics is epistemic probability. I just thought I would add this for fun and to generate more discussion. In particular, I am really interested in other viewpoints on this.
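In symbols (the notation is mine, not from the Sherpa documentation), the point is that the posterior is just a conditional probability, obtained directly from the definition of conditional probability applied to the joint distribution; Bayes’ theorem only enters when the joint is factored the other way.

```latex
p(\theta \mid x) = \frac{p(\theta, x)}{p(x)}
\qquad \text{(definition of conditional probability)}

p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}
\qquad \text{(the same posterior, written via Bayes' theorem)}
```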
