The AstroStat Slog » prior http://hea-www.harvard.edu/AstroStat/slog Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

The chance that A has nukes is p% http://hea-www.harvard.edu/AstroStat/slog/2009/the-chance-that-a-has-nukes-is-p/ Fri, 23 Oct 2009 17:26:07 +0000 hlee

I watched a movie in which one of the characters said, “country A has nukes with an 80% chance” (perhaps it was not 80%, but it was a high percentage). Another claim in that episode was that people will refuse to eat lettuce if even a 1% chance of E. coli contamination is reported, or lower. Therefore, with such a high chance of having nukes, it is right to send troops to A. This episode immediately brought to mind astronomers’ null hypothesis probability and their ways of drawing conclusions from chi-square goodness of fit tests, likelihood ratio tests, or F-tests.

First of all, I’d like to ask how you would estimate the chance that a country has nukes. What does this 80% imply here? But before getting to that question, I’d like to discuss computing the chance of E. coli infection first.

From the frequentist’s perspective, computing the chance of E. coli infection means investigating a sample of lettuce and counting the infected ones: n is the number infected and N is the total sample size. 1% means one among 100. Such percentage reports and their uncertainties are a familiar sight to everyone during election periods. From the Bayesian perspective, Pr(p|D) ∝ L(D|p) π(p); by properly choosing the likelihood and the prior, one can estimate the chance of E. coli infection and its uncertainty. Understanding the sampled lettuce and prior knowledge help to determine the likelihood and the prior.
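As a concrete (and entirely made-up) illustration of the two computations, here is a minimal Python sketch: a frequentist proportion with its normal-approximation standard error, and a conjugate Beta posterior for the same binomial count. The numbers n = 1 and N = 100 are hypothetical.

```python
import math

# Frequentist: n infected heads of lettuce out of N sampled.
n, N = 1, 100
p_hat = n / N                                   # point estimate: 1%
se = math.sqrt(p_hat * (1 - p_hat) / N)         # normal-approximation standard error

# Bayesian: with a Beta(a, b) prior, the posterior for a binomial
# proportion is Beta(a + n, b + N - n) by conjugacy.
a, b = 1, 1                                     # flat prior
a_post, b_post = a + n, b + N - n
post_mean = a_post / (a_post + b_post)
post_var = (a_post * b_post) / ((a_post + b_post) ** 2 * (a_post + b_post + 1))

print(p_hat, se)
print(post_mean, math.sqrt(post_var))
```

Both routes give a chance near 1% with a quantified uncertainty, but only because repeated sampling of lettuce is actually possible.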

How about the chance that country A has nukes? Do we have replicates of country A, so that a committee could investigate each copy and count the ones with nukes to compute the chance? We cannot do that. The traditional frequentist approach, based on counting, does not work here. Either a fiducial likelihood approach or a Bayesian approach, i.e., carefully choosing an adequate likelihood function (priors are only for the Bayesian), allows one to compute such a probability of interest. In other words, these computed chances depend heavily on the choice of model and are very subjective.

So, here’s my concern. It seems that astronomers want to know the chance that their spectral data are described by a model (A*B+C)*D (each letter stands for one of the models listed in Sherpa Models). This is more like computing the chance that country A has nukes, not counting frequencies of an event’s occurrence. On the other hand, a p-value from goodness of fit tests, LRTs, or F-tests is a number from the traditional frequentist counting approach. In other words, the p-value accounts for how many times one would observe the event (say, reduced chi^2 > 1.2), under the null hypothesis (the (A*B+C)*D model is the right choice, so that residuals are Gaussian), if the experiment were repeated N times. The problem is that we have only a single experiment and one spectrum with which to verify that (A*B+C)*D is true. A goodness of fit test or LRT only tells us the goodness or badness of the model, not a statistically and objectively quantified chance.
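To make the frequency interpretation of the p-value explicit, here is a toy simulation (not the computation any fitting package actually performs): under the null, the chi-square statistic is drawn repeatedly, and the p-value is simply the fraction of replications whose reduced chi^2 exceeds the threshold. The 50 degrees of freedom are a made-up stand-in for a binned spectrum.

```python
import random
random.seed(1)

dof = 50           # hypothetical: spectrum bins minus fitted parameters
threshold = 1.2    # the reduced chi^2 cut mentioned above
trials = 20000

# Under the null (model correct, Gaussian residuals), the chi^2 statistic
# is a sum of dof squared standard normals.  The p-value is the long-run
# frequency of exceeding the threshold in repeated experiments.
exceed = 0
for _ in range(trials):
    stat = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(dof))
    if stat / dof > threshold:
        exceed += 1

print(exceed / trials)   # the p-value as a long-run frequency under the null
```

The point of the simulation is exactly the complaint above: this number describes hypothetical replications of the experiment, not the chance that the model itself is true.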

In order to know the chance of the model (A*B+C)*D, like the chance that A has nukes, one should not rely on p-values. If you have multiple models, you can compute pairwise relative chances, i.e., odds ratios or Bayes factors. However, this does not provide the uncertainty of that chance (astronomers have a tendency to report uncertainties for any point estimate even when the procedure is statistically meaningless and the quantified uncertainty is not a statistical uncertainty, as in using delta chi^2 = 1 to report 68% confidence intervals). There are various model selection criteria that cater to the various conditions embedded in data for making the right choice among candidate models. In addition, post-inference for astronomical models is still a very difficult problem.
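A minimal sketch of such a pairwise comparison, under assumed toy models: a single Gaussian measurement, testing mu = 0 exactly against a Gaussian prior on mu. The numbers are hypothetical, and real spectral models would need honest marginal likelihoods.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# One measurement y with unit Gaussian noise (a stand-in for real data).
y = 1.5

# H0: mu = 0 exactly.  H1: mu ~ N(0, tau2) prior, so the marginal
# likelihood of y under H1 is N(0, 1 + tau2).
tau2 = 4.0
m0 = normal_pdf(y, 0.0, 1.0)          # evidence under H0
m1 = normal_pdf(y, 0.0, 1.0 + tau2)   # evidence under H1

bayes_factor = m1 / m0
print(bayes_factor)   # relative chance of H1 vs H0, given equal prior odds
```

Note that the answer shifts with the choice of tau2, illustrating how such relative chances depend on the prior and the model, exactly as argued above.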

Reporting the proper chance of (A*B+C)*D requires more elaborate statistical modeling, which always brings fierce discussions between frequentists and Bayesians over priors and likelihoods. Although it can be a very boring process, I want astronomers to leave this problem to statisticians instead of using inappropriate test statistics and making creative interpretations of them.

Please keep this question in mind when you report a probability: what kind of chance are you computing? The chance of E. coli infection? Or the chance that A has nukes? Make sure you understand that a p-value from a data analysis package does not tell you that the chance of the model (A*B+C)*D is (one minus the p-value)%. You don’t want to report one minus the p-value of a chi-square test statistic as the chance that A has nukes.

gamma function (Equation of the Week) http://hea-www.harvard.edu/AstroStat/slog/2008/eotw-gamma-function/ Tue, 06 May 2008 22:12:45 +0000 vlk

The gamma distribution [not the Gamma function -- note the upper-case G -- which is related to the factorial] is one of those insanely useful functions that, once one finds out about it, one wonders “why haven’t we been using this all the time?” It is defined on the non-negative real line, is a highly flexible function that can emulate almost any kind of skewness in a distribution, and is a perfect complement to the Poisson likelihood. In fact, it is the conjugate prior to the Poisson likelihood, and is therefore a natural choice for a prior in all cases that start off with counts.

The gamma distribution is defined with two parameters, alpha and beta, over the non-negative real line. alpha can be any real number greater than 0, unlike in the Poisson likelihood where the equivalent quantity is an integer (values of alpha between 0 and 1 are possible -- the density then diverges at the origin but remains integrable -- while for alpha <= 0 it ceases to be integrable), and beta is any number greater than 0.

The mean is alpha/beta and the variance is alpha/beta^2. Conversely, given a sample whose mean and variance are known, one can estimate alpha and beta to describe that sample with this distribution: beta = mean/variance and alpha = mean^2/variance.
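The moment-matching step can be sketched in a few lines of Python (the sample moments below are made up):

```python
# Method of moments for the gamma distribution: mean m = alpha/beta and
# variance v = alpha/beta^2 invert to beta = m/v and alpha = m^2/v.
def gamma_from_moments(m, v):
    beta = m / v
    alpha = m * beta   # = m**2 / v
    return alpha, beta

alpha, beta = gamma_from_moments(4.0, 2.0)
print(alpha, beta)   # (8.0, 2.0): check that 8/2 = 4 and 8/2**2 = 2
```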

This is reminiscent of the Poisson distribution, where alpha ~ number of counts and beta is akin to the collecting area or the exposure time. For this reason, a popular non-informative prior to use with the Poisson likelihood is gamma(alpha=1, beta=0), which is like saying “we expect to detect 0 counts in 0 time”. (Which, btw, is not the same as saying we detect 0 counts in an observation.) [Edit: see Tom Loredo's comments below for more on this.] Surprisingly, you can get less informative than even that, but that’s a discussion for another time.

Because it is the conjugate prior to the Poisson, it is also a useful choice as an informative prior. It makes derivations of formulae that much easier, though one has to be careful about using it blindly in real-world applications, as the presence of background can muck up the pristine Poissonness of the prior (as we discovered while applying BEHR to Chandra Level3 products).
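The conjugate update itself is one line. Here is a minimal sketch with hypothetical counts and exposure, starting from the gamma(1, 0) prior mentioned above:

```python
# Conjugacy: a gamma(alpha, beta) prior combined with N Poisson counts in
# exposure t gives a gamma(alpha + N, beta + t) posterior for the intensity.
def update(alpha, beta, counts, exposure):
    return alpha + counts, beta + exposure

# gamma(1, 0) prior, then 15 counts in an exposure of 5 (arbitrary units):
a, b = update(1.0, 0.0, counts=15, exposure=5.0)
post_mean = a / b        # 16/5 = 3.2 counts per unit exposure
post_var = a / b ** 2    # 16/25 = 0.64
print(post_mean, post_var)
```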

[ArXiv] A fast Bayesian object detection http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-a-fast-bayesian-object-detection/ Wed, 05 Mar 2008 21:46:48 +0000 hlee

This is quite a long paper that I separated out from [ArXiv] 4th week, Feb. 2008:
      [astro-ph:0802.3916] P. Carvalho, G. Rocha, & M. P. Hobson
      A fast Bayesian approach to discrete object detection in astronomical datasets – PowellSnakes I
As the title suggests, it describes Bayesian source detection, and it gave me a chance to learn the foundations of source detection in astronomy.

First, I’d like to point out that my initial concerns from [astro-ph:0707.1611] Probabilistic Cross-Identification of Astronomical Sources are addressed in sections 2, 3, and 6, which discuss parameter space, its dimensionality, and priors in Bayesian model selection.

Second, I’ll concisely list the contents of the paper as follows: (1) priors, of various types, with room left for further investigation in the future; (2) templates (such as the point spread function, I guess), crucial for defining sources, and a Gaussian random field for the noise; (3) optimization strategies for fast computation (source detection implies finding maxima and integrating for the evidence); (4) comparison with other works; (5) an upper bound, tuning the acceptance/rejection threshold to minimize the symmetric loss; (6) the challenges of dealing with likelihoods in Fourier space when incorporating colored noise (as opposed to white noise); (7) decision theory for trading off false negatives (undetected objects) and false positives (spurious objects). Many issues in computing the Bayesian evidence, priors, posteriors for the tuning parameters, and the peaks of maximum likelihood, as well as approximating templates and backgrounds, are carefully presented. The conclusion summarizes their PowellSnakes algorithm pictorially.

Thirdly, although my understanding of object detection and its link to Bayesian techniques is superficial, reading this paper tells me that they propose some clever ways of searching the full 4-dimensional space via Powell minimization (it seems related to profile likelihoods for fast computation, though this was not explicitly mentioned), and the details could direct statisticians’ attention toward improving computational efficiency and acceleration.
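As a rough illustration of what a derivative-free Powell search over such a parameter space looks like -- this is not the PowellSnakes implementation, just a generic sketch using scipy’s Powell minimizer on a synthetic, noiseless blob with hypothetical parameters (position, amplitude, width):

```python
import numpy as np
from scipy.optimize import minimize

# A toy "image": one Gaussian blob on a 32x32 grid (no noise, for clarity).
yy, xx = np.mgrid[0:32, 0:32]

def template(p):
    x0, y0, amp, width = p
    return amp * np.exp(-((xx - x0) ** 2 + (yy - y0) ** 2) / (2 * width ** 2))

truth = (20.0, 12.0, 5.0, 2.5)
image = template(truth)

# For Gaussian noise, the negative log-likelihood reduces to squared residuals.
def nll(p):
    return np.sum((image - template(p)) ** 2)

# Powell's method: successive line searches along conjugate directions,
# one way to explore the 4-dimensional (position, flux, size) space.
fit = minimize(nll, x0=(16.0, 16.0, 3.0, 3.0), method="Powell")
print(fit.x)   # should land near the true (20, 12, 5, 2.5)
```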

Fourth, I’d like to share some new knowledge I acquired from this paper about errors in astronomy. Statisticians are usually surprised that astronomical catalogs in general come with errors next to single measurements. These errors are not measurement errors (errors calculated from repeated observations) but are obtained from the Fisher information via the Cramér-Rao lower bound. The template likelihood function leads to this uncertainty measure for each observation.
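The familiar sqrt(N) error bar on a Poisson count is the simplest instance of this: the Fisher information for a Poisson rate mu is 1/mu, so the Cramér-Rao bound on the variance is mu, and evaluating it at the maximum-likelihood estimate mu_hat = N gives sqrt(N). A trivial sketch:

```python
import math

# Single-measurement error bar from the Fisher information of the Poisson
# likelihood: I(mu) = 1/mu, so the CRLB variance is mu, and at the MLE
# mu_hat = N the quoted error is sqrt(N) -- no repeated observations needed.
def crlb_error(counts):
    return math.sqrt(counts)

N = 100
print(N, "+/-", crlb_error(N))   # 100 +/- 10.0
```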

Lastly, in astronomy there are many empirical rules, effects, and laws that bear uncommon names. Generally these are sophisticated rules of thumb or approximations of some phenomenon (for instance, Hubble’s law, though that one is well known), but they tend to drive statisticians away from reading astronomy papers. On the other hand, despite the overwhelming names, when it gets to the point, the objective of invoking such names is very statistical: regression (fitting), estimating parameters and their uncertainties, goodness-of-fit, truncated data, fast optimization algorithms, machine learning, etc. This paper mentions the Sunyaev-Zel’dovich effect, whose name scared me, but I’d like to emphasize that this kind of nomenclature may hinder understanding of the details but should not block any collaboration.

[ArXiv] 4th week, Nov. 2007 http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-4th-week-nov-2007/ Sat, 24 Nov 2007 13:26:40 +0000 hlee

A piece of thought from my stay in Korea: just as not many statisticians looking for data-driven problems are interested in modern astronomy, not many astronomers borrowing statistics for their data analysis are learning up-to-date statistics. Astronomers cite statistical journals about as rarely as statisticians introduce astronomical data-driven problems. I wonder how other fields lowered such barriers decades ago.

No matter what, there are preprints from this week that may help to shrink the chasm.

  • [stat.ME:0711.3236]
    Confidence intervals in regression utilizing prior information, P. Kabaila and K. Giri
  • [stat.ME:0711.3271]
    Computer model validation with functional output, M. J. Bayarri, et al.
  • [astro-ph:0711.3266]
    Umbral Fine Structures in Sunspots Observed with Hinode Solar Optical Telescope, R. Kitai, et al.
  • [astro-ph:0711.2720]
    Magnification Probability Distribution Functions of Standard Candles in a Clumpy Universe, C. Yoo, et al.
  • [astro-ph:0711.3196]
    Upper Limits from HESS AGN Observations in 2005-2007, HESS Collaboration: F. Aharonian, et al.
  • [astro-ph:0711.2509]
    Shrinkage Estimation of the Power Spectrum Covariance Matrix, A. C. Pope and I. Szapudi
  • [astro-ph:0711.2631]
    Statistical properties of extragalactic sources in the New Extragalactic WMAP Point Source (NEWPS) catalogue, J. González-Nuevo, et al.
ab posteriori ad priori http://hea-www.harvard.edu/AstroStat/slog/2007/ab-posteriori-ad-priori/ Sat, 29 Sep 2007 22:03:57 +0000 vlk

A great advantage of Bayesian analysis, they say, is the ability to propagate the posterior. That is, if we derive a posterior probability distribution function for a parameter using one dataset, we can apply that as the prior when a new dataset comes along, and thereby improve our estimates of the parameter and shrink the error bars.

But how exactly does it work? I asked this of Tom Loredo in the context of some strange behavior of sequential applications of BEHR that Ian Evans had noticed: sequential applications of BEHR, using as prior the posterior from the preceding dataset, seemed to depend on the order in which the datasets were considered. (This, as it happens, arose from approximating the posterior distribution before passing it on as the prior to the next stage -- a feature that has now been corrected.) This is what he said:

Yes, this is a simple theorem. Suppose you have two data sets, D1 and D2, hypotheses H, and background info (model, etc.) I. Considering D2 to be the new piece of info, Bayes’s theorem is:

[1]

p(H|D1,D2) = p(H|D1) p(D2|H, D1)            ||  I
             -------------------
                    p(D2|D1)

where the “|| I” on the right is the “Skilling conditional” indicating that all the probabilities share an “I” on the right of the conditioning solidus (in fact, they also share a D1).

We can instead consider D1 to be the new piece of info; BT then reads:

[2]

p(H|D1,D2) = p(H|D2) p(D1|H, D2)            ||  I
             -------------------
                    p(D1|D2)

Now go back to [1], and use BT on the p(H|D1) factor:

p(H|D1,D2) = p(H) p(D1|H) p(D2|H, D1)            ||  I
             ------------------------
                    p(D1) p(D2|D1)

           = p(H, D1, D2)
             ------------      (by the product rule)
                p(D1,D2)

Do the same to [2]: use BT on the p(H|D2) factor:

p(H|D1,D2) = p(H) p(D2|H) p(D1|H, D2)            ||  I
             ------------------------
                    p(D2) p(D1|D2)

           = p(H, D1, D2)
             ------------      (by the product rule)
                p(D1,D2)

So the results from the two orderings are the same. In fact, in the Cox-Jaynes approach, the “axioms” of probability aren’t axioms, but get derived from desiderata that guarantee this kind of internal consistency of one’s calculations. So this is a very fundamental symmetry.
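The symmetry is easy to check numerically. Here is a minimal Beta-Binomial sketch (with hypothetical counts) showing that the two update orders, and the single joint update, agree exactly:

```python
# Sequential Beta-Binomial updating: because the data are conditionally
# independent given the parameter, updating with D1 then D2 must match
# updating with D2 then D1, and both must match the joint update.
def update(a, b, successes, trials):
    return a + successes, b + trials - successes

prior = (1.0, 1.0)
D1 = (7, 10)    # 7 successes in 10 trials
D2 = (2, 12)    # 2 successes in 12 trials

via_12 = update(*update(*prior, *D1), *D2)
via_21 = update(*update(*prior, *D2), *D1)
joint = update(*prior, D1[0] + D2[0], D1[1] + D2[1])

print(via_12, via_21, joint)   # all three identical: (10.0, 14.0)
```

This is the conditionally independent case; as noted below, shared nuisance parameters can break the simple factorization even though the underlying theorem still holds.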

Note that you have to worry about possible dependence between the data (i.e., p(D2|H, D1) appears in [1], not just p(D2|H)). In practice, separate data are often independent (conditional on H), so p(D2|H, D1) = p(D2|H) (i.e., if you consider H as specified, then D1 tells you nothing about D2 that you don’t already know from H). This is the case, e.g., for basic iid normal data, or Poisson counts. But even in these cases dependences might arise, e.g., if there are nuisance parameters that are common for the two data sets (if you try to combine the info by multiplying *marginalized* posteriors, you may get into trouble; you may need to marginalize *after* multiplying if nuisance parameters are shared, or account for dependence some other way).

What if you had 3, 4, …, N observations? Does the order in which you apply BT affect the results?

No, as long as you use BT correctly and don’t ignore any dependences that might arise.

if not, is there a prescription on what is the Right Thing [TM] to do?

Always obey the laws of probability theory! 9-)

[ArXiv] Bayesian Star Formation Study, July 13, 2007 http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-bayesian-star-formation-study/ Mon, 16 Jul 2007 19:31:13 +0000 hlee

From arxiv/astro-ph:0707.2064v1
Star Formation via the Little Guy: A Bayesian Study of Ultracool Dwarf Imaging Surveys for Companions by P. R. Allen.

I will skip the technical details on ultracool dwarfs and binary stars, and the reviews of star formation studies (such as the initial mass function (IMF)) and astronomical surveys, which Allen explains fairly in arxiv/astro-ph:0707.2064v1. Instead, I want to emphasize that, based on a simple Bayes’ rule and careful set-ups of likelihoods and priors suited to the data (ultracool dwarfs), quite informative conclusions were drawn:

  1. the peak at q~1 is significant,
  2. a lack of companions at separations greater than 15-20 A.U. (an astronomical unit is the distance between the Sun and the Earth),
  3. fewer binary stars at later spectral types,
  4. inconsistency of undetected low mass ratio systems with the current data, and
  5. 30% of spectroscopic binaries come from ultracool binaries.

Before asking for further observational efforts, 75% is noted as the upper limit on the ultracool binary population.
