The AstroStat Slog » Bayesian
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

coin toss with a twist
http://hea-www.harvard.edu/AstroStat/slog/2010/coin-toss-with-a-twist/
vlk, Sun, 26 Dec 2010

Here’s a cool illustration of how to use Bayesian analysis in the limit of very little data, when inferences are necessarily dominated by the prior. The question, via Tom Moertel, is: suppose I tell you that a coin always comes up heads, and you proceed to toss it and it does come up heads — how much more do you believe me now?

He also has the answer worked out in detail.
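
For a quick numerical feel, here is a minimal sketch (my toy setup, not Moertel's worked solution): it assumes the only alternative to the always-heads claim is a fair coin, and a prior belief p that the claim is true.

```python
# Toy version of the coin-toss update. Assumptions mine: the alternative to
# the "always heads" claim is a fair coin, and p is the prior belief in the claim.

def posterior_belief(p, n_heads):
    """Belief in the always-heads claim after seeing n_heads heads in a row."""
    like_claim = 1.0             # P(all heads | claim is true)
    like_fair = 0.5 ** n_heads   # P(all heads | ordinary fair coin)
    return p * like_claim / (p * like_claim + (1 - p) * like_fair)

for p in (0.01, 0.1, 0.5):
    print(p, "->", posterior_belief(p, n_heads=1))
# A single head at most doubles the odds in favor of the claim, so a
# skeptical prior barely budges -- the prior dominates, as the post says.
```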

(h/t Doug Burke)

[Q] Objectivity and Frequentist Statistics
http://hea-www.harvard.edu/AstroStat/slog/2008/q-objectivity-frequentist/
vlk, Mon, 29 Sep 2008

Is there an objective method to combine measurements of the same quantity obtained with different instruments?

Suppose you have a set of N1 measurements obtained with one detector, and another set of N2 measurements obtained with a second detector. And let’s say you want something as simple as an estimate of the mean of the quantity (say the intensity) being measured. Let us further stipulate that the measurement errors of the individual points are similar in magnitude and that neither instrument displays any odd behavior. How does one combine the two datasets without appealing to subjective biases about the reliability or otherwise of the two instruments?

We’ve mentioned this problem before, but I don’t think there’s been a satisfactory answer.

The simplest thing to do would be to simply pool all the measurements into one dataset with N=N1+N2 measurements and compute the mean that way. But if the number of points in each dataset is very different, the simple combined sample mean is actually a statement of bias in favor of the dataset with more measurements.

In a Bayesian context, there seems to be at least a well-defined prescription: define a model, compute the posterior probability density for the model parameters using dataset 1 using some non-informative prior, use this posterior density as the prior density in the next step, where a new posterior density is computed from dataset 2.
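
As a concrete (if idealized) illustration, here is a minimal sketch of that prescription; it assumes Gaussian measurements with a known, common error sigma and a nearly flat initial prior on the mean, none of which comes from the post itself.

```python
# Minimal sketch of sequential Bayesian updating, assuming Gaussian
# measurements with known, common error sigma and a near-flat prior on the
# mean mu. Under these (strong!) assumptions the posterior is Gaussian and
# can be updated in closed form.
import numpy as np

rng = np.random.default_rng(42)
sigma = 1.0
d1 = rng.normal(5.0, sigma, size=100)  # detector 1: N1 = 100 points
d2 = rng.normal(5.0, sigma, size=10)   # detector 2: N2 = 10 points

def update(prior_mean, prior_var, data, sigma):
    """Conjugate Gaussian update: returns posterior mean and variance of mu."""
    like_var = sigma**2 / len(data)
    post_var = 1.0 / (1.0 / prior_var + 1.0 / like_var)
    post_mean = post_var * (prior_mean / prior_var + data.mean() / like_var)
    return post_mean, post_var

m, v = update(0.0, 1e12, d1, sigma)   # step 1: near-flat prior, dataset 1
m, v = update(m, v, d2, sigma)        # step 2: step-1 posterior as the prior
print(m, np.concatenate([d1, d2]).mean())
# The two numbers agree: sequential updating reproduces the pooled mean.
```

Note that under these assumptions the sequential posterior mean reproduces the pooled sample mean exactly, so the prescription is well defined but does not escape the N-weighting discussed above; the subjectivity merely moves into the choice of prior and likelihood.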

What does one do in the frequentist universe?

[Update 9/30] After considerable discussion, it seems clear that there is no way to do this without making some assumption about the reliability of the detectors. In other words, disinterested objectivity is a mirage.

Quote of the Date
http://hea-www.harvard.edu/AstroStat/slog/2008/quote-of-the-date/
vlk, Tue, 01 Apr 2008

Really, there is no point in extracting a sentence here and there; go read the whole thing:

Why I don’t like Bayesian Statistics

- Andrew Gelman

Oh, alright, here’s one:

I can’t keep track of what all those Bayesians are doing nowadays–unfortunately, all sorts of people are being seduced by the promises of automatic inference through the “magic of MCMC”–but I wish they would all just stop already and get back to doing statistics the way it should be done, back in the old days when a p-value stood for something, when a confidence interval meant what it said, and statistical bias was something to eliminate, not something to embrace.


And before you panic, note the date.

Books – a boring title
http://hea-www.harvard.edu/AstroStat/slog/2008/books-a-boring-title/
hlee, Fri, 25 Jan 2008

I have been observing certain misconceptions about statistics and about how statistical nomenclature has evolved in astronomy, which I believe are attributable to the lack of references within the astronomical community. There are textbooks designed for junior/senior science and engineering students that are likely unknown to astronomers, although their examples are not always suitable for astronomy, to my knowledge. I never expect astronomers to work through standard graduate (mathematical) statistics textbooks, but I do wish astronomers would go beyond Numerical Recipes (W. H. Press, S. A. Teukolsky, W. T. Vetterling, & B. P. Flannery) and Data Reduction and Error Analysis for the Physical Sciences (P. R. Bevington & D. K. Robinson). Here are some good ones written by astronomers, engineers, and statisticians:

The motivation for writing this post originated with Vinay’s recommendation: Practical Statistics for Astronomers (J. V. Wall and C. R. Jenkins), which provides many statistical insights and caveats that astronomers tend to ignore. Without looking at the error distribution and the properties of the data, astronomers jump into chi-square and correlation. Anyone who reads this book will become careful about adopting the statistics of common practice in astronomy, which were developed many decades ago and founded on strong assumptions that are not compatible with modern data sets. The book addresses many concerns that have been growing in my mind about astronomers, and introduces various statistical methods applicable to astronomy.

The viewpoints of astronomers who have had no in-class statistics education but who have read this book in full would differ from mine. The book mentions unbiasedness, consistency, closedness, and robustness of statistics, properties that are normally neither discussed nor proved in astronomy papers. Such readers may therefore miss the insights, caveats, and between-the-lines content of the book, which I care about. To reduce that gap, and for a quick and easy understanding of classical statistics, I recommend The Cartoon Guide to Statistics (Larry Gonick & Woollcott Smith) as a first step. This cartoon book teaches the fundamentals of statistics in a fun and friendly manner, and provides everything that rudimentary textbooks offer.

If someone wants to go beyond classical statistics (so-called frequentist statistics) and learn about the popular Bayesian statistics, astronomy professor Phil Gregory’s Bayesian Logical Data Analysis for the Physical Sciences is recommended. For a bit more on the modern statistics of frequentists and Bayesians, All of Statistics (Larry Wasserman) is recommended. I realize that textbooks for non-statistics students are too thick to go through in a short time (the book I used for teaching senior engineering students at Penn State was Probability and Statistics for Engineering and the Sciences by Jay L. Devore, 4th and 5th editions, at about 600 pages; the current edition is 736 pages). One well-received textbook for graduate students in electrical engineering is Probability, Random Variables and Stochastic Processes (A. Papoulis & S. U. Pillai). I remember that the book offers a less abstract definition of measure along with practical examples (personally, its treatment of Hermite polynomials was useful).

For a casual reading about statistics and its 20th century history, The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century (D. Salsburg) is quite nice.

Statistics is not just for best-fit analysis and error bars. It is a wonderful telescope that extracts correct information when it is operated carefully, according to the manual, and pointed at the right target. When statistics is properly understood, it gets rid of atmospheric and other blurring factors. It is neither a black box nor magic, contrary to what many people think.

The era of treating everything as Gaussian ended decades ago. Thanks to the central limit theorem and the delta method (log-transformation is a good example), many statistics asymptotically follow the normal (Gaussian) distribution, but there are various other families of distributions. Because of possible bias in the chi-square method, the error bar cannot guarantee the nominal coverage, such as 95%. There are also nonparametric statistics, known for robustness; they may be less efficient than statistics built on a distribution-family assumption, but they do not require that model assumption. And Bayesian statistics works wonderfully, provided that correct information on priors, suitable likelihood models, and the computing power for hierarchical models and numerical integration are available.
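
As a small illustration of the delta-method remark above, here is a Monte Carlo sketch (model and numbers are mine, not from any of the books): for data with mean mu and variance sigma^2, the delta method predicts Var(log(xbar)) is approximately sigma^2/(n mu^2).

```python
# A minimal Monte Carlo sketch (all numbers mine) of the delta method with a
# log-transformation: if sqrt(n)*(xbar - mu) -> N(0, sigma^2), then
# sqrt(n)*(log(xbar) - log(mu)) -> N(0, sigma^2 / mu^2).
import numpy as np

rng = np.random.default_rng(1)
mu, n, reps = 2.0, 50, 20000
# Exponential data: mean mu, variance mu^2 -- decidedly non-Gaussian.
xbars = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)

print(np.var(np.log(xbars)))   # simulated variance of log(xbar)
print(mu**2 / (n * mu**2))     # delta-method prediction: sigma^2 / (n * mu^2)
# The two agree closely, and log(xbar) is noticeably more symmetric than xbar.
```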

Before jumping into chi-square for fitting and testing at the same time, and to avoid introducing bias, exploratory data analysis is required: it leads to a better understanding of the data and helps one seek a suitable statistic and check its assumptions. Exploratory data analysis starts from simple scatter plots and box plots. A little statistical care with data, and a genuine interest in the workings of statistical methods, is all I am asking for. I do wish that these books could help realize those wishes.
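
For concreteness, here is a tiny sketch of those first EDA steps, using made-up stand-in data (the variable names are mine, purely illustrative):

```python
# A tiny sketch of the first EDA steps named above -- a scatter plot and a
# box plot -- on invented stand-in data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
flux = rng.lognormal(mean=1.0, sigma=0.5, size=200)     # skewed, like real fluxes
hardness = 0.3 * np.log(flux) + rng.normal(0, 0.1, 200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(flux, hardness, s=8)    # any trend? outliers? curvature?
ax1.set_xlabel("flux")
ax1.set_ylabel("hardness")
ax2.boxplot(flux)                   # skewness and outliers at a glance
ax2.set_ylabel("flux")
plt.tight_layout()
plt.show()
```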

—————————————————————————-
[1.] Most of the links to books are to amazon.com, but there is no personal affiliation with the company.

[2.] In addition to the previous posting on chi-square, what is so special about chi square in astronomy, I’d like to mention the possible bias in chi-square fitting and testing. It is well known that using the same data set both for fitting (which yields the parameter estimates known in astronomy as best-fit values and error bars) and for testing based on those estimates introduces bias: the best fit is displaced from the true parameter value, and the error bar does not attain the aimed-for coverage. See Aneta’s post, an example of chi2 bias in fitting x-ray spectra; a toy demonstration appears after these notes.

[3.] More book recommendations are welcome.
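
Here is the toy Monte Carlo of the bias mentioned in note [2] (a deliberately simplified stand-in for Aneta's X-ray example, assumptions mine): fitting a constant rate to Poisson counts by minimizing the chi-square with data-based weights, sum((x - mu)^2 / x), yields the harmonic mean, which sits systematically below the truth.

```python
# Toy demonstration of chi-square fitting bias with data-based (Neyman)
# weights. Minimizing sum((x - mu)^2 / x) over mu gives mu_hat = n / sum(1/x),
# the harmonic mean, which is biased low for Poisson data.
import numpy as np

rng = np.random.default_rng(3)
mu_true, nbins, reps = 25.0, 100, 2000

fits = []
for _ in range(reps):
    x = rng.poisson(mu_true, nbins)
    x = x[x > 0]                           # 1/x weights are undefined at zero
    fits.append(len(x) / np.sum(1.0 / x))  # closed-form chi-square minimizer
print(np.mean(fits))   # ~ mu_true - 1: biased low by about one count per bin
print(mu_true)
```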

[ArXiv] 3rd week, Jan. 2008
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-3rd-week-jan-2008/
hlee, Fri, 18 Jan 2008

Seven preprints were chosen this week, and two mention model selection.

  • [astro-ph:0801.2186] Extrasolar planet detection by binary stellar eclipse timing: evidence for a third body around CM Draconis H.J.Deeg (it discusses model selection in section 4.4)
  • [astro-ph:0801.2156] Modeling a Maunder Minimum A. Brandenburg & E. A. Spiegel (it could be useful for those who do sunspot cycle modeling)
  • [astro-ph:0801.1914] A closer look at the indications of q-generalized Central Limit Theorem behavior in quasi-stationary states of the HMF model A. Pluchino, A. Rapisarda, & C. Tsallis
  • [astro-ph:0801.2383] Observational Constraints on the Dependence of Radio-Quiet Quasar X-ray Emission on Black Hole Mass and Accretion Rate B. C. Kelly et al.
  • [astro-ph:0801.2410] Finding Galaxy Groups In Photometric Redshift Space: the Probability Friends-of-Friends (pFoF) Algorithm I. Li & H. K. C. Yee
  • [astro-ph:0801.2591] Characterizing the Orbital Eccentricities of Transiting Extrasolar Planets with Photometric Observations E. B. Ford, S. N. Quinn, & D. Veras
  • [astro-ph:0801.2598] Is the anti-correlation between the X-ray variability amplitude and black hole mass of AGNs intrinsic? Y. Liu & S. N. Zhang
[Quote] Bootstrap and MCMC
http://hea-www.harvard.edu/AstroStat/slog/2007/quote-bootstrap-vs-mcmc/
hlee, Tue, 01 Jan 2008

The Bootstrap and Modern Statistics, Brad Efron (2000), JASA Vol. 95 (452), pp. 1293-1296.

If the bootstrap is an automatic processor for frequentist inference, then MCMC is its Bayesian counterpart.


Sometime in my second year of studying statistics, I remarked that the bootstrap and MCMC are equivalent procedures that merely reflect different streams in statistics. The response to this comment was “that’s nonsense.” Although I have forgotten the details of the circumstance, I was hurt and did not try to defend myself. Years later, that occasion immediately came back to the surface upon seeing this sentence.

[ArXiv] 3rd week, Dec. 2007
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-3rd-week-dec-2007/
hlee, Fri, 21 Dec 2007

The paper about the Banff challenge [0712.2708] and the statistics tutorial for cosmologists [0712.3028] are my personal recommendations from this week’s [arXiv] list. In particular, I’d like to quote from Licia Verde’s [astro-ph:0712.3028]:

In general, Cosmologists are Bayesians and High Energy Physicists are Frequentists.

I thought it was the opposite. By the way, if you crave more papers, read on:

  • [astro-ph:0712.2544]
    RHESSI Microflare Statistics II. X-ray Imaging, Spectroscopy & Energy Distributions I. G. Hannah et al.

  • [stat.AP:0712.2708]
    The Banff Challenge: Statistical Detection of a Noisy Signal A. C. Davison & N. Sartori

  • [astro-ph:0712.2898]
    A study of supervised classification of Hipparcos variable stars using PCA and Support Vector Machines P.G. Willemsen & L. Eyer

  • [astro-ph:0712.2961]
    The frequency distribution of the height above the Galactic plane for the novae M. Burlak

  • [astro-ph:0712.3028]
    A practical guide to Basic Statistical Techniques for Data Analysis in Cosmology L. Verde

  • [astro-ph:0712.3049]
    ZOBOV: a parameter-free void-finding algorithm M. C. Neyrinck

  • [stat.CO:0712.3056]
    Gibbs Sampling for a Bayesian Hierarchical Version of the General Linear Mixed Model A. A. Johnson & G L. Jones

[ArXiv] 4th week, Oct. 2007
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-4th-week-oct-2007/
hlee, Fri, 26 Oct 2007

I hope a paper or two from this week’s arXiv grabs your attention and stimulates your thoughts in astrostatistics.

  • [stat.ML:0710.3742]
    Bayesian Online Change Point Detection by R. Adams and D. MacKay
  • [astro-ph:0710.3600]
    Statistical Methods for Investigating the Cosmic Ray Energy Spectrum by J. Hague, B. Becker, M. Gold, J. Matthews, and J. Urbář
  • [astro-ph:0710.3618]
    Fast algorithms for matching CCD images to a stellar catalogue by V. Tabur
  • [astro-ph:0710.4019]
    A principal component analysis approach to the morphology of Planetary Nebulae by S. Akras and P. Boumis
  • [astro-ph:0710.4020]
    Dice and Pulsars by V. M. Kontorovich
  • [astro-ph:0710.4075]
    Getting More From Your Multicore: Exploiting OpenMP for Astronomy by M. S. Noble
  • [astro-ph:0710.4143]
    Lensing and Supernovae: Quantifying The Bias on the Dark Energy Equation of State by D. Sarkar and A. Amblard
  • [astro-ph:0710.4158]
    A Cross-Match of 2MASS and SDSS: Newly-Found L and T Dwarfs and an Estimate of the Space Density of T Dwarfs by S. Metchev, et al.
  • [astro-ph:0710.4262]
    Crowded-Field Astrometry with the Space Interferometry Mission – I. Estimating the Single-Measurement Astrometric Bias Arising from Confusion by R. Sridharan and R. Allen
  • [astro-ph:0710.4556]
    X-Ray Binaries and the Current Dynamical States of Galactic Globular Clusters by J. M. Fregeau
  • [stat.ME]
    The Use of Unlabeled Data in Predictive Modeling by F. Liang, S. Mukherjee, and M. West
ab posteriori ad priori
http://hea-www.harvard.edu/AstroStat/slog/2007/ab-posteriori-ad-priori/
vlk, Sat, 29 Sep 2007

A great advantage of Bayesian analysis, they say, is the ability to propagate the posterior. That is, if we derive a posterior probability distribution function for a parameter using one dataset, we can apply that as the prior when a new dataset comes along, and thereby improve our estimates of the parameter and shrink the error bars.

But how exactly does it work? I asked this of Tom Loredo in the context of some strange behavior of sequential applications of BEHR that Ian Evans had noticed: sequential applications of BEHR, using as prior the posterior from the preceding dataset, seemed to depend on the order in which the datasets were considered. (That behavior, as it happens, arose from approximating the posterior distribution before passing it on as the prior to the next stage, a feature that has since been corrected.) This is what he said:

Yes, this is a simple theorem. Suppose you have two data sets, D1 and D2, hypotheses H, and background info (model, etc.) I. Considering D2 to be the new piece of info, Bayes’s theorem is:

[1]

p(H|D1,D2) = p(H|D1) p(D2|H, D1)            ||  I
             -------------------
                    p(D2|D1)

where the “|| I” on the right is the “Skilling conditional” indicating that all the probabilities share an “I” on the right of the conditioning solidus (in fact, they also share a D1).

We can instead consider D1 to be the new piece of info; BT then reads:

[2]

p(H|D1,D2) = p(H|D2) p(D1|H, D2)            ||  I
             -------------------
                    p(D1|D2)

Now go back to [1], and use BT on the p(H|D1) factor:

p(H|D1,D2) = p(H) p(D1|H) p(D2|H, D1)            ||  I
             ------------------------
                    p(D1) p(D2|D1)

           = p(H, D1, D2)
             ------------      (by the product rule)
                p(D1,D2)

Do the same to [2]: use BT on the p(H|D2) factor:

p(H|D1,D2) = p(H) p(D2|H) p(D1|H, D2)            ||  I
             ------------------------
                    p(D2) p(D1|D2)

           = p(H, D1, D2)
             ------------      (by the product rule)
                p(D1,D2)

So the results from the two orderings are the same. In fact, in the Cox-Jaynes approach, the “axioms” of probability aren’t axioms, but get derived from desiderata that guarantee this kind of internal consistency of one’s calculations. So this is a very fundamental symmetry.

Note that you have to worry about possible dependence between the data (i.e., p(D2|H, D1) appears in [1], not just p(D2|H)). In practice, separate data are often independent (conditional on H), so p(D2|H, D1) = p(D2|H) (i.e., if you consider H as specified, then D1 tells you nothing about D2 that you don’t already know from H). This is the case, e.g., for basic iid normal data, or Poisson counts. But even in these cases dependences might arise, e.g., if there are nuisance parameters that are common for the two data sets (if you try to combine the info by multiplying *marginalized* posteriors, you may get into trouble; you may need to marginalize *after* multiplying if nuisance parameters are shared, or account for dependence some other way).

What if you had 3, 4, …, N observations? Does the order in which you apply BT affect the results?

No, as long as you use BT correctly and don’t ignore any dependences that might arise.

If not, is there a prescription for the Right Thing [TM] to do?

Always obey the laws of probability theory! 9-)
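
To make the theorem concrete, here is a minimal numerical check (the model and numbers are my own choosing): the conjugate Beta-Binomial case, where the data are conditionally independent given H, so the dependence caveat above does not bite.

```python
# Numerical check of the order-invariance derived above, using the conjugate
# Beta-Binomial model: a Beta(a, b) prior updated with coin-flip data D1 then
# D2 must match the D2-then-D1 ordering, and both must match one combined update.
heads1, tails1 = 7, 3    # dataset D1
heads2, tails2 = 1, 9    # dataset D2
a0, b0 = 1.0, 1.0        # flat Beta prior

def update(a, b, heads, tails):
    """Beta-Binomial conjugate update: the posterior is Beta(a+heads, b+tails)."""
    return a + heads, b + tails

order12 = update(*update(a0, b0, heads1, tails1), heads2, tails2)
order21 = update(*update(a0, b0, heads2, tails2), heads1, tails1)
combined = update(a0, b0, heads1 + heads2, tails1 + tails2)
print(order12, order21, combined)   # all three are Beta(9, 13)
```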

When you observed zero counts, you didn’t not observe any counts
http://hea-www.harvard.edu/AstroStat/slog/2007/zero-counts/
vlk, Mon, 24 Sep 2007

Dong-Woo, who has been playing with BEHR, noticed that the confidence bounds quoted on the source intensities seem to be unchanged when the source counts are zero, regardless of what the background counts are set to. That is, p(s|NS,NB) is invariant when NS=0, for any value of NB. This seems a bit odd, because [naively] one expects that as NB increases, it should become more and more likely that s is close to 0.

Suppose you compute the posterior probability distribution of the intensity of a source, s, when the data include counts in a source region (NS) and counts in a background region (NB). When NS=0, i.e., no counts are observed in the source region,

p(s|NS=0, NB) = (1+b)^a / Gamma(a) * s^(a-1) * e^(-s*(1+b)),

where a,b are the parameters of a gamma prior.

Why does NB have no effect? Because when you have zero counts in the source region, the entire effect of the background goes toward evaluating how good the chosen model is (so it becomes a model comparison problem, not a parameter estimation one), and not into estimating the parameter of interest, the source intensity. That is, it goes into the normalization factor of the probability distribution, p(NS,NB). The parts that depend on NB cancel out when the expression for p(s|NS,NB) is written out, because the shape is independent of NB and the pdf must integrate to 1.
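
To see the cancellation numerically, here is a grid-based sketch. The model below (source counts NS ~ Poisson(s + theta), background counts NB ~ Poisson(r*theta), with made-up gamma priors) is my stand-in for the BEHR setup, not its actual implementation.

```python
# Grid-based check that the normalized posterior of s does not depend on NB
# when NS = 0. Model is a stand-in for BEHR's, with made-up prior parameters.
import numpy as np
from scipy.stats import poisson, gamma

s = np.linspace(1e-3, 15, 400)        # source intensity grid
theta = np.linspace(1e-3, 60, 600)    # background intensity grid
S, T = np.meshgrid(s, theta, indexing="ij")
r = 1.0                               # background-to-source area ratio
prior = gamma.pdf(S, a=1.0) * gamma.pdf(T, a=1.0)   # made-up gamma priors

def posterior_s(NS, NB):
    joint = poisson.pmf(NS, S + T) * poisson.pmf(NB, r * T) * prior
    marg = joint.sum(axis=1)          # marginalize over the background theta
    return marg / np.trapz(marg, s)   # normalized shape of p(s | NS, NB)

print(np.allclose(posterior_s(0, 0), posterior_s(0, 20)))   # True: NB drops out
```

In this toy model the NB-dependent integral over theta is a constant in s, so it divides out in the normalization, exactly as argued above.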

No doubt this is obvious, but I hadn’t noticed it before.

PS: Also shows why upper limits should not be identified with upper confidence bounds.
