The AstroStat Slog » arXiv

iFish in the archive

vlk — Tue, 10 Mar 2009 16:41:42 +0000

The iPhone App Store has a couple of apps that make life significantly easier for those of us inundated and overwhelmed by the stream of daily arXiv preprints. These are ArXivReader.app and ArXiv.app, both providing a means to browse and search the arXiv preprint database ~~and both selling for 99c~~ with the first selling for 99c and the second free. Check them out! The former even lets you save papers for off-line reading.

For me at least, the hardest part of going through the arXiv emails every day was to pick out the interesting papers in the deluge of text. These apps do the right thing and segregate the categories and highlight the titles. Fitts’ Law in action — suddenly the daily ritual is orders of magnitude more pleasant!

Did they, or didn’t they?

vlk — Tue, 20 May 2008 04:10:23 +0000

Earlier this year, Peter Edmonds showed me a press release that the Chandra folks were, at the time, considering putting out, describing the possible identification of a Type Ia Supernova progenitor. What appeared to be an accreting white dwarf binary system could be discerned in 4-year-old observations, coincident with the location of a supernova that went off in November 2007 (SN2007on). An amazing discovery, but there is a hitch.

And it is a statistical hitch, and involves two otherwise highly reliable and oft-used methods giving contradictory answers at nearly the same significance level! Does this mean that the chances are actually 50-50? Really, we need a bona fide statistician to take a look and point out the error of our ways.

The first time around, Voss & Nelemans (arXiv:0802.2082) looked at how many X-ray sources there were around the candidate progenitor of SN2007on (they also looked at 4 more galaxies that hosted Type Ia SNe and that had X-ray data taken prior to the event, but didn’t find any other candidates), and estimated the probability of chance coincidence with the optical position. When you expect 2.2 X-ray sources/arcmin² near the optical source, the probability of finding one within 1.3 arcsec is tiny, and in fact is around 0.3%. This result has since been reported in Nature.
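The chance-coincidence number can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the X-ray sources are scattered uniformly and independently (a Poisson field), which may or may not be exactly what Voss & Nelemans did; the function name is made up for illustration.

```python
import math

def chance_coincidence_probability(density_per_arcmin2, radius_arcsec):
    """P(at least one unrelated source within the radius) for a uniform Poisson field.

    Assumption: sources are spread uniformly at random with the given surface density.
    """
    radius_arcmin = radius_arcsec / 60.0
    expected_sources = density_per_arcmin2 * math.pi * radius_arcmin ** 2
    # 1 - exp(-mu), written with expm1 to stay accurate when mu is small
    return -math.expm1(-expected_sources)

# Numbers quoted in the post: 2.2 sources/arcmin^2 and a 1.3 arcsec match radius
p_chance = chance_coincidence_probability(2.2, 1.3)
print(f"chance coincidence probability: {100 * p_chance:.2f}%")
```

With the quoted density and radius this comes out near 0.3%, consistent with the figure above.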

However, Roelofs et al. (arXiv:0802.2097) went about getting better optical positions and doing better bore-sighting, and as a result, they measured the X-ray position accurately and also carried out Monte Carlo simulations to estimate the error on the measured location. And they concluded that the actual separation, 1.18±0.27 arcsec, is too large, given the measurement error in the location, for the two positions to coincide. The probability ~~that the two locations are the same~~ of finding offsets in the observed range is ~1% [see Tom's clarifying comment below].

Well now, ain’t that a nice pickle?

To recap: there are so few X-ray sources in the vicinity of the supernova that anything close to its optical position cannot be a coincidence, BUT, the measured error in the position of the X-ray source is not copacetic with the optical position. So the question for statisticians now: which argument do you believe? Or is there a way to reconcile these two calculations?

Oh, and just to complicate matters, the X-ray source that was present 4 years ago had disappeared when looked for in December, as one would expect if it was indeed the progenitor. But on the other hand, a lot of things can happen in 4 years, even with astronomical sources, so that doesn’t really confirm a physical link.

Is 8-sigma significant enough for you?

vlk — Thu, 24 Apr 2008 18:56:58 +0000

There is a new report from Bernabei et al. (arXiv:0804.2741) of the direct detection of the effects of Dark Matter that is causing a lot of buzz. (The Bad Astronomer has a good summary.) They find yearly modulation in their detected scintillation rate that matches what you would expect if the Earth were rushing through Galactic Dark Matter as it goes around the Sun. They have worked out the significance of the modulation to be 8.2 sigma. Significant! But significant of what?

I am no expert on this in a hundred and one different ways. But I feel kinda sorry for the DAMA group. Certainly, at 8-sigma, it is easy to accept that there is modulation. But is this modulation proof of Dark Matter? Astronomers in general are extremely suspicious of any yearly modulations, as we have learnt from hard experience that they are an extremely common source of systematic error. Essentially, the Earth is a poorly calibrated detector, and it has diurnal and annual cycles, and these invariably show up in everything. So when the signal you are looking for goes exactly like the first thing you try to catch and eliminate, what price statistical significance?

~ Avalanche(a,b)

vlk — Sun, 14 Oct 2007 01:14:21 +0000

Avalanches are a common process, occurring anywhere that a system can store stress temporarily without “snapping”. They can happen on sand dunes and in solar flares as easily as in the snowbound Alps.

Melatos, Peralta, & Wyithe (arXiv:0710.1021) have a nice summary of avalanche processes in the context of pulsar glitches. Their primary purpose is to show that the glitches are indeed consistent with an avalanche, and along the way they give a highly readable description of what an avalanche is and what it entails. Briefly, avalanches result in event parameters that are distributed in a scale-invariant fashion (read: power laws) with exponential waiting time distributions (i.e., Poisson).

Hence the title of this post: the “Avalanche distribution” (indulge me! I’m using stats notation to bury complications!) can be thought to have two parameters, both describing the indices of power-law distributions that control the event sizes, a, and the event durations, b, and where the event separations are distributed as an exponential decay. Is there a canned statistical distribution that describes all this already? (In our work modeling stellar flares, we assumed that b=0 and found that ~~a>2~~ a<-2, which has all sorts of nice consequences for coronal heating processes.)
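Whether or not a canned distribution exists, draws from such an "Avalanche(a,b)" process are easy to simulate. The sketch below is only an illustration under assumptions that are not in the post: sizes, durations, and waiting times are taken to be independent, and both power laws are truncated to finite ranges (`size_range`, `duration_range`) so that any index, including b=0, can be normalized. All function and parameter names are hypothetical.

```python
import math
import random

def sample_power_law(index, x_min, x_max, rng):
    """Draw from p(x) proportional to x**index on [x_min, x_max] by inverting the CDF."""
    u = rng.random()
    if index == -1.0:
        # Special case: the CDF is logarithmic, so x is log-uniform
        return x_min * (x_max / x_min) ** u
    k = index + 1.0
    return (x_min ** k + u * (x_max ** k - x_min ** k)) ** (1.0 / k)

def simulate_avalanches(a, b, rate, n_events, rng,
                        size_range=(1.0, 1e4), duration_range=(1.0, 1e2)):
    """Return a list of (time, size, duration) tuples.

    Assumptions (for illustration only): sizes follow a power law with index a,
    durations a power law with index b, and events arrive as a Poisson process,
    i.e. with exponentially distributed waiting times of mean 1/rate.
    """
    events = []
    t = 0.0
    for _ in range(n_events):
        t += rng.expovariate(rate)  # exponential waiting time
        size = sample_power_law(a, size_range[0], size_range[1], rng)
        duration = sample_power_law(b, duration_range[0], duration_range[1], rng)
        events.append((t, size, duration))
    return events

rng = random.Random(42)
events = simulate_avalanches(a=-2.5, b=0.0, rate=2.0, n_events=5000, rng=rng)
mean_wait = events[-1][0] / len(events)
print(f"mean waiting time: {mean_wait:.3f} (expected about {1 / 2.0:.3f})")
```

With b=0 the truncated "power law" for durations reduces to a uniform distribution over the chosen range, which is one way to read the b=0 assumption; other readings are possible.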

“you are biased, I have an informative prior”

vlk — Wed, 10 Oct 2007 16:26:27 +0000

Hyunsook drew attention to this paper (arXiv:0709.4531v1) by Brad Schaefer on the underdispersed measurements of the distance to the LMC. He makes a compelling case that since 2002 published numbers in the literature have been hewing to an “acceptable number”, possibly in an unconscious effort to pass muster with their referees. Essentially, the distribution of the best-fit distances is much more closely clustered than you would expect from the quoted sizes of the error bars.
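One simple way to quantify "more closely clustered than the error bars allow" is a chi-square about the weighted mean. The sketch below is not Schaefer's analysis, and the `values` and `errors` are invented purely for illustration (they are not taken from the paper). It assumes independent Gaussian errors and estimates by simulation how often honest scatter would look this tight.

```python
import random

def weighted_mean(values, errors):
    """Inverse-variance weighted mean."""
    weights = [1.0 / e ** 2 for e in errors]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def chi_square(values, errors):
    """Chi-square of the values about their own weighted mean."""
    mean = weighted_mean(values, errors)
    return sum(((v - mean) / e) ** 2 for v, e in zip(values, errors))

def fraction_as_tight(values, errors, n_sim, rng):
    """Fraction of simulated data sets whose scatter is at least as small as observed.

    Each simulated set draws independent Gaussian deviates with the quoted errors.
    """
    observed = chi_square(values, errors)
    count = 0
    for _ in range(n_sim):
        fake = [rng.gauss(0.0, e) for e in errors]
        if chi_square(fake, errors) <= observed:
            count += 1
    return count / n_sim

# INVENTED example numbers, not real measurements
values = [18.48, 18.52, 18.50, 18.49, 18.51, 18.53, 18.47, 18.50]
errors = [0.10] * len(values)

reduced_chi2 = chi_square(values, errors) / (len(values) - 1)
fraction = fraction_as_tight(values, errors, 20000, random.Random(0))
print(f"reduced chi-square: {reduced_chi2:.3f}")
print(f"fraction of simulations at least this tight: {fraction:.5f}")
```

A reduced chi-square far below 1, together with a tiny simulated fraction, is the signature of underdispersion. It cannot say whether the cause is correlated analyses, inflated error bars, or bandwagoning.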

To be sure, there are other possible reasons for this underdispersion, such as correlations in how the data are gathered and analyzed, and an overly conservative estimation of error bars, etc. In fact, the most benign explanation is probably in how people carry out “sanity checks” and tend to discard or explain away or correct the data that give odd results.

While this is indeed worrisome, I am inclined to think that this is not wrong per se, but rather a case where a fully Bayesian analysis would give the “right” coverage. After all, there does exist a strong prior that people are bringing into the analysis, but are not including in the calculations of the widths of the posterior probability distributions. Including such a highly informative prior will of course shrink the sizes of the error bars and make everything consistent. That is, I think that the assumption needs to be explicit, that is all. Is that bias? Bandwagon? Or prior belief?
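The shrinking of the error bars is easiest to see in the simplest textbook case. This is a worked illustration only, assuming a single Gaussian measurement $x \pm \sigma_x$ and a Gaussian prior $\mu_0 \pm \sigma_0$ (symbols introduced here, not in the paper):

```latex
% Gaussian likelihood (x, \sigma_x) combined with a Gaussian prior (\mu_0, \sigma_0)
\frac{1}{\sigma_{\rm post}^2} = \frac{1}{\sigma_x^2} + \frac{1}{\sigma_0^2},
\qquad
\mu_{\rm post} = \sigma_{\rm post}^2
  \left( \frac{x}{\sigma_x^2} + \frac{\mu_0}{\sigma_0^2} \right)
```

The posterior width is always smaller than both $\sigma_x$ and $\sigma_0$. For example, a prior exactly as tight as the measurement ($\sigma_0 = \sigma_x$) gives $\sigma_{\rm post} = \sigma_x/\sqrt{2}$, and the posterior mean is pulled toward the "acceptable number" $\mu_0$.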

Betraying your heritage

vlk — Thu, 20 Sep 2007 16:26:07 +0000

[arXiv:0709.3093v1] Short Timescale Coronal Variability in Capella (Kashyap & Posson-Brown)

We recently submitted that paper to AJ, and rather ironically, I did the analysis during the same time frame as this discussion was going on, about how astronomers cannot rely on repeating observations. Ironic because the result reported there hinges on the existence of a small but persistent signal that is found in repeated observations of the same source. Doubly ironic in fact, in that just as we were backing and forthing about cultural differences I seemed to have gone and done something completely contrary to my heritage!

btw, this paper is interesting because Capella is a strong X-ray source, and “everybody believes” that such sources should exhibit some variability, so finding such shouldn’t be a big deal, and yet Capella itself has been remarkably stable and had all this while defied the characterization and even the detection of such variability. Even now, the estimated magnitude of the variability fraction is rather small. It’s a good thing that we had some 22 counts/sec over 205 kiloseconds to play with.

Spurious Sources

vlk — Wed, 19 Sep 2007 18:21:57 +0000

[arXiv:0709.2358] Cleaning the USNO-B Catalog through automatic detection of optical artifacts, by Barron et al.

Statistically speaking, “false sources” are generally in the domain of ~~Type II~~ Type I errors, defined by the probability of detecting a signal where there is none. But what if there is a clear signal, but it is not real?

In astronomical analysis, sources are generally defined with reference to the existing background, as point-fluctuations that exceed some significance threshold defined by the estimated background “in the vicinity”. The threshold is usually set such that we can tolerate “a few” false positives at borderline significance. But that ignores the effect of systematic deviations that can be caused by various instrumental features. Such things are common in X-ray images — window support structures, chip gaps, bad CCD columns, cosmic-ray hits, etc. Optical data are generally cleaner, but by no means immune to the problem. Barron et al. here describe how they have gone through the USNO-B catalog and have modeled and eliminated artifacts coming from diffraction spikes and telescope reflection halos of bright stars.

The bad news? More than 2.3% of the sources are flagged as spurious. Compare to the typical statistical significance at which the detection thresholds are set (usually >3sigma).
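To put the two numbers side by side, here is a small sketch that converts between a Gaussian "n-sigma" threshold and a one-sided tail probability. It assumes Gaussian statistics, and the comparison is loose: a detection threshold is a per-trial false-positive probability, while 2.3% is a fraction of catalog entries. The helper names are made up.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def one_sided_tail(n_sigma):
    """Probability of a Gaussian fluctuation exceeding n_sigma on one side."""
    return 1.0 - standard_normal.cdf(n_sigma)

def equivalent_sigma(fraction):
    """One-sided Gaussian threshold whose tail probability equals the given fraction."""
    return standard_normal.inv_cdf(1.0 - fraction)

p_3sigma = one_sided_tail(3.0)
z_spurious = equivalent_sigma(0.023)
print(f"one-sided tail at 3 sigma: {100 * p_3sigma:.3f}%")
print(f"a 2.3% spurious fraction corresponds to about {z_spurious:.2f} sigma")
```

Under these assumptions the 3-sigma tail is roughly an order of magnitude smaller than the 2.3% flagged as artifacts, so the systematics would dominate the purely statistical false positives.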

Wrong Priors?

vlk — Mon, 10 Sep 2007 16:15:31 +0000

arXiv:0709.1067v1 : Wrong Priors (Carlos C. Rodriguez)

This came through today on astro-ph, suggesting that we could be choosing priors better than we do, and in fact that we generally do a very bad job of it. I have been brought up to believe that, like points in Whose Line Is It Anyway, priors don’t matter (unless you have very little data), so I am somewhat confused. What is going on here?

An alternative to MCMC?

vlk — Sun, 19 Aug 2007 04:31:09 +0000

I think of Markov-Chain Monte Carlo (MCMC) as a kind of directed staggering about, a random walk with a goal. (Sort of like driving in Boston.) It is conceptually simple to grasp as a way to explore the posterior probability distribution of the parameters of interest by sampling only where it is worth sampling from. Thus, a major savings from brute force Monte Carlo, and far more robust than downhill fitting programs. It also gives you the error bar on the parameter for free. What could be better?
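For concreteness, the "random walk with a goal" can be sketched as a bare-bones random-walk Metropolis sampler. This is a toy under stated assumptions: the target is a made-up one-parameter Gaussian posterior (mean 3, width 2), and the step size and burn-in length are arbitrary choices, not recommendations.

```python
import math
import random

def log_posterior(theta):
    """Toy target (an assumption): unnormalized Gaussian, mean 3, standard deviation 2."""
    return -0.5 * ((theta - 3.0) / 2.0) ** 2

def metropolis(log_p, theta0, n_steps, step_size, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    chain = []
    theta, lp = theta0, log_p(theta0)
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step_size)
        lp_new = log_p(proposal)
        # Accept with probability min(1, p(proposal) / p(current))
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            theta, lp = proposal, lp_new
        chain.append(theta)  # a rejected step repeats the current value
    return chain

rng = random.Random(1)
chain = metropolis(log_posterior, theta0=-10.0, n_steps=20000, step_size=2.0, rng=rng)
samples = chain[2000:]  # drop the burn-in while the walk finds the mode

post_mean = sum(samples) / len(samples)
post_std = math.sqrt(sum((s - post_mean) ** 2 for s in samples) / (len(samples) - 1))
print(f"posterior mean ~ {post_mean:.2f}, error bar ~ {post_std:.2f}")
```

The spread of the retained samples is the "free" error bar: no separate error analysis is run after the fit.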

Feroz & Hobson (2007, arXiv:0704.3704) describe a technique called Nested Sampling (Skilling 2004), one that could give MCMC a run for its money. It takes the one inefficient part of MCMC — the burn-in phase — and turns that into a virtue. The way it seems to work is to keep track of how the parameter space is traversed as the model parameters {theta} reach the mode of the posterior, and to take the sequence of likelihoods thus obtained L(theta), and turn it around to get theta(L). Neat.
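A toy version of the basic algorithm may help fix ideas. This sketch follows the standard textbook form of Skilling's nested sampling, not the clustered variant of Feroz & Hobson. It tracks the prior mass X enclosed by each likelihood contour rather than constructing theta(L) explicitly. The one-parameter Gaussian likelihood, the uniform prior on [-5, 5], the brute-force rejection step, and all names are assumptions made for illustration.

```python
import math
import random

PRIOR_LO, PRIOR_HI = -5.0, 5.0  # assumed uniform prior range

def likelihood(theta):
    """Toy likelihood (an assumption): unit Gaussian centered on zero."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

def nested_sampling(n_live, n_iter, rng):
    """Estimate the evidence Z = integral of L(theta) * prior(theta) d(theta)."""
    live = [rng.uniform(PRIOR_LO, PRIOR_HI) for _ in range(n_live)]
    live_like = [likelihood(t) for t in live]
    evidence = 0.0
    x_prev = 1.0  # prior mass still enclosed; starts at the full prior
    for i in range(1, n_iter + 1):
        worst = min(range(n_live), key=live_like.__getitem__)
        l_star = live_like[worst]
        # Standard approximation: the enclosed prior mass shrinks as exp(-i / n_live)
        x_i = math.exp(-i / n_live)
        evidence += l_star * (x_prev - x_i)
        x_prev = x_i
        # Replace the worst point with a prior draw at higher likelihood.
        # Plain rejection sampling is only workable for a toy problem like this one.
        while True:
            theta = rng.uniform(PRIOR_LO, PRIOR_HI)
            l_new = likelihood(theta)
            if l_new > l_star:
                break
        live[worst], live_like[worst] = theta, l_new
    # Add the contribution of the remaining live points
    evidence += x_prev * sum(live_like) / n_live
    return evidence

evidence = nested_sampling(n_live=100, n_iter=700, rng=random.Random(3))
print(f"ln Z estimate: {math.log(evidence):.2f} (analytic value: {math.log(0.1):.2f})")
```

The discarded points, weighted by likelihood times the prior mass each shell represents, double as posterior samples. The rejection step in the inner loop is where the real difficulty hides once the likelihood surface gets complicated.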

Two big (computational) problems that I see are (1) the calculation of theta(L), and (2) the sampling to discard the tail of L(theta). The former, it seems to me, becomes intractable exactly when the likelihood surface gets complicated. For the latter, again, it seems you have to run through just as many iterations as in MCMC to get a decent sample size. Of course, if you have a good theta(L), it does seem to be an improvement over MCMC in that you won’t need to run the chains multiple times to make sure you catch all the modes.

I think the main advantage of MCMC is that it produces and keeps track of marginalized posteriors for each parameter, whereas in this case, you have to essentially keep a full list of samples from the joint posterior and then marginalize over it yourself. The larger the sample size, the harder this gets, and in fact it is a bit difficult to tell whether the nested sampling method is competing with MCMC or Monte Carlo integration.

Is there any reason why this should not be combined with MCMC? i.e., can we use nested sampling from the burn-in phase to figure out the proposal distributions for Metropolis or Metropolis-Hastings in situ, and get the best of both worlds?

Bend it like Poisson

vlk — Wed, 13 Jun 2007 12:41:03 +0000

I don’t know why astro-ph thought this article on the statistics of football dynamics (Mendes, Malacarne, Anteneodo 2007; physics/0706.1758) was relevant to me and emailed the abstract, but I’m glad they did, because they deal with a question I have wrestled with for a long time: how to figure out the underlying distribution that controls a stochastic process. In 2002ApJ...580.1118K, we dealt with modeling the photon arrival time differences as due to flares occurring at random times but with a power-law intensity distribution with index alpha. physics/0706.1758 deals with time-between-touches and tries to characterize that distribution itself in terms of a number of “phases” beta. From a quick reading, it appears that their beta are our flares, and they restrict all flares to have the same intensity. Despite the restriction, this is interesting because it is an analytical estimation that points a way towards speeding up our flare distribution fitting process, which currently is based on a Monte Carlo grid search method, not the fastest way to do things.