Archive for June 2007

Quote of the Week, June 27, 2007

I want to use this short quote by Andrew Gelman to highlight the many interesting topics at the recent Third Workshop on Monte Carlo Methods. It is part of Andrew Gelman’s emphasis on the fundamental importance of thinking through priors. He argues that “non-informative” priors (explicit, as in Bayes, or implicit, as in some other methods) can in fact be highly constraining, and that weakly informative priors are more honest. At his talk on Monday, May 14, 2007, Andrew Gelman explained:

You want to supply enough structure to let the data speak,
but that’s a tricky thing.

[ArXiv] Classical confidence intervals, June 25, 2007

Comments on the unified approach to the construction of classical confidence intervals

This paper comments on classical confidence intervals and upper limits in the context of the so-called flip-flopping problem. The two are related asymptotically (when n is large enough) by definition, but one cannot be converted into the other while preserving the same coverage, because of the Poisson nature of the data.

[ArXiv] Kernel Regression, June 20, 2007

One of the papers from arxiv/astro-ph, astro-ph/0706.2704, discusses kernel regression and model selection for determining photometric redshifts. The authors choose the kernel bandwidth via 10-fold cross-validation, select models from various combinations of input parameters by estimating the root mean square error and AIC, and compare their kernel regression with other regression and classification methods using root mean square errors from a literature survey. They conclude that kernel regression is flexible, particularly for data at high z.
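To make the bandwidth-selection step concrete, here is a minimal sketch of 10-fold cross-validation for a one-dimensional kernel regression. It is not the paper's code: the Nadaraya-Watson estimator, the Gaussian kernel, the toy sine-curve data, and all function names are my own assumptions for illustration.

```python
import math
import random


def nw_predict(x_train, y_train, x0, h):
    """Nadaraya-Watson estimate at x0 with a Gaussian kernel of bandwidth h."""
    weights = [math.exp(-0.5 * ((x0 - xi) / h) ** 2) for xi in x_train]
    total = sum(weights)
    if total == 0.0:
        # Every weight underflowed: fall back to the training mean.
        return sum(y_train) / len(y_train)
    return sum(w * yi for w, yi in zip(weights, y_train)) / total


def cv_rmse(x, y, h, k=10):
    """Root mean square prediction error from k-fold cross-validation."""
    n = len(x)
    sq_err = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th point forms one fold
        x_tr = [x[i] for i in range(n) if i not in test_idx]
        y_tr = [y[i] for i in range(n) if i not in test_idx]
        for i in test_idx:
            sq_err += (nw_predict(x_tr, y_tr, x[i], h) - y[i]) ** 2
    return math.sqrt(sq_err / n)


def select_bandwidth(x, y, grid, k=10):
    """Return the bandwidth in grid with the smallest cross-validated RMSE."""
    scores = {h: cv_rmse(x, y, h, k) for h in grid}
    return min(scores, key=scores.get), scores


# Toy data (hypothetical): a noisy sine curve standing in for a real relation.
rng = random.Random(0)
x = [rng.uniform(0.0, 3.0) for _ in range(200)]
y = [math.sin(2.0 * xi) + rng.gauss(0.0, 0.2) for xi in x]
grid = [0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0]
best_h, scores = select_bandwidth(x, y, grid)
print("selected bandwidth:", best_h)
```

A very small bandwidth chases the noise and a very large one flattens the curve, so the cross-validated error is smallest somewhere in between.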

Quote of the Week, June 20, 2007

These quotes are in the opposite spirit of the last two Bayesian quotes.
They are from the excellent “R”-based Tutorial on Non-Parametrics given by
Chad Schafer and Larry Wasserman at the 2006 SAMSI Special Semester on AstroStatistics (or here).

Chad and Larry were explaining trees:

For more sophisticated tree-searches, you might try Robert Nowak [and his former student, Becca Willett --- especially her "software" pages]. There is even Bayesian CART (Classification And Regression Trees). These can take 8 or 9 hours to “do it right”, via MCMC. BUT [these results] tend to be very close to [less rigorous] methods that take only minutes.

Trees are used primarily by doctors, for patients: it is much easier to follow a tree than a kernel estimator, in person.

Trees are much more ad-hoc than other methods we talked about, BUT they are very user friendly, very flexible.

In machine learning, which is only statistics done by computer scientists, they love trees.

[ArXiv] Solar Cycle, June 18, 2007

From arxiv/astro-ph: arXiv:0706.2590v1, Extreme Value Theory and the Solar Cycle, by Ramos, A. This paper might attract a lot of attention from CHASC members.

[ArXiv] Correlation Studies, June 12, 2007

One of the arxiv/astro-ph preprints, arxiv/0706.1703v1, discusses the correlation between galactic HI and the cosmic microwave background (CMB) and reports no statistically significant correlation.

[ArXiv] A Lecture Note, June 17, 2007

From arxiv/astro-ph:0706.1988,
Lectures on Astronomy, Astrophysics, and Cosmology looks helpful to statisticians who would like to learn astronomy, astrophysics, and cosmology. The lecture notes start by introducing the fundamentals of astronomy, UNITS!!!, and its history. They also explain astronomical measures such as distances and their units, luminosity, and temperature; the HR diagram (astronomers’ summary diagram); stellar evolution; and relevant topics in cosmology. At least a third of the article will be useful for grasping a rough idea of astronomy as a scientific subject beyond colorful pictures. Statisticians who are keen on cosmology are encouraged to read further.

This is not a high energy lecture note; therefore, statisticians interested in high energy are encouraged to visit Astro Jargon for Statisticians and CHASC.

Data Doctors

Terry Speed writes columns for the IMS Bulletin, and the June 2007 issue has Terence’s Stuff: Data Doctors (p. 7). He quotes Fisher, who described a statistician as a post-mortem examiner or a pathologist, but thinks that statisticians (statistical consultants) are doctors who maintain close, active, and living relationships with their patients.

Nonetheless, I think statisticians working with astronomers are assistants to post-mortem examiners. Most likely, neither statisticians nor astronomers can design experiments with unreachable objects. Astronomers are post-mortem examiners with telescopes, and statisticians are assistants with charts that are by-products of post-mortem examinations. These assistants may or may not be useful to astronomers.

Quote of the Week, June 12, 2007

This is the second in a series of quotes by Xiao Li Meng, from an introduction to Markov Chain Monte Carlo (MCMC), given to a room full of astronomers, as part of the April 25, 2006 joint meeting of Harvard’s “Stat 310” and the California-Harvard Astrostatistics Collaboration. This one has a long summary as the lead-in, but hang in there!

Summary first (from earlier in Xiao Li Meng’s presentation):

Let us tackle a harder problem, with the Metropolis Hastings Algorithm.
An example: a tougher distribution, not Normal in [at least one of the dimensions], and multi-modal… FIRST I propose a draw, from an approximate distribution. THEN I compare it to true distribution, using the ratio of proposal to target distribution. The next draw: tells whether to accept the new draw or stay with the old draw.

Our intuition:
1/ For original Metropolis algorithm, it looks “geometric” (In the example, we are sampling “x,z”; if the point falls under our xz curve, accept it.)

2/ The speed of algorithm depends on how close you are with the approximation. There is a trade-off with “stickiness”.
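For readers who want to see the propose-compare-accept loop spelled out, here is a minimal sketch of a Metropolis-Hastings sampler in which every proposal comes from one fixed approximating distribution (an "independence" sampler). The bimodal target, the wide Normal proposal, and all names are my own assumptions, not anything shown in the talk.

```python
import math
import random


def target_pdf(x):
    """Unnormalised bimodal target: equal mix of N(-2, 0.5^2) and N(2, 0.5^2)."""
    return (math.exp(-0.5 * ((x + 2.0) / 0.5) ** 2)
            + math.exp(-0.5 * ((x - 2.0) / 0.5) ** 2))


def proposal_pdf(x, scale):
    """Unnormalised N(0, scale^2) density: the approximating distribution."""
    return math.exp(-0.5 * (x / scale) ** 2)


def independence_mh(n_draws, scale=3.0, start=0.0, seed=1):
    """Metropolis-Hastings with proposals drawn independently of the current state."""
    rng = random.Random(seed)
    current = start
    draws = []
    accepted = 0
    for _ in range(n_draws):
        # FIRST: propose a draw from the approximate distribution.
        candidate = rng.gauss(0.0, scale)
        # THEN: compare target-to-proposal ratios at the candidate and current points.
        ratio = (target_pdf(candidate) / proposal_pdf(candidate, scale)) / (
            target_pdf(current) / proposal_pdf(current, scale))
        # A uniform draw decides whether to accept or stay with the old draw.
        if rng.random() < ratio:
            current = candidate
            accepted += 1
        draws.append(current)
    return draws, accepted / n_draws


draws, acceptance_rate = independence_mh(20000)
print("acceptance rate:", round(acceptance_rate, 2))
```

The closer the proposal is to the target, the more candidates are accepted; a poor approximation leaves the chain sitting on the same value for long stretches, which is the "stickiness" mentioned above.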

Practical questions:
How large should, say, N be? This is NOT AN EASY PROBLEM! The KEY difficulty: multiple modes in unknown area. We want to know all (major) modes first, as well as estimates of the surrounding areas… [To handle this,] don’t run a single chain; run multiple chains.
Look at between-chain variance; and within-chain variance. BUT there is no “foolproof” here… The starting point should be as broad as possible. Go somewhere crazy. Then combine, either simply as these are independent; or [in a more complicated way as in Meng and Gelman].
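One common diagnostic built from those two variances is the Gelman-Rubin potential scale reduction factor. The talk notes do not name it, so treat the sketch below as my assumption about what "look at between-chain and within-chain variance" can mean in practice. The random-walk sampler, the two-mode target, and the starting points are all hypothetical.

```python
import math
import random
import statistics


def log_target(x, mode):
    """Log density (up to a constant) of an equal mix of N(-mode, 0.5^2) and N(mode, 0.5^2)."""
    a = -0.5 * ((x + mode) / 0.5) ** 2
    b = -0.5 * ((x - mode) / 0.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def random_walk_chain(start, n_steps, mode, step=0.5, seed=0):
    """Random-walk Metropolis chain; the first half is discarded as burn-in."""
    rng = random.Random(seed)
    current = start
    current_lp = log_target(current, mode)
    chain = []
    for _ in range(n_steps):
        candidate = current + rng.gauss(0.0, step)
        cand_lp = log_target(candidate, mode)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < cand_lp - current_lp:
            current, current_lp = candidate, cand_lp
        chain.append(current)
    return chain[n_steps // 2:]


def potential_scale_reduction(chains):
    """Compare the pooled variance estimate with the mean within-chain variance."""
    n = len(chains[0])
    within = statistics.fmean(statistics.variance(c) for c in chains)
    between_over_n = statistics.variance([statistics.fmean(c) for c in chains])
    pooled = (n - 1) / n * within + between_over_n
    return math.sqrt(pooled / within)


def run(mode, starts=(-10.0, -3.0, 3.0, 10.0), n_steps=4000):
    """Run one chain per widely dispersed starting point and return the diagnostic."""
    chains = [random_walk_chain(s, n_steps, mode, seed=i) for i, s in enumerate(starts)]
    return potential_scale_reduction(chains)


rhat_single_mode = run(0.0)  # both components coincide: chains should agree
rhat_two_modes = run(6.0)    # far-apart modes: chains get trapped in different modes
print(round(rhat_single_mode, 3), round(rhat_two_modes, 3))
```

A value near 1 says the chains are indistinguishable; a large value says they disagree. As the quote warns, this is not foolproof: if every chain started near the same mode, the diagnostic would look fine while missing the other mode entirely.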

And here’s the Actual Quote of the Week:

[Astrophysicist] Aneta Siemiginowska: How do you make these proposals?

[Statistician] Xiao Li Meng: Call a professional statistician like me.
But seriously – it can be hard. But really you don’t need something perfect. You just need something decent.


A Link to Statistical Analysis for the Virtual Observatory is added. Its description with toy data is given at

A nice feature of this website is that the interface lets you perform various statistical analyses on a data set located either on your hard disk or at the Virtual Observatory.

Bend it like Poisson

I don’t know why astro-ph thought this article on the statistics of football dynamics (Mendes, Malacarne, Anteneodo 2007; physics/0706.1758) was relevant to me and emailed the abstract, but I’m glad they did, because they deal with a question I have wrestled with for a long time: how to figure out the underlying distribution that controls a stochastic process. In 2002ApJ…580.1118K, we dealt with modeling the photon arrival time differences as due to flares occurring at random times but with a power-law intensity distribution with index alpha. physics/0706.1758 deals with time-between-touches and tries to characterize that distribution itself in terms of a number of “phases” beta. From a quick reading, it appears that their beta are our flares, and they restrict all flares to have the same intensity. Despite the restriction, this is interesting because it is an analytical estimation that points a way towards speeding up our flare distribution fitting process, which currently is based on a Monte-Carlo grid search method, not the fastest way to do things.

Everything you wanted to know about power-laws but were afraid to ask

Clauset, Shalizi, & Newman (2007, arXiv/0706.1062) have a very detailed description of what power-law distributions are, how to recognize them, how to fit them, etc. They are also making available the MATLAB and R codes that they use to do the fitting and such.

Looks like a very handy reference text, though I am a bit uncertain about their use of the K-S test to check whether a dataset can be described with a power-law or not. It is probably fine; perhaps some statisticians would care to comment?
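To give a feel for what is involved, here is a minimal sketch of fitting a continuous power-law tail by maximum likelihood and measuring the K-S distance between the data and the fitted model. This is not the authors' released code. It assumes the lower cutoff xmin is known, and the function names and simulated sample are hypothetical.

```python
import math
import random


def fit_alpha(data, xmin):
    """Maximum-likelihood index for a continuous power law p(x) ~ x^(-alpha), x >= xmin."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    alpha = 1.0 + n / sum(math.log(x / xmin) for x in tail)
    return alpha, n


def ks_distance(data, xmin, alpha):
    """Largest gap between the empirical CDF of the tail and the fitted power-law CDF."""
    tail = sorted(x for x in data if x >= xmin)
    n = len(tail)
    d = 0.0
    for i, x in enumerate(tail):
        model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return d


# Simulated sample (hypothetical) drawn by inverting the power-law CDF.
rng = random.Random(42)
true_alpha, xmin = 2.5, 1.0
sample = [xmin * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
          for _ in range(5000)]
alpha_hat, n_tail = fit_alpha(sample, xmin)
d_stat = ks_distance(sample, xmin, alpha_hat)
print(round(alpha_hat, 3), round(d_stat, 4))
```

One caveat that bears on the K-S question: because alpha is estimated from the same data, the usual tabulated K-S p-values do not apply directly, so any p-value attached to this distance would have to be calibrated by simulation.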

GLAST Workshop on June 21 at Science Ctr

The GLAST workshop will be held at the Science Center (Hall A, located on the 1st floor) of Harvard University. It is a nice opportunity to learn about the GLAST mission and its programs. Registration is free and open to everyone. Please, visit for registration and further information.

Quote of the Week, June 5, 2007

This is one in a series of quotes by Xiao Li Meng, from an introduction to Markov Chain Monte Carlo (MCMC), given to a room full of astronomers, as part of the April 25, 2006 joint meeting of Harvard’s “Stat 310” and the California-Harvard Astrostatistics Collaboration:

These MCMC [Markov Chain Monte Carlo] methods are very general.
BUT anytime it is incredibly general, there is something to worry about.
The same is true for bootstrap – it is very general; and easy to misuse.

All your bias are belong to us

Leccardi & Molendi (2007) have a paper in A&A (astro-ph/0705.4199) discussing the biases in parameter estimation when spectral fitting is confronted with low-count data. Not surprisingly, they find that the bias is higher for lower counts, for standard chisq compared to C-stat, and for grouped data compared to ungrouped. Peter Freeman talked about something like this at the 2003 X-ray Astronomy School at Wallops Island (pdf1, pdf2), and no doubt part of the problem also has to do with the (un)reliability of the fitting process when the chisq surface gets complicated.
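The flavor of the effect can be seen in a toy problem far simpler than spectral fitting: estimating a constant Poisson rate from binned counts. This is only a sketch under my own assumptions, not the paper's simulation. Here "standard chisq" is taken to mean chi-square with the variance set to the observed counts (floored at 1), for which the best-fit constant is an inverse-count weighted mean, while the C-stat best fit is the plain mean.

```python
import math
import random


def poisson_draw(rng, mu):
    """Poisson variate by Knuth's multiplication method (fine for modest mu)."""
    limit = math.exp(-mu)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def fit_constant(counts):
    """Best-fit constant rate under the C statistic and under data-variance chi-square."""
    # C = 2 * sum(mu - n_i * ln(mu)) + const is minimised at the sample mean.
    cstat_mu = sum(counts) / len(counts)
    # chi^2 = sum((n_i - mu)^2 / max(n_i, 1)) is minimised at a weighted mean.
    weights = [1.0 / max(n, 1) for n in counts]
    chisq_mu = sum(w * n for w, n in zip(weights, counts)) / sum(weights)
    return cstat_mu, chisq_mu


def relative_bias(true_mu, n_bins=100, n_sims=300, seed=7):
    """Average (estimate - truth) / truth over many simulated data sets."""
    rng = random.Random(seed)
    cstat_sum = chisq_sum = 0.0
    for _ in range(n_sims):
        counts = [poisson_draw(rng, true_mu) for _ in range(n_bins)]
        c, x = fit_constant(counts)
        cstat_sum += c
        chisq_sum += x
    return ((cstat_sum / n_sims - true_mu) / true_mu,
            (chisq_sum / n_sims - true_mu) / true_mu)


low_cstat, low_chisq = relative_bias(2.0)     # about 2 counts per bin
high_cstat, high_chisq = relative_bias(50.0)  # about 50 counts per bin
print("C-stat:", round(low_cstat, 3), round(high_cstat, 3))
print("chisq: ", round(low_chisq, 3), round(high_chisq, 3))
```

In this toy case the chi-square estimate is pulled low because bins that fluctuate downward get larger weights, and the relative size of that pull grows as the counts per bin shrink.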

Anyway, they propose an empirical method to reduce the bias by computing the probability distribution functions (pdfs) for various simulations, and then averaging the pdfs in groups of 3. Seems to work, for reasons that escape me completely.

[Update: links to Peter's slides corrected]