The AstroStat Slog » aconnors

Quote of the Week, Oct 12, 2007

aconnors — Sat, 13 Oct 2007 00:04:08 +0000

This is an unusual Quote-of-the-week, in that I point you to [ABSTRACT] and a [VIDEO] of the recent talk at the Institute for Innovative Computing. See what you think!

Leland Wilkinson points out that flexible, generic plotting programs, such as SigmaPlot, allow one to make “ungrammatical” plots. That is, one could take a histogram of the log of one’s data, and then find the mean. That is meaningless, he points out. He persuasively argues that one can define a strict “grammar” of procedures to avoid such “meaningless” visualizations due to unthinkingly swapping …

Two summarizing quotes stand out:

I assert the Grammar of Graphics is THE grammar of statistical visualization.
…
There is always a model. In any statistical visualization, we always have a model.
What is it? What is the mathematical model?

Provocative Corollary to Andrew Gelman’s Folk Theorem

aconnors — Wed, 03 Oct 2007 20:08:43 +0000

This is a long comment on the October 3, 2007 Quote of the Week, by Andrew Gelman. His “folk theorem” ascribes computational difficulties to problems with one’s model.

My thoughts:
Model, for statisticians, has two meanings. A physicist or astronomer would automatically read this as pertaining to a model of the source, or physics, or sky. It has taken me a long time to be able to see it a little more from a statistics perspective, where it pertains to the full statistical model.

For example, in low-count high-energy physics, there had been a great deal of heated discussion over how to handle “negative confidence intervals”. (See for example PhyStat2003). That is, when using the statistical tools traditional to that community, one had such a large number of trials and such a low expected count rate that a significant number of “confidence intervals” for source intensity were wholly below zero. Further, there were more of these than expected (based on the assumptions in those traditional statistical tools). Statisticians such as David van Dyk pointed out that this was a sign of “model mis-match”. But (in my view) this was not understood at first — it was taken as a description of physics model mismatch. Of course what he (and others) meant was statistical model mismatch. That is, somewhere along the data-processing path, some Gauss-Normal assumptions had been made that were inaccurate for (essentially) low-count Poisson. If one took that into account, the whole “negative confidence interval” problem went away. In recent history, there has been a great deal of coordinated work to correct this and do all intervals properly.
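The effect is easy to reproduce. Here is a minimal sketch, not the analysis used in any of the PhyStat discussions: it assumes a known background of 3 expected counts, no source at all, and a naive 90% Gauss-Normal interval of the form (n - b) ± 1.645 sqrt(n). The function names and the numbers are made up for illustration.

```python
import math
import random


def naive_gaussian_interval(n_observed, background, z=1.645):
    """Gauss-Normal interval for source intensity: (n - b) +/- z * sqrt(n).

    This is the assumed 'traditional' recipe for the illustration, not a
    recommended method for low-count Poisson data.
    """
    estimate = n_observed - background
    half_width = z * math.sqrt(n_observed)
    return estimate - half_width, estimate + half_width


def poisson_draw(mean, rng):
    """Draw one Poisson variate by Knuth's multiplication method (small means only)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def fraction_wholly_negative(source, background, trials, seed=0):
    """Fraction of simulated trials whose whole interval lies below zero."""
    rng = random.Random(seed)
    negative = 0
    for _ in range(trials):
        n = poisson_draw(source + background, rng)
        _, upper = naive_gaussian_interval(n, background)
        if upper < 0:
            negative += 1
    return negative / trials


if __name__ == "__main__":
    print(fraction_wholly_negative(source=0.0, background=3.0, trials=20000))
```

Under these assumptions the interval is wholly negative whenever 0 or 1 counts are observed, which happens with probability 4 exp(-3), or about 20% of the time, against the roughly 5% one would expect if the Gauss-Normal calibration held.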

This brings me to my second point. I want to raise a provocative corollary to Gelman’s folk theorem:

When the “error bars” or “uncertainties” are very hard to calculate, it is usually because of a problem with the model, statistical or otherwise.

One can see this (I claim) in any method that allows one to get a nice “best estimate” or a nice “visualization”, but for which there is no clear procedure (or only an UNUSUALLY long one based on some kind of semi-parametric bootstrapping) for uncertainty estimates. This can be (not always!) a particular pitfall of “ad-hoc” methods, which may at first appear very speedy and/or visually compelling, but then may not have a statistics/probability structure through which to synthesize the significance of the results in an efficient way.

Quote of the Week, October 3, 2007

aconnors — Wed, 03 Oct 2007 16:09:15 +0000

From the ever-quotable Andrew Gelman comes this gem, which he calls a Folk Theorem:

When things are hard to compute, often the model doesn’t fit the data. Difficulties in computation are therefore often model problems… [When the computation isn't working] we have the duty and freedom to think about models.

From the introduction to his talk on May 14, 2007, at the Third Workshop on Monte Carlo Methods, Harvard University, 2007.

Quote of the Week, Aug 31, 2007

aconnors — Sat, 01 Sep 2007 03:47:06 +0000

Once again, the middle of a recent (Aug 30-31, 2007) argument within CHASC, on why physicists and astronomers view “3 sigma” results with suspicion and expect (roughly) > 5 sigma; while statisticians and biologists typically assume 95% is OK:

David van Dyk (representing statistics culture):

Can’t you look at it again? Collect more data?

Vinay Kashyap (representing astronomy and physics culture):

…I can confidently answer this question: no, alas, we usually cannot look at it again!!

Ah. Hmm. To rephrase [the question]: if you have a “7.5 sigma” feature, with a day-long [imaging Markov Chain Monte Carlo] run you can only show that it is “>3sigma”, but is it possible, even with that day-long run, to tell that the feature is really at 7.5sigma — is that the question? Well that would be nice, but I don’t understand how observing again will help?

David van Dyk :

No one believes any realistic test is properly calibrated that far into the tail. Using 5-sigma is really just a high bar, but the precise calibration will never be done. (This is a reason not to sweat the computation TOO much.)

Most other scientific areas set the bar lower (2 or 3 sigma) BUT don’t really believe the results unless they are replicated.

My assertion is that I find replicated results more convincing than extreme p-values. And the controversial part: Astronomers should aim for replication rather than worry about 5-sigma.
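For readers who want the numbers behind “3 sigma”, “5 sigma”, and “95%”: a small sketch of the Gaussian tail probabilities these thresholds correspond to. It assumes an exactly Normal test statistic, which is precisely the calibration David van Dyk doubts so far into the tail; the function names are mine.

```python
import math


def one_sided_tail(n_sigma):
    """P(Z > n_sigma) for a standard Normal Z."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


def two_sided_tail(n_sigma):
    """P(|Z| > n_sigma) for a standard Normal Z."""
    return math.erfc(n_sigma / math.sqrt(2.0))


if __name__ == "__main__":
    for n_sigma in (1.96, 2.0, 3.0, 5.0):
        print(f"{n_sigma:>4} sigma: one-sided {one_sided_tail(n_sigma):.3e}, "
              f"two-sided {two_sided_tail(n_sigma):.3e}")
```

The familiar 95% level is the two-sided 1.96 sigma case; the one-sided 5 sigma tail is a few parts in ten million, which is why no realistic calibration study reaches it.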

Quote of the Week, Aug 23, 2007

aconnors — Fri, 24 Aug 2007 03:08:19 +0000

These are from two lively CHASC discussions on classification, or cluster analysis. The first was on Feb 7, 2006; the continuation on Dec 12, 2006, at the Harvard Statistics Department, as part of Stat 310.

David van Dyk:

Don’t demand too much of the classes. You’re not going to say that all events can be well-classified…. It’s more descriptive. It gives you places to look. Then you look at your classes.

Xiao-Li Meng:

Then you’re saying the cluster analysis is more like -

David van Dyk:

It’s really like you have a proposal for classes. You then investigate the physical processes more thoroughly. You may have classes that divide it [up]

……

David van Dyk:

But it can make a difference, where you see the clusters, depending on your [parameter] transformation. You can squish the white spaces, and stretch out the crowded spaces; so it can change where you think the clusters are.

Aneta Siemiginowska:

But that is interesting.

Andreas Zezas:

Yes, that is very interesting.

These are particularly in honor of Hyunsook Lee’s recent posting of Chattopadhyay et al.’s new work about possible intrinsic classes of gamma-ray bursts. Are they really physical classes, or do they only appear to be distinct clusters because we view them through the “squished” lens (parameter spaces) of our imperfect instruments?
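The “squishing” is easy to see on a toy example. This is only a sketch with made-up numbers, using the crudest possible clustering rule (cut a one-dimensional sample in two at its largest gap); it is not the method used in the gamma-ray burst work. The same six values split differently depending on whether the gap is measured on a linear or a logarithmic scale.

```python
import math


def split_at_largest_gap(values, transform=lambda x: x):
    """Split values into two groups at the largest gap in the transformed scale.

    The transform is assumed to be monotonically increasing, so the ordering
    of the points is unchanged and only the spacing between them differs.
    """
    ordered = sorted(values)
    mapped = [transform(v) for v in ordered]
    gaps = [mapped[i + 1] - mapped[i] for i in range(len(mapped) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


# Hypothetical measurements, chosen only to make the point.
data = [1, 2, 50, 60, 70, 400]

linear_clusters = split_at_largest_gap(data)
log_clusters = split_at_largest_gap(data, math.log)

if __name__ == "__main__":
    print("linear scale:", linear_clusters)
    print("log scale:   ", log_clusters)
```

On the linear scale the lone large value is set apart from everything else; on the log scale the two smallest values are set apart instead. Neither grouping is more “physical” than the other without further investigation.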

Quote of the Week, August 2, 2007

aconnors — Fri, 03 Aug 2007 02:18:50 +0000

Some of the lively discussion at the end of the first “Statistical Challenges in Modern Astronomy” conference, at Penn State in 1991, was captured in the proceedings (“General Discussion: Working on the Interface Between Statistics and Astronomy, Terry Speed (Moderator)”, in SCMA I, editors Eric D. Feigelson and G. Jogesh Babu, 1992, Springer-Verlag, New York, p. 505).

Joseph Horowitz (Statistician):

…there should be serious collaboration between astronomers and statisticians. Statisticians should be involved from the beginning as real collaborators, not mere number crunchers. When I collaborate with anybody, astronomer or otherwise, I expect to be a full scientific equal and to get something out of it of value to statistics or mathematics, in addition to making a contribution to the collaborator’s field…

Jasper Wall (Astrophysicist):

…I feel strongly that the knowledge of statistics needs to come very early in the process. It is no good downstream when the paper is written. It is not even much good when you have built the instrument, because we should disabuse statisticians of any impression that the data coming from astronomical instruments are nice, pure, and clean. Each instrument has its very own particular filter, each person using that instrument puts another filter on it and each method of data acquisition does something else yet again. I get more and more concerned particularly at the present time [1991] of data explosion (the observatory I work with is getting 700 MBy per night!). There is discussion of data compression, cleaning on-line, and other treatments even before the observing astronomer gets the data. The knowledge of statistics and the knowledge of what happens to the data need to come extremely early in the process.

Quote of the Week, July 26, 2007

aconnors — Fri, 27 Jul 2007 18:46:59 +0000

Peter Bickel:

“Bayesian” methods have, I think, rightly gained favor in astronomy as they have in other fields of statistical application. I put “Bayesian” in quotation marks because I do not believe this marks a revival in the sciences in the belief in personal probability. To me it rather means that all information on hand should be used in model construction, coupled with the view of Box [1979 etc], who considers himself a Bayesian:

Models, of course, are never true but fortunately it is only necessary that they be useful.

The Bayesian paradigm permits one to construct models and hence statistical methods which reflect such information in an, at least in principle, marvellously simple way. A frequentist such as myself feels as at home with these uses of Bayes principle as any Bayesian.

From Bickel, P. J. “An Overview of SCMA II”, in Statistical Challenges in Modern Astronomy II, editors G. Jogesh Babu and Eric D. Feigelson, 1997, Springer-Verlag, New York, p. 360.

[Box 1979] Box, G. E. P., 1979, “Some Problems of Statistics and Everyday Life”. J. Amer. Statist. Assoc., 74, 1-4.

Peter Bickel had so many interesting perspectives in his comments at these SCMA conferences that it was hard to choose just one set.

Quote of the Week, July 19, 2007

aconnors — Fri, 20 Jul 2007 03:01:16 +0000

Ten years ago, Astrophysicist John Nousek had this answer to Hyunsook Lee’s question “What is so special about chi square in astronomy?”:

The astronomer must also confront the problem that results need to be published and defended. If a statistical technique has not been widely applied in astronomy before, then there are additional burdens of convincing the journal referees and the community at large that the statistical methods are valid.

Certain techniques which are widespread in astronomy and seem to be accepted without any special justification are: linear and non-linear regression (Chi-Square analysis in general), Kolmogorov-Smirnov tests, and bootstraps. It also appears that if you find it in Numerical Recipes (Press et al. 1992) that it will be more likely to be accepted without comment.

…Note an insidious effect of this bias, astronomers will often choose to utilize a widely accepted statistical tool, even into regimes where the tool is known to be invalid, just to avoid the problem of developing or researching appropriate tools.

From pg 205, in “Discussion by John Nousek” (of Edward J. Wegman et al., “Statistical Software, Siftware, and Astronomy”), in “Statistical Challenges in Modern Astronomy II”, editors G. Jogesh Babu and Eric D. Feigelson, 1997, Springer-Verlag, New York.

Quote of the Week, July 12, 2007

aconnors — Thu, 12 Jul 2007 19:37:21 +0000

This is from the very interesting Ingrid Daubechies interview by Dorian Devins, www.nasonline.org/interviews_daubechies, National Academy of Sciences, U.S.A., 2004. It is from part 6, where Ingrid Daubechies speaks of her early mathematics paper on wavelets. She tries to put the impact into context:

I really explained in the paper where things came from. Because, well, the mathematicians wouldn’t have known. I mean, to them this would have been a question that really came out of nowhere. So, I had to explain it …

I was very happy with [the paper]; I had no inkling that it would take off like that… [Of course] the wavelets themselves are used. I mean, more than even that. I explained in the paper how I came to that. I explained both [a] mathematicians way of looking at it and then to some extent the applications way of looking at it. And I think engineers who read that had been emphasizing a lot the use of Fourier transforms. And I had been looking at the spatial domain. It generated a different way of considering this type of construction. I think, that was the major impact. Because then other constructions were made as well. But I looked at it differently. A change of paradigm. Well, paradigm, I never know what that means. A change of … a way of seeing it. A way of paying attention.

Quote of the Week, July 5, 2007

aconnors — Thu, 05 Jul 2007 20:13:10 +0000

Jeff Scargle (in person [top] and in wavelet transform [bottom], left) weighs in on our continuing discussion on how well “automated fitting”/“Machine Learning” can really work (private communication, June 28, 2007):

It is clearly wrong to say that automated fitting of models to data is impossible. Such a view ignores progress made in the area of machine learning and data mining. Of course there can be problems, I believe mostly connected with two related issues:

* Models that are too fragile (that is, easily broken by unusual data)
* Unusual data (that is, data that lie in some sense outside the arena that one expects)

The antidotes are:
(1) careful study of model sensitivity
(2) if the context warrants, preprocessing to remove “bad” points
(3) lots and lots of trial and error experiments, with both data sets that are as realistic as possible and ones that have extremes (outliers, large errors, errors with unusual properties, etc.)
Trial … error … fix error … retry …

You can quote me on that.

This illustration is from Jeff Scargle’s First GLAST Symposium (June 2007) talk, pg 14, demonstrating the use of inverse area of Voronoi tessellations, weighted by the PSF density, as an automated measure of the density of Poisson Gamma-Ray counts on the sky.
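The antidotes Scargle lists in the quote above (study the model’s sensitivity; experiment with data sets that contain extremes) can be tried on even the smallest toy problem. The sketch below is my own illustration, not anything from his talk: it fits a straight-line slope two ways, by ordinary least squares and by the Theil-Sen estimator (the median of all pairwise slopes), first on clean made-up data and then with one injected outlier.

```python
import statistics
from itertools import combinations


def least_squares_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def theil_sen_slope(xs, ys):
    """Theil-Sen slope: the median of the slopes through every pair of points."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    return statistics.median(slopes)


# Hypothetical test data: an exact line y = 2x + 1, then one corrupted point.
xs = list(range(10))
clean = [2.0 * x + 1.0 for x in xs]
contaminated = clean.copy()
contaminated[-1] = 100.0  # injected outlier; the true value would be 19.0

if __name__ == "__main__":
    for label, ys in (("clean", clean), ("contaminated", contaminated)):
        print(label,
              "least squares:", round(least_squares_slope(xs, ys), 3),
              "Theil-Sen:", round(theil_sen_slope(xs, ys), 3))
```

In this toy case a single bad point more than triples the least-squares slope while the median-based slope does not move: a “fragile” model and a sturdier one, exposed by exactly the kind of trial-and-error experiment the quote recommends.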