Archive for April 2008

tests of fit for the Poisson distribution

Skimming arXiv:astro-ph abstracts for almost a year has never offered me an occasion where the fit of the Poisson distribution is actually tested in any of various ways; instead it is taken for granted by plugging data and a (source) model into a (modified) χ2 function. If any doubts about the Poisson distribution occur, the following paper might be useful: Continue reading ‘tests of fit for the Poisson distribution’ »
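
As an aside of my own (not from the paper above), one classical check is the dispersion, or variance-to-mean, test: under the Poisson hypothesis the statistic Σ(x−x̄)²/x̄ is approximately χ2 with n−1 degrees of freedom. A minimal Python sketch, assuming scipy is available:

    import numpy as np
    from scipy import stats

    def poisson_dispersion_test(counts):
        # Under H0 (Poisson), sum((x - xbar)^2) / xbar ~ chi-square with n-1 dof.
        counts = np.asarray(counts, dtype=float)
        n, xbar = counts.size, counts.mean()
        statistic = np.sum((counts - xbar) ** 2) / xbar
        return statistic, stats.chi2.sf(statistic, df=n - 1)

    # toy example: the second sample is overdispersed and should give a tiny p-value
    rng = np.random.default_rng(0)
    print(poisson_dispersion_test(rng.poisson(5, size=200)))
    print(poisson_dispersion_test(rng.negative_binomial(2, 0.3, size=200)))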

[ArXiv] 4th week, Apr. 2008

The last paper in the list discusses MCMC for time series analysis, applied to sunspot data. There are six additional papers about statistics and data analysis from the week. Continue reading ‘[ArXiv] 4th week, Apr. 2008’ »

The LRT is worthless for …

One of the speakers in the Google talk series exemplified model-based clustering and mentioned the likelihood ratio test (LRT) for choosing the number of clusters. Since I have seen examples of badly practiced LRTs in astronomical journals, such as testing two clusters vs. three, or a higher number of components, I could not resist pointing out that the LRT appeared to be improperly used in his illustration. In reply, he explained that the LRT he cited referred to a different analysis than his plot, and that the test had been carried out for one component vs. two, which comes closer to satisfying the regularity conditions. I was relieved not to find another example of the ill-used LRT. Continue reading ‘The LRT is worthless for …’ »
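
To illustrate why the nominal χ2 reference distribution cannot be trusted when comparing numbers of mixture components (my own sketch, not from the talk), one can calibrate the one- vs. two-component LRT statistic by a parametric bootstrap, assuming scikit-learn is available:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def lrt_stat(x, seed=0):
        # 2 * (logL_2 - logL_1) for one- vs. two-component Gaussian mixtures
        x = x.reshape(-1, 1)
        g1 = GaussianMixture(1, random_state=seed).fit(x)
        g2 = GaussianMixture(2, n_init=5, random_state=seed).fit(x)
        return 2.0 * len(x) * (g2.score(x) - g1.score(x))

    rng = np.random.default_rng(1)
    obs = rng.normal(0.0, 1.0, size=300)        # data truly from one component
    t_obs = lrt_stat(obs)

    # null distribution of the statistic, simulated from the one-component fit
    null = [lrt_stat(rng.normal(obs.mean(), obs.std(), size=obs.size), seed=b)
            for b in range(200)]
    print(t_obs, np.mean(np.array(null) >= t_obs))   # bootstrap p-value, not chi-square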

Is 8-sigma significant enough for you?

There is a new report from Bernabei et al. (arXiv:0804.2741) of the direct detection of the effects of Dark Matter that is causing a lot of buzz. (The Bad Astronomer has a good summary.) They find yearly modulation in their detected scintillation rate that matches what you would expect if the Earth were rushing through Galactic Dark Matter as it goes around the Sun. They have worked out the significance of the modulation to be 8.2 sigma. Significant! But significant of what? Continue reading ‘Is 8-sigma significant enough for you?’ »
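
For orientation only (my numbers, not theirs): an 8.2σ Gaussian-equivalent significance corresponds to a one-sided tail probability of roughly 10−16, e.g.

    from scipy.stats import norm

    # one-sided tail probability equivalent to an 8.2 sigma Gaussian fluctuation
    print(norm.sf(8.2))   # about 1.2e-16

which is exactly why the interesting question is not the size of the number but what hypothesis it actually tests.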

[ArXiv] Ripley’s K-function

Because of the extensive work by Prof. Peebles and many (observational) cosmologists (I almost always find Prof. Peebles’ book cited in the cosmology literature), the 2- (or 3-) point correlation function dominates any other mathematical or statistical method for understanding the structure of the universe. Unusually, this week brings an astro-ph paper written by a statistics professor that uses the K-function to explore the mystery of the universe.

[astro-ph:0804.3044] J.M. Loh
Estimating Third-Order Moments for an Absorber Catalog

Continue reading ‘[ArXiv] Ripley’s K-function’ »
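
For readers who have not met Ripley’s K-function, here is a naive sketch of my own (no edge correction, which does matter in practice) of the standard estimator for a 2-D point pattern:

    import numpy as np

    def ripley_k(points, r_values, area):
        # Naive estimate: K(r) = area / n^2 * number of ordered pairs closer than r.
        pts = np.asarray(points, dtype=float)
        n = len(pts)
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)              # exclude self-pairs
        return np.array([area * np.sum(d <= r) / n**2 for r in r_values])

    # toy example on the unit square; under complete spatial randomness K(r) ~ pi r^2
    rng = np.random.default_rng(2)
    pts = rng.uniform(0.0, 1.0, size=(500, 2))
    r = np.array([0.05, 0.1, 0.2])
    print(ripley_k(pts, r, area=1.0))
    print(np.pi * r**2)

Third-order statistics, the subject of the paper above, extend the same idea by, roughly speaking, counting triples instead of pairs.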

[ArXiv] 3rd week, Apr. 2008

The dichotomy of outliers: detecting outliers to be discarded or to be investigated; statistics robust enough not to be influenced by outliers, or sensitive enough to flag anomalies in the data distribution. Although not directly related, one paper about outliers made me dwell on what outliers are. This week’s topics are diverse. Continue reading ‘[ArXiv] 3rd week, Apr. 2008’ »

AstroGrid Desktop Suite

The AstroGrid Desktop Suite is available. Check the AstroGrid website http://www.astrogrid.org for more information. Continue reading ‘AstroGrid Desktop Suite’ »

PCA

Prof. Speed writes columns for the IMS Bulletin, and the April 2008 issue has Terence’s Stuff: PCA (p. 9). Here are some quotes with minor paraphrasing:

Although a quintessentially statistical notion, my impression is that PCA has always been more popular with non-statisticians. Of course we love to prove its optimality properties in our courses, and at one time the distribution theory of sample covariance matrices was heavily studied.

…but who could not feel suspicious when observing the explosive growth in the use of PCA in the biological and physical sciences and engineering, not to mention economics?…it became the analysis tool of choice of the hordes of former physicists, chemists and mathematicians who unwittingly found themselves having to be statisticians in the computer age.

My initial theory for its popularity was simply that they were in love with the prefix eigen-, and felt that anything involving it acquired the cachet of quantum mechanics, where, you will recall, everything important has that prefix.

He gave the following eigen-’s: eigengenes, eigenarrays, eigenexpression, eigenproteins, eigenprofiles, eigenpathways, eigenSNPs, eigenimages, eigenfaces, eigenpatterns, eigenresult, and even eigenGoogle.

How many miracles must one witness before becoming a convert?…Well, I’ve seen my three miracles of exploratory data analysis, examples where I found I had a problem, and could do something about it using PCA, so now I’m a believer.

Needless to say, astronomers also explore data with PCA and use eigenvalues and eigenvectors to transform raw data into more interpretable quantities.
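
For completeness, a minimal sketch of that kind of exploration with numpy’s SVD (my own toy data, not tied to any particular survey):

    import numpy as np

    # toy data: 200 objects with 5 correlated measurements driven by 2 latent factors
    rng = np.random.default_rng(3)
    latent = rng.normal(size=(200, 2))
    X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

    Xc = X - X.mean(axis=0)                       # center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)               # fraction of variance per component
    scores = Xc @ Vt.T                            # projections onto the principal axes

    print(np.round(explained, 3))                 # the first two components dominate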

Significance of 5 counts

We have talked about it many times. Now I have to work with the reality. My source shows only 5 counts in a short 5 ksec Chandra exposure. Is this a detection of the source, or is it a random fluctuation? The Chandra background is low and the data are intrinsically Poisson, so the problem should be easy to solve. Not really! There seemed to be no tool to calculate this :-) well, actually there is! Tom A. and I found it by Googling "Python gamma function" and came up with Tom Loredo's Python functions (sp_funcs.py), which he translated from Numerical Recipes into Python. This is the working tool! We just needed to change "import Numeric" or "import Numarray" to "import numpy as N" and then it worked.

We calculated the significance of observing 5 (or more) counts given expected background counts of 0.1 using spfunc.gammp(5, 0.1) = 8e-8. The detection is highly significant.
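
The same tail probability can be reproduced with current scipy (a quick check of my own; gammainc is scipy’s regularized lower incomplete gamma, the counterpart of gammp):

    from scipy import special, stats

    mu = 0.1   # expected background counts in the exposure
    k = 5      # observed counts

    # P(N >= 5 | mu = 0.1) via the regularized lower incomplete gamma, as in gammp(5, 0.1)
    print(special.gammainc(k, mu))       # about 7.7e-8
    # the same number from the Poisson survival function
    print(stats.poisson.sf(k - 1, mu))   # about 7.7e-8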

Any comments?

Lomb-Scargle periodograms in bioinformatics

A statistical method developed by insightful and brilliant astronomers is used in bioinformatics:
Detecting periodic patterns in unevenly spaced gene expression time series using Lomb–Scargle periodograms
by Glynn, Chen, & Mushegian [Click for R code and relevant information] [Paper archive at Bioinformatics]

The conclusion clearly states the strong points of the Lomb-Scargle periodogram.

The Lomb-Scargle periodogram algorithm is an effective tool for finding periodic gene expression profiles in microarray data, especially when data may be collected at arbitrary time points or when a significant proportion of data is missing.

My personal wish is that data-driven statistical methods developed by hands-on scientists (and their statistical collaborators) will be used in other disciplines, because I believe data sets are likely to share the unknown truth of our one universe.
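
For anyone who wants to try the method quickly, a minimal sketch of my own using scipy’s implementation on an unevenly sampled sinusoid (the paper above provides the authors’ R code instead):

    import numpy as np
    from scipy.signal import lombscargle

    rng = np.random.default_rng(4)
    t = np.sort(rng.uniform(0.0, 50.0, size=120))       # uneven sampling times
    y = np.sin(2 * np.pi * t / 7.0) + 0.5 * rng.normal(size=t.size)

    freqs = np.linspace(0.01, 2.0, 2000)                # frequencies in cycles per time unit
    power = lombscargle(t, y - y.mean(), 2 * np.pi * freqs)   # expects angular frequencies

    print(1.0 / freqs[np.argmax(power)])                # close to the true period of 7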

The Burden of Reviewers

Astronomers write literally thousands of proposals each year to observe their favorite targets with their favorite telescopes. Every proposal must be accompanied by a technical justification, where the proposers demonstrate that their goal is achievable, usually via a simulation. Surprisingly, a large number of these justifications are statistically unsound. Guest Slogger Simon Vaughan describes the problem and shows what you can do to make reviewers happy (and you definitely want to keep reviewers happy).
Continue reading ‘The Burden of Reviewers’ »

Kepler and the Art of Astrophysical Inference

I recently discovered iTunesU, and I have to confess, I find it utterly fascinating. By golly, it is everything that they promised us that the internet would be. Informative, entertaining, and educational. What are the odds?!? Anyway, while poking around the myriad lectures, courses, and talks that are now online, I came across a popular Physics lecture series at UMichigan which listed a talk by one of my favorite speakers, Owen Gingerich. He had spoken about The Four Myths of the Copernican Revolution last November. It was, how shall we say, riveting.

Owen talks in detail about how the Copernican model came to supplant the Ptolemaic model. In particular, he describes how Kepler went from Ptolemaic epicycles to elliptical orbits. Contrary to the general impression, Kepler did not fit ellipses to Tycho Brahe’s observations of Mars. The ellipticity is far too small for that to be fittable! Rather, he used logical reasoning to first offset Earth’s epicycle away from the center in order to avoid the so-called Martian Catastrophe, and then used the phenomenological constraint of the law of equal areas to infer that the path must be an ellipse.

This process, along with Galileo’s advocacy for the heliocentric system, demonstrates a telling fact about how Astrophysics is done in practice. Hyunsook once lamented that astronomers seem to be rather trigger happy with correlations and regressions, and everyone knows they don’t constitute proof of anything, so why do they do it? Owen says about 39 1/2 minutes into the lecture:

Here we have the fourth of the myths, that Galileo’s telescopic observations finally proved the motion of the earth and thereby, at last, established the truth of the Copernican system.

What I want to assure you is that, in general, science does not operate by proofs. You hear that an awful lot, about science looking for propositions that can be falsified, that proof plays this big role.. uh-uh. It is coherence of explanation, understanding things that are well-knit together; the broader the framework of knitting the things together, the more we are able to believe it.

Exactly! We build models, often with little justification in terms of experimental proof, and muddle along trying to make it fit into a coherent narrative. This is why statistics is looked upon with suspicion among astronomers, and why for centuries our mantra has been “if it takes statistics to prove it, it isn’t real!”

[ArXiv] 2nd week, Apr. 2008

Markov chain Monte Carlo has become one of the most frequently used and well-established statistical tools in astronomy. It would be useful to collect tutorials from both professions. Continue reading ‘[ArXiv] 2nd week, Apr. 2008’ »

[ArXiv] use of the median

The breakdown point of the mean is asymptotically zero, whereas the breakdown point of the median is 1/2. The breakdown point is a measure of an estimator’s robustness, and its value can be at most 1/2. In the presence of outliers, the mean cannot be a good measure of the central location of the data distribution, whereas the median is still likely to locate the center. Common plug-in estimators like the mean and the root mean square error may not provide the best fits and uncertainties because of this zero breakdown point of the mean. The efficiency of the mean as an estimator does not guarantee its unbiasedness; therefore, a bit of care is needed before plugging the data into these estimators to get the best fit and uncertainty. There was a preprint on [arXiv] about the use of the median last week. Continue reading ‘[ArXiv] use of the median’ »
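
A quick numerical illustration of the breakdown-point argument (my own toy example): corrupting even a small fraction of a sample drags the mean far away, while the median barely moves.

    import numpy as np

    rng = np.random.default_rng(5)
    clean = rng.normal(loc=10.0, scale=1.0, size=1000)

    contaminated = clean.copy()
    contaminated[:50] = 1e4                   # corrupt 5% of the sample with wild outliers

    print(np.mean(clean), np.median(clean))                   # both near 10
    print(np.mean(contaminated), np.median(contaminated))     # mean ~ 510, median still near 10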

Google Sky

For people in the Boston area, a cornucopia of talks on Google Sky in the near future.

  1. Hunting for Needles in Massive Astronomical Data Streams
    Wednesday, April 9, 2008 at 4pm
    Room 330, 60 Oxford St.
    Ryan Scranton, Google Sky Team
  2. Inside Google Sky
    Wednesday, April 9, 2008 at 8pm
    Room 105, Emerson Hall
    Andrew Connolly, Google Sky Team
  3. Sky in Google Earth
    Tuesday, April 15, 2008 at 1pm
    Phillips Auditorium, 60 Garden
    Alberto Conti & Carol Christian, STScI