The AstroStat Slog » inference
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy + Statistics + Computer Science + Engineering + Instrumentation, far beyond the growing borders

[MADS] Kalman Filter
http://hea-www.harvard.edu/AstroStat/slog/2009/mads-kalman-filter/
Fri, 02 Oct 2009 03:18:32 +0000, by hlee

I decided a while ago to discuss the Kalman filter on the slog, after finding out that this popular methodology is rather underrepresented in astronomy. It is not completely missing from ADS, though: full-text and all-bibliographic-source searches return more results. Their use of the Kalman filter, however, looked similar to the loose way "genetic algorithms" or "Bayes theorem" get invoked. Probably the broad notion of the Kalman filter makes it difficult to find its applications in astronomy by name, since wheels are often reinvented (algorithms under different names share the same objective).

When I learned the "Kalman filter" for the first time, I was not sure how to distinguish it from the "Yule-Walker equations" (time series), the "Padé approximant" (unfortunately, the wiki page does not give its matrix form), the "Wiener filter" (signal processing), etc. Here are the publications that specifically mention the name Kalman filter in their abstracts, found from ADS.

The motivation for introducing the Kalman filter, even though it is a very well known term, is the recent Fisher Lecture given by Noel Cressie at JSM 2009. He is a leading expert in spatial statistics and the author of a very famous book on the subject. During his presentation, he described challenges from satellite data and how the Kalman filter accelerated computing a gigantic covariance matrix in kriging. Satellite data in meteorology and the geosciences may not exactly match astronomical satellite data, but from the statistical modeling perspective the challenges are similar: massive data, streaming data, multiple dimensions, temporal structure, missing observations in certain areas, different exposure times, estimation and prediction, interpolation and extrapolation, large image sizes, and so on. It is not just a matter of denoising/cleaning images. Statisticians want to find the driving force behind certain features by modeling and to perform statistical inference. (They do not mind parametrizing the metric/measure/quantity of interest for modeling, or they approach the problem in a nonparametric fashion.) I understood the use of the Kalman filter as a fast solution to inverse problems for inference.
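
For readers who have not met it before, the Kalman filter is essentially a recursive least-squares update: predict the state forward, then correct the prediction with the new measurement, weighted by a gain that balances process noise against measurement noise. Below is a minimal sketch of the scalar (random-walk) case in Python; it only illustrates the recursion, not Cressie's fixed-rank kriging machinery, and the noise levels q and r are made-up numbers.

    import numpy as np

    def kalman_filter_1d(measurements, q=1e-3, r=0.5, x0=0.0, p0=1.0):
        """Scalar Kalman filter for a random-walk state observed with noise.

        State model:        x_k = x_{k-1} + w_k,  w_k ~ N(0, q)
        Measurement model:  y_k = x_k + v_k,      v_k ~ N(0, r)
        Returns the filtered state estimates and their variances.
        """
        x, p = x0, p0
        states, variances = [], []
        for y in measurements:
            # Predict: propagate the state and inflate its uncertainty.
            x_pred, p_pred = x, p + q
            # Update: weight the new measurement by the Kalman gain.
            gain = p_pred / (p_pred + r)
            x = x_pred + gain * (y - x_pred)
            p = (1.0 - gain) * p_pred
            states.append(x)
            variances.append(p)
        return np.array(states), np.array(variances)

    # Toy data: a slowly drifting signal observed with large noise.
    rng = np.random.default_rng(42)
    truth = np.cumsum(rng.normal(0.0, 0.03, size=200))
    observed = truth + rng.normal(0.0, 0.7, size=200)
    filtered, filtered_var = kalman_filter_1d(observed, q=0.03**2, r=0.7**2)

The same predict/update structure carries over to vector states, and the sequential form is what makes updating huge spatio-temporal covariance structures tractable in applications like the kriging computation Cressie described.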

a century ago
http://hea-www.harvard.edu/AstroStat/slog/2009/a-century-ago/
Thu, 07 May 2009 19:22:37 +0000, by hlee

Almost 100 years ago, A.S. Eddington stated in his book Stellar Movements (1914) that

…in calculating the mean error of a series of observations it is preferable to use the simple mean residual irrespective of sign rather than the mean square residual

Such an eminent astronomer was already advocating least absolute deviation over chi-square, if I match "simple mean residual" and "mean square residual" to the relevant methodologies, in that order.

I guess there is a reason everything is done based on chi-square, although a small fraction of the astronomy community is aware that chi-square minimization is not the only utility function for finding best fits. The assumption that the residuals "(Observed – Expected)/sigma", as written in chi-square methods, are (asymptotically) normal – Gaussian – is freely taken for granted for astronomical data (by the astronomers who analyze them), mainly because of their high volume. The worst part is that even though checking procedures for Gaussianity are available in the statistical literature, applying those procedures to astronomical data is either difficult or ignored. Anyway, if one is sure that the data/parameters of interest are sampled from a normal distribution, Eddington's statement is better reversed because of sufficiency. We also know the asymptotic efficiency of the sample standard deviation when the probability density function satisfies more general regularity conditions than the Gaussian density.

As a statistician, it is easy to say, "assume that data are iid standard normal, wlog," and then develop a statistic, investigate its characteristics, and compare it with other statistics. If this statistic does not show promising results in the comparison and suffers badly from the normality assumption, then statisticians will attempt to make it robust by checking and relaxing the assumptions and the math. On the other hand, I'm not sure how comfortable astronomers are with this Gaussianity assumption in their data, most of which are married to statistics or statistical tools based on the normal assumption. How often have such efforts of devising a statistic and trying different statistics taken place?

Without imposing the Gaussianity assumption, I think that Eddington's comment is extremely insightful. Commonly cited statistical methods in astronomy, like chi-square methods, are built on the Gaussianity assumption, from which the sample standard deviation is used for σ, the scale parameter of the normal distribution that is mapped to 68% coverage, and multiples of the sample standard deviation correspond to well known percentiles as given in Numerical Recipes. In the end, I think statistical analysis in the astronomy literature suffers from a dilemma, "which came first, the chicken or the egg?" On the other hand, I feel set back because such an insightful comment from one of the most renowned astrophysicists did not gain much weight after many decades. My understanding that Eddington's suggestion was ignored comes from reading only a fraction of the publications in major astronomical journals; therefore, I might be wrong. Perhaps astronomers use LAD and do robust inference more often than I think.

Unless one is sure about Gaussianity in the data (covering the sampling distribution, the residuals between observed and expected, and any transformations), the sample standard deviation may not be appropriate for getting error bars with matching coverage in inference problems. Estimating posterior distributions is a well received approach among some astronomers, and there are good tutorials and textbooks about Bayesian data analysis for astronomers. For those familiar with the basics of statistics, pyMC and its tutorial (or another link from python.org) will be very useful for proper statistical inference. If Bayesian computation sounds too cumbersome, then for simplicity follow Eddington's advice: instead of the sample standard deviation, use the mean absolute deviation (the simple mean residual, in Eddington's words) to quantify uncertainty. Perhaps one wants to compare best fits and error bars from both strategies.
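
As a concrete, purely illustrative version of that advice, the snippet below computes both scale estimates for the same set of residuals; the heavy-tailed toy data are my own stand-in for whatever residuals one actually has.

    import numpy as np

    rng = np.random.default_rng(0)
    residuals = rng.standard_t(df=3, size=500)   # heavy-tailed stand-in for real residuals

    # Classical choice: the sample standard deviation.
    sd = np.std(residuals, ddof=1)

    # Eddington's "simple mean residual": the mean absolute deviation from the mean.
    mad = np.mean(np.abs(residuals - residuals.mean()))

    # Under exact Gaussianity, E|X - mu| = sigma * sqrt(2/pi), so this rescaling
    # makes the two numbers directly comparable as error bars.
    sigma_from_mad = mad * np.sqrt(np.pi / 2.0)

    print(f"sample standard deviation : {sd:.3f}")
    print(f"mean absolute deviation   : {mad:.3f}")
    print(f"sigma implied by the MAD  : {sigma_from_mad:.3f}")

When the residuals are heavy-tailed, the two numbers disagree noticeably, which is exactly the warning sign Eddington's advice is meant to catch.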

——————————————————
This story was inspired by Studies in the History of Probability and Statistics. XXXII: Laplace, Fisher, and the discovery of the concept of sufficiency by Stigler (1973), Biometrika, v. 60(3), p. 439. The Eddington quote was adapted from this article. Here is another quote from the article that I would like to share:

Fisher went on to observe that this property of σ2[1] is quite dependent on the assumption that the population is normal, and showed that indeed σ1[2] is preferable to σ2, at least in large samples, for estimating the scale parameter of the double exponential distribution, providing both estimators are appropriately rescaled

By assuming that each observation is normally (Gaussian) distributed with mean mu and variance sigma^2, and that the object was to estimate sigma, Fisher proved that the sample standard deviation (or mean square residual) is more efficient than the mean deviation from the sample mean (or simple mean residual). Laplace had proved it as well. The catch is that the assumptions come first; the sample standard deviation is not automatically the right choice for estimating the error (or sigma) of an unknown distribution. (A toy simulation illustrating this efficiency reversal is sketched after the footnotes below.)

  1. sample standard deviation
  2. mean deviation from the sample mean
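
A quick Monte Carlo makes Fisher's point tangible. The sketch below is my own toy simulation, not anything from Stigler's paper: each estimator is rescaled so that it targets the true scale, and their mean squared errors are compared under Gaussian and under double exponential (Laplace) data.

    import numpy as np

    rng = np.random.default_rng(1)
    n, n_sim = 200, 5000

    def scale_estimates(samples):
        """Return (sample SD, mean absolute deviation) for each row of samples."""
        sd = samples.std(axis=1, ddof=1)
        mad = np.abs(samples - samples.mean(axis=1, keepdims=True)).mean(axis=1)
        return sd, mad

    # Gaussian data with true sigma = 1: rescale the MAD by sqrt(pi/2) so both target 1.
    gauss = rng.normal(0.0, 1.0, size=(n_sim, n))
    sd_g, mad_g = scale_estimates(gauss)
    mse_sd_g = np.mean((sd_g - 1.0) ** 2)
    mse_mad_g = np.mean((mad_g * np.sqrt(np.pi / 2.0) - 1.0) ** 2)

    # Laplace data with true scale b = 1: E|X| = b, while the SD targets sqrt(2) * b.
    lap = rng.laplace(0.0, 1.0, size=(n_sim, n))
    sd_l, mad_l = scale_estimates(lap)
    mse_sd_l = np.mean((sd_l / np.sqrt(2.0) - 1.0) ** 2)
    mse_mad_l = np.mean((mad_l - 1.0) ** 2)

    print("Gaussian data: MSE(SD) / MSE(MAD) =", mse_sd_g / mse_mad_g)  # < 1, SD wins
    print("Laplace data : MSE(SD) / MSE(MAD) =", mse_sd_l / mse_mad_l)  # > 1, MAD wins

With these settings the ratio comes out below one for Gaussian data (the standard deviation is more efficient) and above one for Laplace data (the mean deviation is more efficient), which is precisely the reversal Fisher and Eddington were arguing about.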

[Announce] Heidelberg Summer School
http://hea-www.harvard.edu/AstroStat/slog/2009/heidelberg-summer-school/
Wed, 25 Mar 2009 14:13:00 +0000, by chasc

From Christian Fendt comes this announcement:

——————————————————————
First Announcement and Call for Applications
——————————————————————

The “International Max Planck Research School for Astronomy & Cosmic Physics at the University of Heidelberg” (IMPRS-HD)

announces the

— 4th Heidelberg Summer School:

— Statistical Inferences from Astrophysical Data

— August 10-14, 2009


IMPRS Heidelberg invites graduate students and postdocs to its 4th Heidelberg Summer School. This year’s school is centered on how to draw scientific inferences from astrophysical data sets. We will also discuss proper statistical methods that are crucial for testing specific astrophysical models.

The school will present essential statistical concepts and techniques. These concepts will be illustrated through various astrophysical examples. Approaches such as Monte Carlo, maximum likelihood techniques, Bayesian statistics, parametric tests, biases in censored/incomplete data, or time-series analysis will be applied in computer exercises.

The main lecturing program is presented by invited speakers and is accompanied by practical exercises and also science talks on specific topics by local experts.

Invited lecturers are:

— David W. HOGG, New York University

— Ian McHARDY, University of Southampton

— William H. PRESS, University of Texas, Austin

Deadline for application is June 15, 2009.

Please find more information, our poster, and the application
forms under
www.mpia.de/imprs-hd/
www.mpia.de/imprs-hd/SummerSchools/2009/

A limited number of grants are available to partially cover travel expenses of participants.

IMPRS-HD is an independent part of the Heidelberg Graduate School for Fundamental Physics.

accessing data, easier than before but…
http://hea-www.harvard.edu/AstroStat/slog/2009/accessing-data/
Tue, 20 Jan 2009 17:59:56 +0000, by hlee

Someone emailed me for the globular cluster data sets I used in a proceedings paper, which was about how to determine multi-modality (multiple populations) based on well known and new information criteria without binning the luminosity functions. I spent quite some time understanding data sets with suspicious numbers of globular cluster populations. On the other hand, obtaining globular cluster data sets was easy because of available data archives such as VizieR. Most data sets in charts/tables I acquire from VizieR. In order to understand the science behind those data sets, I check ADS. Well, actually it happens the other way around: I check the scientific background first to assess whether there is room for statistics, and then search for available data sets.

However, if you are interested in massive multivariate data, or if you want a subsample from a gigantic survey project that cannot all be documented in the way those individual small catalogs are, you might like to learn a little Structured Query Language (SQL). With nice examples and explanations, some terabytes of data are available from SDSS. Instead of images in FITS format, one can get ascii/table data sets (the variables for millions of objects are magnitudes and their errors; positions and their errors; classes like stars, galaxies, and AGNs; types or subclasses like elliptical galaxies, spiral galaxies, type I AGN, type Ia, Ib, Ic, and II SNe, various spectral types, etc.; estimated variables like photo-z, which is my keen interest; and more). Furthermore, thousands of papers related to SDSS are available to satisfy your scientific cravings. (Here are Slog postings under the SDSS tag.)
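
To give a flavor of what such a query looks like, here is a hedged sketch: the table and column names follow my reading of the SDSS SkyServer schema and should be verified in the schema browser, and the astroquery call is just one convenient way to submit the SQL (pasting it into the SkyServer/CasJobs web form works too).

    # Sketch only: table/column names are assumptions to be checked against the
    # SDSS schema browser; requires the optional astroquery package.
    from astroquery.sdss import SDSS

    query = """
    SELECT TOP 1000
        p.objID, p.ra, p.dec,
        p.modelMag_u, p.modelMag_g, p.modelMag_r, p.modelMag_i, p.modelMag_z,
        s.z AS spec_z, s.class
    FROM PhotoObj AS p
    JOIN SpecObj AS s ON s.bestObjID = p.objID
    WHERE p.modelMag_r BETWEEN 15 AND 19
      AND s.class = 'GALAXY'
    """

    table = SDSS.query_sql(query)   # returns an astropy Table (or None if the query fails)
    print(table[:5])

Free-form SQL like this is what makes subsampling a terabyte-scale survey practical without ever downloading images.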

If you don’t want to limit yourself to ascii tables, you may like to check the quick guide/tutorial of Gator, which aggregates archives of various missions: 2MASS (Two Micron All-Sky Survey), IRAS (Infrared Astronomical Satellite), Spitzer Space Telescope Legacy Science Programs, MSX (Midcourse Space Experiment), COSMOS (Cosmic Evolution Survey), DENIS (Deep Near Infrared Survey of the Southern Sky), and USNO-B (United States Naval Observatory B1 Catalog). Probably, you also want to check NED, the NASA/IPAC Extragalactic Database. As of today, the website says that 163 million objects, 170 million multiwavelength object cross-IDs, 188 thousand associations (candidate cross-IDs), 1.4 million redshifts, and 1.7 billion photometric measurements are accessible, which seems more than enough for data mining, exploring/summarizing data, and developing streaming/massive data analysis tools.

Probably, astronomers might wonder why I’m not advertising the Chandra Data Archive (CDA) and its project oriented catalog/database. All I can say is that it is not friendly to an independent statistician. It is very likely that I am the only statistician who has tried to use data from CDA directly and bothered to understand the contents. I can assure you that without astronomers’ help, the archive is just a hot potato: you don’t want to touch it. I’ve been there. Regardless of how painful it is, I’ve kept trying to touch it, since it’s hard to resist after knowing what’s in there. Fortunately, there are other data-scientist-friendly archives that cause much less suffering than CDA. There are plenty of things statisticians can do to improve astronomers’ decades-old data analysis algorithms based on the Gaussian distribution, the iid assumption, or the L2 norm, and to reflect the true nature of the data with more relaxed assumptions for robust analysis strategies than the traditionally pursued parametric distributions with specific models (a distribution-free method is more robust than one based on the Gaussian distribution, but the latter is more efficient) – not just with CDA but with other astronomical data archives. The latter, like VizieR or SDSS, provide data sets that are much less painful to explore without familiarity with astronomical software/packages.

Computer scientists are well aware of the UCI machine learning archive, with which they can validate their new methods against previous ones and empirically show how superior their methods are. Statisticians are used to handling well trimmed data; otherwise we suggest strategies for collecting data for statistical inference. Although tons of data collecting and sampling protocols exist, most of them do not match the data formats, types, and natures, nor the way data are collected from observing the sky via complexly structured instruments. Some archives might be extensively exclusive to the funded researchers and their beneficiaries. Some archives might be super hot potatoes with which no statistician wants to get involved even though they are free of charge. Overall, I’d like to warn you not to expect the well tabulated simplicity of textbook data sets found in exploratory data analysis and machine learning books.

Someone will raise another question: why do I not mention VOs (virtual observatories, click for slog postings) and Google Sky (click for slog postings), which I have praised in the slog many times as good resources to explore the sky and to learn astronomy? Unfortunately, for the purpose of direct statistical applications, neither VOs nor Google Sky may be as useful as their names suggest. It is very likely that after spending hours exploring these facilities you will end up with one of the archives or web interfaces that I mentioned above. It would be easier to talk to your nearest astronomer, who hopefully is aware of the importance of statistics and could offer you a statistically challenging data set without your having to worry about how to process and clean raw data sets and how to build statistically suitable catalogs/databases. Every astronomer in a survey project builds his/her own catalog and finds common factors/summary statistics of the catalog from the perspective of understanding/summarizing the data, the primary goal of executing statistical analyses.

I believe some astronomers want to advertise their archives and show off how public friendly they are. Such advertising comments are very welcome, because I intentionally left room for them instead of listing more archives I have heard of without hands-on experience. My only wish is that more statisticians use astronomical data from these archives so that the application sections of their papers are filled with data from them. As with sunspots, I wish that more astronomical data sets could be used to validate methodologies, algorithms, and eventually theories. I sincerely wish this happens soon, before I drift away from astrostatistics and can no longer preach about the benefits of astronomical data and their archives to make ends meet.

There is no single well known data repository in astronomy like the UCI machine learning archive. Nevertheless, I can assure you that the nature of astronomical data and catalogs bears various statistical problems, and many of those problems have never been properly formulated as statistical inference problems. There are so many statistical challenges residing in them. Not enough statisticians bother to look at these data, because of the gigantic demand for statisticians from uncountably many data oriented scientific disciplines and the persistent shortage in supply.

Kepler and the Art of Astrophysical Inference
http://hea-www.harvard.edu/AstroStat/slog/2008/kepler-inference/
Wed, 16 Apr 2008 22:49:18 +0000, by vlk

I recently discovered iTunesU, and I have to confess, I find it utterly fascinating. By golly, it is everything that they promised us the internet would be: informative, entertaining, and educational. What are the odds?!? Anyway, while poking around the myriad lectures, courses, and talks that are now online, I came across a popular physics lecture series at UMichigan which listed a talk by one of my favorite speakers, Owen Gingerich. He had spoken about The Four Myths of the Copernican Revolution last November. It was, how shall we say, riveting.

Owen talks in detail about how the Copernican model came to supplant the Ptolemaic model. In particular, he describes how Kepler went from Ptolemaic epicycles to elliptical orbits. Contrary to the general impression, Kepler did not fit ellipses to Tycho Brahe’s observations of Mars; the ellipticity is far too small for it to be fittable! Rather, he used logical reasoning to first offset Earth’s epicycle away from the center in order to avoid the so-called Martian Catastrophe, and then used the phenomenological constraint of the law of equal areas to infer that the path must be an ellipse.

This process, along with Galileo’s advocacy for the heliocentric system, demonstrates a telling fact about how Astrophysics is done in practice. Hyunsook once lamented that astronomers seem to be rather trigger happy with correlations and regressions, and everyone knows they don’t constitute proof of anything, so why do they do it? Owen says about 39 1/2 minutes into the lecture:

Here we have the fourth of the myths, that Galileo’s telescopic observations finally proved the motion of the earth and thereby, at last, established the truth of the Copernican system.

What I want to assure you is that, in general, science does not operate by proofs. You hear that an awful lot, about science looking for propositions that can be falsified, that proof plays this big role.. uh-uh. It is coherence of explanation, understanding things that are well-knit together; the broader the framework of knitting the things together, the more we are able to believe it.

Exactly! We build models, often with little justification in terms of experimental proof, and muddle along trying to make them fit into a coherent narrative. This is why statistics is looked upon with suspicion among astronomers, and why for centuries our mantra has been “if it takes statistics to prove it, it isn’t real!”

Prof. Brad Efron visits Harvard
http://hea-www.harvard.edu/AstroStat/slog/2008/prof-brad-efron-visits-harvard/
Tue, 25 Mar 2008 00:03:49 +0000, by hlee

Bradley Efron, Stanford University
11:00 AM, Friday, April 4, 2008
Sever Hall Rm. 103
Title: SIMULTANEOUS INFERENCE: WHEN SHOULD HYPOTHESIS TESTING PROBLEMS BE COMBINED
Its abstract and other information are available at http://www.stat.harvard.edu/Colloquia_Content/Efron08.pdf

Recently awarded the National Medal of Science, statistics professor Efron of Stanford University has played a major role in many groundbreaking interdisciplinary collaborations, including astronomy. A quote from his website:

I like working on applied and theoretical problems at the same time and one thing nice about statistics is that you can be useful in a wide variety of areas. So my current applications include biostatistics and also astrophysical applications. The surprising thing is that the methods used are similar in both areas. I recently gave a talk called Astrophysics and Biostatistics–the odd couple at Penn State that made this point.

This seminar will help in grasping his brilliant insights on applying statistics to other disciplines, as well as his ingenious statistical mind. In particular, multiple testing is a growing interest among high energy physicists.

Click the Harvard Statistics Colloquium Series for more information.

The GREAT08 Challenge
http://hea-www.harvard.edu/AstroStat/slog/2008/the-great08-challenge/
Fri, 29 Feb 2008 03:46:49 +0000, by vlk

Grand statistical challenges seem to be all the rage nowadays. Following on the heels of the Banff Challenge (which dealt with figuring out how to set the bounds for the signal intensity that would result from the Higgs boson) comes the GREAT08 Challenge (arxiv/0802.1214), which deals with one of the major issues in observational cosmology, the effect of dark matter. As Douglas Applegate puts it:

We are organizing a competition specifically targeting the statistics and computer science communities. The challenge is to measure cosmic shear at a level sufficient for future surveys such as the Large Synoptic Survey Telescope. Right now, we’ve stripped out most of the complex observational issues, leaving a pure statistical inference problem. The competition kicks off this summer, but we want to give possible participants a chance to prepare.

The website www.great08challenge.info will provide continual updates on the competition.

[ArXiv] Post Model Selection, Nov. 7, 2007
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-post-model-selection-nov-7-2007/
Wed, 07 Nov 2007 15:57:01 +0000, by hlee

Today’s arxiv-stat email included papers by Poetscher and Leeb, who have been working on post model selection inference. Model selection is sometimes mistaken for a part of statistical inference; simply put, model selection can be considered a step prior to inference. How do you know whether your data come from a chi-square distribution or a gamma distribution? (This is a model selection problem with nested models.) Should I estimate the degrees of freedom k of the chi-square, or α and β of the gamma, to obtain the mean and its error? Will the errors of the mean be the same under both distributions?

Prior to estimating the means and errors of parameters, one wishes to choose a model in which the parameters of interest are properly embedded. The problem that arises is that one uses the same data to choose a model (e.g., choosing the model with the largest likelihood value or Bayes factor) as well as to perform statistical inference (estimating parameters, calculating confidence intervals, and testing hypotheses), which inevitably introduces bias. Such bias has been neglected in general (a priori, one declares which model to choose: e.g., the 2nd order polynomial is the absolute truth and the residuals are realizations of the error term – by the way, how can one be sure that the error follows a normal distribution?). Asymptotically, this bias is of order O(n^m) with m smaller than zero. Estimating this bias has been popular since Akaike introduced the AIC (one of the most well known model selection criteria). Numerous works are found in the field of robust penalized likelihood, and variable selection has been a very hot topic in recent decades. Beyond my knowledge, there were more approaches to coping with this bias so as not to contaminate the inference results.
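
As a toy version of the chi-square-versus-gamma question posed above, the sketch below fits both candidate models by maximum likelihood and picks one by AIC; the simulated data and the decision rule are my own illustrative choices, not anything from the papers. The point is the final comment: error bars quoted from the selected model alone treat the model as fixed in advance, which is precisely the bias Poetscher and Leeb study.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    data = stats.gamma.rvs(a=2.5, scale=1.3, size=300, random_state=rng)  # the "unknown" truth

    # Candidate 1: standard chi-square, with only the degrees of freedom free.
    df_hat, _, _ = stats.chi2.fit(data, floc=0, fscale=1)
    loglik_chi = stats.chi2.logpdf(data, df_hat).sum()

    # Candidate 2: gamma with free shape and scale (location fixed at zero).
    a_hat, _, scale_hat = stats.gamma.fit(data, floc=0)
    loglik_gam = stats.gamma.logpdf(data, a_hat, scale=scale_hat).sum()

    # AIC = 2 * (number of fitted parameters) - 2 * log-likelihood.
    aic_chi = 2 * 1 - 2 * loglik_chi
    aic_gam = 2 * 2 - 2 * loglik_gam
    chosen = "chi-square" if aic_chi < aic_gam else "gamma"
    print(f"AIC(chi-square) = {aic_chi:.1f}, AIC(gamma) = {aic_gam:.1f} -> select {chosen}")

    # Any confidence interval now computed from the chosen model alone ignores the
    # uncertainty introduced by the selection step itself.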

The works by Professors Poetscher and Leeb looked unique to me in this line of resolving the intrinsic bias arising from inference after model selection. Instead of being listed in my weekly arxiv lists, their arxiv papers deserve a separate posting. I also included some more general references.

The list of papers from today’s arxiv:

  • [stat.TH:0702703] Can one estimate the conditional distribution of post-model-selection estimators? by H. Leeb and B. M. Pötscher
  • [stat.TH:0702781] The distribution of model averaging estimators and an impossibility result regarding its estimation by B. M. Pötscher
  • [stat.TH:0704.1466] Sparse Estimators and the Oracle Property, or the Return of Hodges’ Estimator by H. Leeb and B. M. Poetscher
  • [stat.TH:0711.0660] On the Distribution of Penalized Maximum Likelihood Estimators: The LASSO, SCAD, and Thresholding by B. M. Poetscher, and H. Leeb
  • [stat.TH:0701781] Learning Trigonometric Polynomials from Random Samples and Exponential Inequalities for Eigenvalues of Random Matrices by K. Groechenig, B.M. Poetscher, and H. Rauhut

Other resources:

[Added on Nov.8th] There were a few more relevant papers from arxiv.

  • [stat.AP:0711.0993] Upper bounds on the minimum coverage probability of confidence intervals in regression after variable selection by P. Kabaila and K. Giri
  • [stat.ME:0710.1036] Confidence Sets Based on Sparse Estimators Are Necessarily Large by B. M. Pötscher