The AstroStat Slog » normal
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

[SPS] Testing Completeness
http://hea-www.harvard.edu/AstroStat/slog/2008/sps-completeness/
Wed, 19 Nov 2008 05:34:59 +0000, hlee

There will be a special session at the 213th AAS meeting on meaning from surveys and population studies (SPS). Until then, it might be useful to pull out some interesting and relevant papers and questions/challenges as a preliminary to the meeting. I will not merely list astronomical catalogs and surveys, which are literally countless these days, but will bring up some of them, with a description of the catalog, if they changed the way science is performed (the best example, to my knowledge, is SDSS, the Sloan Digital Sky Survey).

The main focus of this series of postings (I’m not sure how many there will be; the [SPS] series may end after this season) is to introduce some statistical challenges, including data management, that are apt to arise from astronomical surveys and population studies. My paper selection is based on the group discussions of the SPS working group during the SAMSI astrostatistics program in 2006 (the group leaders were G. Babu, Director of CASt, and T. Loredo).

Completeness – I. Revised, reviewed and revived by Johnston, Teodoro, and Hendry
MNRAS, 376(4), pp. 1757-176
Abstract (abridged to the first paragraph) We have extended and improved the statistical test recently developed by Rauzy for assessing the completeness in apparent magnitude of magnitude-redshift surveys. Our improved test statistic retains the robust properties – specifically independence of the spatial distribution of galaxies within a survey – of the Tc statistic introduced in Rauzy’s seminal paper, but now accounts for the presence of both a faint and bright apparent magnitude limit. We demonstrate that a failure to include a bright magnitude limit can significantly affect the performance of Rauzy’s Tc statistic. Moreover, we have also introduced a new test statistic, Tv, defined in terms of the cumulative distance distribution of galaxies within a redshift survey. These test statistics represent powerful tools for identifying and characterizing systematic errors in magnitude-redshift data.

One of the authors was an active participant in the SPS working group at SAMSI. The following three quotes are genuinely statistical in content, although the paper was published in MNRAS.

It is straightforward to show from this definition that the random variable η has a uniform distribution on the interval [0,1], and furthermore that η and Z are statistically independent.

If the sample is complete in apparent magnitude, for a given pair of trial magnitude limits, then Tc should be normally distributed with mean zero and variance unity. If, on the other hand, the trial faint (bright) magnitude limit is fainter (brighter) than the true limit, Tc will become systematically negative, due to the systematic departure of the $$\hat{\eta}_i$$ distribution from uniform on the interval [0,1].

If the sample is complete in apparent magnitude, for a given pair of trial magnitude limits, then Tv should be normally distributed with mean zero, and variance unity. If, on the other hand, the trial faint (bright) magnitude limit is fainter (brighter) than the true limit, in either case Tv will become systematically negative, due to the systematic departure of the $$\hat{\tau}_i$$ distribution from uniform on the interval [0,1].
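To see what the quoted normality claims amount to, here is a minimal Python sketch. It does not implement the paper’s Tc or Tv estimators. It assumes idealized, independent Uniform(0,1) values in place of the estimated $$\hat{\eta}_i$$, and the names tc_like, simulate, and skew are hypothetical, introduced only for illustration. Under that assumption each term has mean 1/2 and variance 1/12, so the standardized sum should have mean near zero and variance near one. Pushing the values toward zero, a toy stand-in for a trial limit set beyond the true one, drives the statistic negative.

```python
import math
import random


def tc_like(etas):
    """Standardized sum of (eta - 1/2).

    Assumption: the etas are independent Uniform(0,1), so each term has
    variance 1/12. This is a simplified stand-in, not the paper's estimator.
    """
    n = len(etas)
    return sum(e - 0.5 for e in etas) / math.sqrt(n / 12.0)


def simulate(n_gal=400, n_rep=1000, skew=0.0, seed=0):
    """Mean and variance of the statistic over n_rep mock samples.

    skew > 0 replaces u by u**(1 + skew), which shifts mass toward 0.
    This is a hypothetical toy model of incompleteness, not a survey model.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_rep):
        etas = [rng.random() ** (1.0 + skew) for _ in range(n_gal)]
        stats.append(tc_like(etas))
    mean = sum(stats) / n_rep
    var = sum((s - mean) ** 2 for s in stats) / (n_rep - 1)
    return mean, var


complete_mean, complete_var = simulate()
incomplete_mean, incomplete_var = simulate(skew=0.2)
print("complete sample:   mean %.3f, variance %.3f" % (complete_mean, complete_var))
print("incomplete sample: mean %.3f" % incomplete_mean)
```

The simulation only checks the first two moments. It says nothing about the rigorous asymptotic argument, which depends on how the estimated values are actually constructed from the data.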

Their statistics are used as a diagnostic tool: the estimated value of the statistic serves as an indicator of completeness at a given magnitude. Otherwise, asymptotic studies could have been pursued in depth so that people who use these statistics (Tc and Tv) could obtain p-values (for hypothesis testing) and confidence intervals. The authors, however, computed the means and variances and stated that these statistics are standard normal without rigorous proofs. On the other hand, the process of estimating Tc and Tv is nonparametric, so further statistical inference, such as showing that Tc and Tv are asymptotically normal, can be very challenging unless strong assumptions on (probabilistic) models and/or priors are given. Overall, I find these statistics more statistically appealing for testing completeness than other ratio-based methods.

Testing completeness no longer seems a difficult task, thanks to these statistics, extensive survey catalogs, and a better understanding of populations. However, uncertainties in the k-correction, e-correction, and extinction correction still make these statistics fuzzy and the results difficult to interpret. Changes in the statistics due to these uncertainties are hard to characterize. Furthermore, obtaining good (point) estimators for these correction terms remains almost unconquered.

In addition to the completeness testing described in the above paper, I’ve seen modeling efforts regarding incompleteness in x-ray astronomy, basically based on the power law, whose slope parameter is an indicator of cosmological models. Unfortunately, incompleteness makes the slope estimation complex, and much effort goes into searching for or estimating a model that reflects this incompleteness in observations as a function of redshift or magnitude; with a complete data set, it would be a matter of fitting a simple ordinary linear regression model.

I believe that someday incompleteness will be stochastically modeled (parameterized to draw information and to offer good prediction) beyond testing, and will offer a better understanding of the visible universe (visible here is a very broad concept, not limited to what can be seen with the naked human eye). For a while, (in)completeness has been a concept and a word to whose meaning mathematical compactness and statistical modeling have never been attached in order to test and to understand uncertainties.

p.s. I have been paying a lot of attention to citation style; yet, as you may have noticed, my citations are far from consistent. Two noticeable differences between the citation styles of statistics and astronomy are the abbreviation of journal names and the inclusion of titles. Astronomers’ citations are compact, concise, and the same across astronomical journals; statisticians’ citations, on the contrary, are lengthy, informative (because of the title), and vary across statistical and applied statistics journals. MNRAS reminded me of a paper by a very renowned statistician that cited a paper from MNRAS but called it Monograph National Royal Astronomical Society. I think you will now be gracious toward my citation style.

[disclaimer] I have seen various population studies in astronomy across a broad wavelength range, each with different objectives, targets, obstacles, and study designs (even the telescopes, detectors, data pipelines, and sampling schemes differ), and (in)completeness studies are designed to reflect those differences. I’m afraid that I’m only reporting a tiny fraction of all efforts related to (in)completeness. Your comments are most welcome. I also look forward to your posts and comments on (in)completeness, volume/magnitude limited samples, survey studies, upper limits, missing values in surveys, clustering, spatial distribution, large scale structure, etc. in the near future.

Why Gaussianity?
http://hea-www.harvard.edu/AstroStat/slog/2008/why-gaussianity/
Wed, 10 Sep 2008 14:15:03 +0000, hlee

Physicists believe that the Gaussian law has been proved in mathematics while mathematicians think that it was experimentally established in physics — Henri Poincare

I couldn’t help quoting this from the following article (subscription required).[1]

Why Gaussianity? by Kim, K. and Shevlyakov, G. (2008) IEEE Signal Processing Magazine, Vol. 25(2), pp. 102-113

It’s been a while since my post, signal processing and bootstrap, on a paper from IEEE Signal Processing Magazine, which is described as publishing tutorial style papers on signal processing research and applications. Because of its tutorial style, the magazine delivers the most up-to-date information and applications to people in various disciplines (its citation rate is quite high among scientific fields where data are collected via digitization, except astronomy; this statement is based solely on my experience, and no proper test was carried out to test this hypothesis). This provocative title will, perhaps, draw astronomers’ attention to advances in signal processing in the future.

A historical account of the Gaussian distribution, which goes by the name normal distribution among statisticians, is given: de Moivre, before Laplace, found the distribution; Laplace, before Gauss, derived its properties. The paper illustrates the derivations by Gauss, Herschel (yes, the astronomer), Maxwell (no need to mention his important contribution), and Landon, along with the following properties:

  • the convolution of two Gaussian functions is another Gaussian function
  • the Fourier transform of a Gaussian function is another Gaussian function
  • the central limit theorem (CLT)
  • maximizing entropy
  • minimizing Fisher information
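
The first property in the list can be checked numerically. The following is a minimal Python sketch, not taken from the paper: it assumes the two Gaussian functions are normalized densities, and the helper names gauss_pdf and convolve_at, the grid settings, and the example parameters are my own choices for illustration. The convolution integral is approximated by a midpoint Riemann sum and compared with the Gaussian density whose mean is the sum of the two means and whose variance is the sum of the two variances.

```python
import math


def gauss_pdf(x, mu, sigma):
    """Gaussian (normal) density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def convolve_at(x, mu1, s1, mu2, s2, half_width=12.0, steps=4000):
    """Midpoint-rule approximation of the integral of f1(t) * f2(x - t) dt.

    The integration range covers mu1 +/- half_width * s1; the tails beyond
    that range are negligible for the first density.
    """
    lo = mu1 - half_width * s1
    dt = 2.0 * half_width * s1 / steps
    total = 0.0
    for k in range(steps):
        t = lo + (k + 0.5) * dt
        total += gauss_pdf(t, mu1, s1) * gauss_pdf(x - t, mu2, s2)
    return total * dt


# Example parameters (arbitrary, for illustration only).
mu1, s1, mu2, s2 = 0.5, 1.0, -1.0, 2.0
sigma_sum = math.hypot(s1, s2)  # sqrt(s1**2 + s2**2)

max_err = max(
    abs(convolve_at(x, mu1, s1, mu2, s2) - gauss_pdf(x, mu1 + mu2, sigma_sum))
    for x in (-4.0, -1.0, 0.0, 2.0, 5.0)
)
print("largest absolute difference:", max_err)
```

In probabilistic terms this is the statement that the sum of two independent normal random variables is again normal.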

You will find the pros and cons of Gaussianity in the concluding remarks.

  1. Wikiquote says it’s misattributed, and I don’t know French. My guess at matching the quotes, based on translations from French into English, could be wrong. Please correct me.
Coverage issues in exponential families
http://hea-www.harvard.edu/AstroStat/slog/2007/interval-estimation-in-exponential-families/
Thu, 16 Aug 2007 20:36:51 +0000, hlee

I’ve heard so much about coverage problems from astrostat/phystat groups, without knowing the fundamental reasons (most likely physics). This paper might be of interest to them: Interval Estimation in Exponential Families by Brown, Cai, and DasGupta; Statistica Sinica (2003), 13, pp. 19-49

Abstract summary:
The authors investigated interval estimation of the mean in exponential families: the binomial, Poisson, negative binomial, normal, gamma, and a sixth distribution (NEF-GHS). The Wald interval is known to perform poorly, with significant negative bias in its coverage, not only in discrete cases but also in nonnormal continuous cases. Their computations suggest that the equal-tailed Jeffreys interval and the likelihood ratio interval are the best alternatives to the Wald interval.

Brief summary of the paper without equations:
The objective of this paper is interval estimation of the mean in the natural exponential family (NEF) with quadratic variance functions (QVF), with particular focus given to the discrete NEF-QVF families: the binomial, negative binomial, and Poisson distributions. It is well known that the Wald interval for a binomial proportion suffers from a systematic negative bias and oscillation in its coverage probability even for large n and p near 0.5, which seems to arise from the lattice nature and the skewness of the binomial distribution. They illustrated this systematic bias and oscillation with Poisson cases to show the poor and erratic behavior of the Wald interval in lattice problems. They proved the bias expressions for the three discrete NEF-QVF distributions and added a disconcerting graphical illustration of this negative bias.

Interested readers should check Figure 4, where the performances of the Wald, score, likelihood ratio (LR), and Jeffreys intervals are compared. Figure 5 illustrates the limits of those four intervals: the LR and Jeffreys intervals are indistinguishable. They derived the coverage probabilities of the four intervals via Edgeworth expansions. The nonoscillating O(n^-1) terms from the Edgeworth expansions were studied to compare the coverage properties of these four intervals. Figure 6 shows that the Wald interval has serious negative bias, whereas the nonoscillating term in the score interval is positive for all three of the binomial, negative binomial, and Poisson distributions. The negative bias of the Wald interval is also found for continuous distributions such as the normal, gamma, and NEF-GHS distributions (Figure 7).
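
The coverage behavior described above can be reproduced for the binomial case with a short computation. This is a minimal Python sketch under my own choices, not the paper’s code: it uses n = 50, a nominal 95% level, a grid of p from 0.05 to 0.95, and the closed-form score (Wilson) interval; the function names are hypothetical. The exact coverage at a given p is the total binomial probability of the outcomes x whose interval contains p. The LR and Jeffreys intervals are left out because they need numerical root finding or beta quantiles.

```python
import math

Z95 = 1.959963984540054  # 0.975 quantile of the standard normal


def binom_pmf(x, n, p):
    """Binomial probability mass function."""
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def wald_interval(x, n, z=Z95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = x / n
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half, p_hat + half


def score_interval(x, n, z=Z95):
    """Score (Wilson) interval for a binomial proportion."""
    p_hat = x / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


def coverage(interval, n, p):
    """Exact coverage probability of an interval procedure at a given p."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = interval(x, n)
        if lo <= p <= hi:
            total += binom_pmf(x, n, p)
    return total


n = 50
p_grid = [k / 100 for k in range(5, 96)]
wald_mean = sum(coverage(wald_interval, n, p) for p in p_grid) / len(p_grid)
score_mean = sum(coverage(score_interval, n, p) for p in p_grid) / len(p_grid)
print("mean coverage, Wald:  %.4f" % wald_mean)
print("mean coverage, score: %.4f" % score_mean)
```

Printing coverage(wald_interval, n, p) for each p in the grid, rather than the mean, shows the oscillation from one value of p to the next that the lattice structure produces.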

In conclusion, they reconfirmed their finding that the LR and Jeffreys intervals are the best alternatives to the Wald interval in terms of the negative bias in coverage and the length. The Rao score interval has the merit of easy presentation, but its performance is inferior to that of the LR and Jeffreys intervals, although it is better than the Wald interval. Yet the authors left room for users: choosing one of these intervals is a personal choice.

[Addendum] I wonder whether the statistical properties of Gehrels’ confidence limits have been studied since their publication. I’ll try to post findings about the statistics of Gehrels’ confidence limits shortly (hopefully).
