The AstroStat Slog » Binning
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

4754 d.f.
http://hea-www.harvard.edu/AstroStat/slog/2009/4754-df/
Tue, 17 Mar 2009 19:37:44 +0000, by hlee

I couldn’t believe my eyes when I saw 4754 degrees of freedom (d.f.) and a chi-square test statistic of 4859. I have often seen large degrees of freedom in astronomy journals, from several hundred to a few thousand, but I never felt comfortable with such big numbers. Then, to my great shock, 4754 d.f. appeared. I must find out why these huge degrees of freedom bother me so much.

When I was learning statistics, I never confronted such huge degrees of freedom. Given that only a small amount of time is spent on learning the chi-square goodness-of-fit test, that the chi-square distribution is a special case of the gamma distribution, and that statisticians do not handle a hundred thousand photons from X-ray telescopes (there are more low-count spectra, but I’ll discuss later why I chose this big number), almost surely no statistician would confront such huge degrees of freedom.

Degrees of freedom in spectral fitting are the combined result of binning (or grouping into n classes) and the number of free parameters (p), i.e. n-p-1. The parameters of interest, the targets to be optimized or solved for, come from physical source models, which are determined by the laws of physics. From the statistical point of view there is nothing to discuss about these source models except model selection and assessment, which seems to be an almost unexplored area. On the other hand, I’d like to know more about binning and the resulting degrees of freedom.

Two binning schemes that I often see in spectral analysis are requiring each bin to have more than 25 counts (the same notion as the 30 used in statistics for the CLT, or the last number in a t-table) and requiring the counts in each bin to satisfy a certain signal-to-noise ratio (S/N) level. The latter is equivalent to requiring that sqrt(expected counts) be larger than the given S/N level, since photon counts are Poisson distributed. There are more sophisticated adaptive binning strategies, but I haven’t found mathematical, statistical, or computational justifications for them. They look to me like empirical procedures discovered after much trial and error on particular types of spectra (I often wonder whether I could reproduce the goodness-of-fit results reported in those publications with the same ObsIDs). The point is that, whether the scheme is simple or complex, a data file with a large number of photons generally ends up with a larger n than an observation with sparse photons. This is why I happen to see d.f.s that are inconceivable to a statistician, like 4754, in some papers.
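The two simple grouping rules described above can be sketched as follows. This is only an illustration under stated assumptions: the function names and the channel counts are made up, observed counts stand in for expected counts, and real analysis packages may group channels differently.

```python
import math


def group_min_counts(counts, min_counts=25):
    """Merge adjacent channels, left to right, until each group holds
    at least min_counts photons (hypothetical helper)."""
    groups = []
    current = 0
    for c in counts:
        current += c
        if current >= min_counts:
            groups.append(current)
            current = 0
    # Fold a leftover low-count tail into the last complete group.
    if current > 0:
        if groups:
            groups[-1] += current
        else:
            groups.append(current)
    return groups


def min_counts_for_snr(snr):
    """For Poisson counts, S/N = sqrt(counts), so S/N >= snr
    requires counts >= snr**2."""
    return math.ceil(snr ** 2)


# Made-up channel counts, for illustration only.
channels = [3, 7, 12, 9, 30, 4, 4, 20, 1, 2]
grouped = group_min_counts(channels, min_counts=25)
print(grouped)
print(min_counts_for_snr(3))
```

Under this sketch an S/N level of 3 corresponds to at least 9 counts per group and an S/N level of 5 to at least 25, so the two rules differ only in the threshold they imply.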

First, the chi-square goodness-of-fit test was designed for agricultural data (or biology, considering Pearson’s eugenics), where the sample size is not on the scale of scores of thousands. Please note that a bin in astronomy is called a cell (class, interval, partition) in statistical papers and books showing applications of chi-square goodness-of-fit tests.

I’d also like to point out that the chi-square goodness-of-fit test is different from chi-square minimization, even though they share the same equation. The former is for hypothesis testing and the latter is for optimization (finding the best-fit solution). Using the same data for optimization and testing introduces bias. That’s one of the reasons why, with a large number of data points, cross validation techniques are employed in statistics and machine learning[1]. Since I consider binning a form of smoothing, the optimal number of bins and their sizes depend on data quality and the properties of the source model, as in kernel density estimation, in various versions of chi-square tests, or in distance-based nonparametric tests (the K-S test, for example).

Although it was published many decades ago, you might want to check out this paper for a proper rule of thumb for the number of bins:
“On the choice of the number of class intervals in the application of the chi square test” (JSTOR link) by Mann and Wald in The Annals of Mathematical Statistics, Vol. 13, No. 3 (Sep., 1942), pp. 306-317, where they showed that the number of classes is proportional to N^(2/5) (the paper gives the underlying idea of chi-square goodness-of-fit tests, a detailed derivation, and the exact equation for the number of classes). This is the reason why I chose a spectrum of 10^5 photons at the beginning. Ignoring the other factors in the equation, 10^5 counts yields roughly 100 bins. About 4000 bins implies more than a billion photons, which seems an unthinkable number in X-ray spectral analysis. Furthermore, many reports have said that Mann and Wald’s criterion results in too many bins and a loss of power. So n should probably be smaller than 100 for 10^5 photons.
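The arithmetic behind these two numbers can be checked with a few lines. This is a rough sketch that keeps only the N^(2/5) scaling and sets the proportionality constant to 1, as the text does when it ignores the other factors; it is not Mann and Wald’s exact equation, and the function names are hypothetical.

```python
def bins_from_counts(n_counts, constant=1.0):
    """Number of classes under the k = constant * N**(2/5) scaling.
    constant=1.0 is an assumption, not Mann and Wald's exact factor."""
    return constant * n_counts ** (2 / 5)


def counts_from_bins(n_bins, constant=1.0):
    """Invert the scaling: N = (k / constant)**(5/2)."""
    return (n_bins / constant) ** (5 / 2)


bins_for_1e5 = bins_from_counts(1e5)
counts_for_4000 = counts_from_bins(4000)
print(f"{bins_for_1e5:.0f} bins for 1e5 counts")
print(f"{counts_for_4000:.2e} counts for 4000 bins")
```

With the constant set to 1, 10^5 counts map to about 100 bins, and 4000 bins map to slightly more than 10^9 counts.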

The other issue with statistical analysis of X-ray spectra is that, although the photons in each channel/bin can be treated as independent samples, the expected numbers of photons across bins are related via the physical source model, or the so-called link function, to borrow a term from generalized linear models. However, the well-studied link functions in statistics do not match the source models of high energy astrophysics. Typically, source models are not analytical. They are non-linear, numerical, tabulated, or of black-box type, and so are incompatible with the current link functions of generalized linear models, a well-developed, diverse, and robust subject in statistics for inference problems. Therefore, binning data and chi-square minimization seem to be the only strategy so far for statistical inference about parameters in source models (for some “specific” statistical or physical models this is not true, but that is not a topic of this discussion). Mann and Wald’s method for class size assumes equiprobable bins, whereas channel or bin probabilities in astronomy would not satisfy that condition. The probability vector of the multinomial distribution depends on binning, detector sensitivity, and the source model rather than on the equiprobable constraint from statistics. So it is hard to devise a purely statistically optimal binning/grouping method for X-ray spectral analysis.

Instead of smoothing that depends on individual groups/bins (S/N>3 grouping, for example), I nevertheless wish for binning/grouping schemes based on the total sample size N, particularly when N is large. I’m afraid that with the current chi-square test embedded in data analysis packages, the power of the chi-square statistic is so small that one will always get a good reduced chi-square value (the reduced chi-square is astronomers’ simple model assessment tool: the chi-square statistic divided by the degrees of freedom, whose expected value is one. If the reduced chi-square is close to one, then the chosen source model and the solution for its parameters are considered to be the best-fit model and values). The fundamental idea of a suitable number of bins is equivalent to the optimal bandwidth problem in kernel density estimation, whose objective is to accentuate the information via smoothing; therefore, methodology developed in the field of kernel density estimation may suggest how to bin/group a spectrum while preserving most of the information and increasing efficiency. A modified strategy for binning and for applying the chi-square test statistic to assess model adequacy should be conceived, instead of reporting thousands of degrees of freedom.
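To make the reduced chi-square criterion concrete, here is a small sketch applied to the numbers that opened this post (chi-square statistic 4859 with 4754 d.f.). The tail probability uses the large-d.f. normal approximation, in which a chi-square variable with k d.f. has mean k and variance 2k; that approximation and the function names are choices made for this illustration, not something taken from a particular analysis package.

```python
import math


def reduced_chi_square(chi2, dof):
    """Chi-square statistic divided by the degrees of freedom."""
    return chi2 / dof


def upper_tail_normal_approx(chi2, dof):
    """Approximate P(X >= chi2) for X ~ chi-square(dof), treating X as
    Normal(mean=dof, variance=2*dof). Reasonable only for large dof."""
    z = (chi2 - dof) / math.sqrt(2.0 * dof)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


chi2, dof = 4859.0, 4754
reduced = reduced_chi_square(chi2, dof)
p_value = upper_tail_normal_approx(chi2, dof)
print(f"reduced chi-square = {reduced:.3f}")
print(f"approximate upper-tail probability = {p_value:.2f}")
```

Under these assumptions the reduced chi-square comes out near 1.02 and the approximate tail probability is roughly 0.14, so by the usual criterion the fit would be accepted. That says nothing about whether the test had enough power to reject an inadequate model.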

I think I must quit before this gets too boring. I’d only like to mention some quite interesting papers that cite Mann and Wald (1942) and explore the chi-square goodness-of-fit test, including Johnson’s A Bayesian chi-square test for Goodness-of-Fit (a link is made to the arxiv pdf file), which might appeal to astronomers who like to modify their chi-square methods in a Bayesian way. The chapter “On the Use and Misuse of Chi-Square” (link to google book excerpt) by KL Delucchi in A Handbook for Data Analysis in the Behavioral Sciences (1993) is quite an intriguing read, although the discussion is a reminder for behavioral scientists.

Lastly, I’m very sure that astronomers have explored the properties of the chi-square statistic and chi-square type tests with their data sets. I admit that I didn’t go on an expedition for such works, since they are a few needles in a haystack. I’ll be delighted to see an astronomers’ version of “use and misuse of chi-square,” or a statistical account of whether the chi-square test with huge degrees of freedom is powerful enough; any advice on the matter will be very much appreciated.

  1. A rough sketch of cross validation: split the data into a training set and a test set. Get the best fit from the training set and evaluate the goodness of fit of that best fit on the test set. Alternate the training and test sets and repeat. wiki:cross_validation
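The steps in this footnote can be sketched with a toy example. Everything here is an assumption made for illustration: the data are simulated from a straight line with known Gaussian errors, the model is fitted by ordinary least squares, and the folds are formed by interleaving points. A real spectral fit would use a physical source model and Poisson counts instead.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def chi_square(xs, ys, sigma, slope, intercept):
    """Sum of squared residuals scaled by the known error sigma."""
    return sum(((y - (slope * x + intercept)) / sigma) ** 2
               for x, y in zip(xs, ys))


def cross_validate(xs, ys, sigma, n_folds=2):
    """Fit on the training folds, score on the held-out fold, and
    rotate so that each fold serves once as the test set."""
    indices = list(range(len(xs)))
    scores = []
    for fold in range(n_folds):
        test = indices[fold::n_folds]  # every n_folds-th point
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        chi2 = chi_square([xs[i] for i in test], [ys[i] for i in test],
                          sigma, slope, intercept)
        scores.append(chi2 / len(test))  # chi-square per test point
    return scores


# Simulated data: a straight line plus Gaussian noise (made up).
rng = random.Random(0)
sigma = 0.5
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, sigma) for x in xs]
scores = cross_validate(xs, ys, sigma)
print(scores)
```

Because the held-out points play no part in the fit, no parameters are subtracted when the chi-square is scaled by the number of test points; for an adequate model each score should scatter around one.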
[ArXiv] 2nd week, June 2008
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-2nd-week-june-2008/
Mon, 16 Jun 2008 14:47:42 +0000, by hlee

As Prof. Speed said, PCA is prevalent in astronomy, particularly this week. Furthermore, a paper explicitly discusses R, a popular statistics package.

  • [astro-ph:0806.1140] N.Bonhomme, H.M.Courtois, R.B.Tully
        Derivation of Distances with the Tully-Fisher Relation: The Antlia Cluster
    (The Tully-Fisher relation is well known and is one of many occasions where statistics could help. On the other hand, astronomical biases as well as measurement errors hinder the collaboration.)
  • [astro-ph:0806.1222] S. Dye
        Star formation histories from multi-band photometry: A new approach (Bayesian evidence)
  • [astro-ph:0806.1232] M. Cara and M. Lister
        Avoiding spurious breaks in binned luminosity functions
    (I think that binning is not always necessary and is overused, while alternatives exist.)
  • [astro-ph:0806.1326] J.C. Ramirez Velez, A. Lopez Ariste and M. Semel
        Strength distribution of solar magnetic fields in photospheric quiet Sun regions (PCA was utilized)
  • [astro-ph:0806.1487] M.D.Schneider et al.
        Simulations and cosmological inference: A statistical model for power spectra means and covariances
    (They used R and its package for Latin hypercube samples, lhs.)
  • [astro-ph:0806.1558] Ivan L. Andronov et al.
        Idling Magnetic White Dwarf in the Synchronizing Polar BY Cam. The Noah-2 Project (PCA is applied)
  • [astro-ph:0806.1880] R. G. Arendt et al.
        Comparison of 3.6 – 8.0 Micron Spitzer/IRAC Galactic Center Survey Point Sources with Chandra X-Ray Point Sources in the Central 40×40 Parsecs (K-S test)
[ArXiv] 1st week, Oct. 2007
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-1st-week-oct-2007/
Sat, 06 Oct 2007 16:45:19 +0000, by hlee

This week, instead of only filtering AstroStatistics-related papers from arxiv, I chose additional arxiv/astro-ph papers related to CHASC folks’ astrophysical projects. Some of the papers you see this week do not contain sophisticated statistical analysis but do contain data from specific satellites and possibly relevant information for CHASC projects. Due to CHASC’s long history (we are celebrating the 10th birthday this year) and my being a newbie to CHASC, I may not pick up all papers related to the projects of current, former, and future CHASC members and dedicated slog readers. To help me create a satisfying posting every week, your input is welcome to improve my adaptive filter. This week’s list follows.

  • [astro-ph:0709.4598]
    Upper Limits from Hess Observations of AGN in 2005-2007 by Benbow and Buehler
  • [physics.data-an:0709.3662] provides physical insights toward some families of probability distributions
    Econophysics, Statistical Mechanics Approach to by V.M.Yakovenko
  • [astro-ph:0709.4488] could motivate developing machine learning algorithms.
    Determining the Type, Redshift, and Age of a Supernova Spectrum by S. Blondin and J.L. Tonry
  • [astro-ph:0709.4531]
    A Problem with the Clustering of Recent Measures of the Distance to the Large Magellanic Cloud by B. E. Schaefer
  • [astro-ph:0709.4601]
    Multiple stellar populations in Globular Clusters: collection of information from the Horizontal Branch by F. D’Antona and V. Caloi
  • [astro-ph:0710.0370]
    MegaPipe: the MegaCam image stacking pipeline at the Canadian Astronomical Data Centre by S. D. J. Gwyn
  • [astro-ph:0710.0373]
    To Bin or Not To Bin: Decorrelating the Cosmic Equation of State by R. de Putter and E. V. Linder
  • [astro-ph:0710.0619] About EGRET and GLAST
    Unresolved Unidentified Source Contribution to the Gamma-ray Background by V. Pavlidou et al.
  • [astro-ph:0710.0757] About SOHO(MDI) and RHESSI
    The Cause of Photospheric and Helioseismic Responses to Solar Flares: High-Energy Electrons or Protons? by A. G. Kosovichev
  • [astro-ph:0710.0774]
    NGC 346 in the Small Magellanic Cloud. III. Recent Star Formation and Stellar Clustering Properties in the Bright HII Region N 66 by E. Hennekemper et al.
  • [astro-ph:0710.0874] discusses GLAST as well.
    Constraints on Galactic populations of gamma-ray emitters from the unidentified EGRET sources by J. M. Siegal-Gaskins et al.
  • [astro-ph:0710.0875]
    Evidence of Cosmic Evolution of the Stellar Initial Mass Function by P. van Dokkum