The AstroStat Slog » goodness-of-fit test
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

Goodness-of-fit tests (Tue, 06 Oct 2009, by hlee)
http://hea-www.harvard.edu/AstroStat/slog/2009/goodness-of-fit-tests/

When it comes to applying statistics for measuring goodness-of-fit, the Pearson χ2 test is the dominant player in the race and the Kolmogorov-Smirnov test statistic trails far behind. Although they are almost invisible in this race, there are many other nonparametric statistics for testing goodness-of-fit and for comparing a sampling distribution to a reference distribution, all legitimate participants trained by statisticians. Listing their names will probably be useful to astronomers who find that the underlying assumptions of the χ2 test do not match their data, or who want to try nonparametric test statistics other than the K-S test. I've seen such test statistics in astronomical journals from time to time. Depending on the data and their statistical properties, one test statistic can work better than another; it is therefore worthwhile to keep in mind that there are tests beyond the χ2 goodness-of-fit test statistic.

This is the list I can think of at the moment; each test is linked to Wikipedia for more details.

Before my updates, I welcome comments that can grow this list. I'd also appreciate it if your comment includes an explanation of when your recommended nonparametric test works better, along with a short description of your data's characteristics. And don't forget to examine a Q-Q plot before discussing the implications of p-values from these test statistics.
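For those who want to try tests beyond the χ2, a minimal sketch using scipy.stats; the sample and the reference distribution here are invented for illustration, and the same calls accept any hypothesized distribution:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=200)  # hypothetical sample

# Kolmogorov-Smirnov: supremum distance between empirical and reference CDFs
ks = stats.kstest(x, "norm")

# Cramer-von Mises: integrated squared distance between the two CDFs
cvm = stats.cramervonmises(x, "norm")

# Anderson-Darling: a CvM-type statistic with extra weight in the tails
ad = stats.anderson(x, dist="norm")

print(f"K-S: D={ks.statistic:.3f}, p={ks.pvalue:.3f}")
print(f"CvM: W={cvm.statistic:.3f}, p={cvm.pvalue:.3f}")
print(f"A-D: A^2={ad.statistic:.3f}, critical values={ad.critical_values}")

# Q-Q plot against the reference distribution, as advised above
stats.probplot(x, dist="norm", plot=plt)
plt.show()
```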

Curious Cases of the Null Hypothesis Probability (Tue, 02 Jun 2009, by hlee)
http://hea-www.harvard.edu/AstroStat/slog/2009/curious-cases-of-the-null-hypothesis-probability/

Even though I traced astronomers' casual usage of the null hypothesis probability, reported in the fashion of screen outputs from the data analysis packages of their choice, some curious cases of the null hypothesis probability remained that I couldn't resolve. They are quite mysterious to me. Sometimes too much creativity harms the original intention. Here are some examples.

A full-text search in ADS for "null hypothesis probability" yields 77 related articles (link removed; the search results appear to be floating URLs). Many of them contain the phrase "null hypothesis probability" as is; the rest use it in the context of "given the null hypothesis, the probability of …". I'm not sure this ADS search covers null hypothesis probabilities written in tables and captions, so more than 77 could exist. The majority of articles with the "null hypothesis probability" simply report numbers from the screen output of the chosen data analysis system. Discussions and interpretations of these numbers focus on a reduced χ2 close to ONE, astronomers' most favored model selection criterion. Sometimes I got confused about the goal of the fitting analysis, because the driving force is to "make the reduced chi-square close to one and make the residuals look good". Instead of serving statistical inference, the statistic works as an objective function. The numerical criterion (chi-square) or the pictorial one (residuals) overshadows the fundamentals: one observes a relatively low number of photons under a Poisson distribution, and those photons are convolved with complicated instruments. The reduced chi-square can be off from unity for such reasons, while from the standpoint of robust statistics one could still call the model a good fit.
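For reference, as far as I can tell from playing with XSPEC, the reported "null hypothesis probability" is simply the upper-tail probability of the χ2 distribution at the observed fit statistic; a minimal sketch, using the numbers from the first quote below:

```python
from scipy.stats import chi2

# "Null hypothesis probability" as reported by fitting packages:
# the probability of a chi-square statistic at least this large,
# assuming the model is true and the statistic ~ chi2 with n - p d.o.f.
stat, dof = 859.0, 786          # values quoted below (Worrall & Birkinshaw)
p = chi2.sf(stat, dof)          # survival function = 1 - CDF
print(f"reduced chi-square = {stat/dof:.2f}, null hyp. prob. = {p:.3f}")
# -> about 0.04, matching the "4 percent" quoted below
```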

Apart from the business of the chi-square method, one thing I wanted to point out from this "null hypothesis probability" investigation is a clear distinction, in both presentation style and field, between papers using "the null hypothesis probability" (spectral model fitting) and papers using "given the null hypothesis, the probability of …" (cosmology). Beyond this casual and personal observation about the style difference, the following quotes left me in despair because I couldn't find answers in statistics.

  • MNRAS, v.340, pp.1261-1268 (2003): The temperature and distribution of gas in CL 0016+16 measured with XMM-Newton (Worrall and Birkinshaw)

    With reduced chi square of 1.09 (chi-sq=859 for 786 d.o.f) the null hypothesis probability is 4 percent but this is likely to result from the high statistical precision of the data coupled with small remaining systematic calibration uncertainties

    I couldn't understand why p-value = 0.04 is associated with "the high statistical precision of the data coupled with small remaining systematic calibration uncertainties." Is it a polite way of saying the chi-square method is wrong because of systematic uncertainty? Or does this mean the statistical uncertainty is underestimated due to its correlation with the systematic uncertainty? Or, beyond being a p-value, does the null hypothesis probability carry some philosophical meaning? I could go on with strange questions because of the statistical ambiguity of the statement. I'd appreciate any explanation of how the p-value (the null hypothesis probability) supports the subsequent interpretation.

    Another miscellaneous question: if the number (the null hypothesis probability) from the software package is unfavorable or uninterpretable, can we attribute such ambiguity to systematic error?

  • MNRAS, v. 345(2),pp.423-428 (2003): Iron K features in the hard X-ray XMM-Newton spectrum of NGC 4151 (Schurch, Warwick, Griffiths, and Sembay)
    The result of these modifications was a significantly improved fit (chi-sq=4859 for 4754 d.o.f). The model fit to the data is shown in Fig. 3 and the best-fitting parameter values for this revised model are listed as Model 2 in Table 1. The null hypothesis probability of this latter model (0.14) indicates that this is a reasonable representation of the spectral data to within the limits of the instrument calibration.

    What is the rule-of-thumb interpretation of p-values, or of this null hypothesis probability, in astronomy? How does one know it is "reasonable," as the authors state? How does one know the limits of the instrument calibration, and compare them quantitatively? And what about the degrees of freedom: some thousands, so large! According to the guideline for the number of bins[1], even a million photons would not justify that many; I suspect that applying the chi-square goodness-of-fit test to data with such large degrees of freedom makes the test too conservative, i.e., it loses power to reject. Also, a distinction should be made between the chi-square minimization tactic and the chi-square goodness-of-fit test: using the same data for both procedures introduces bias.

  • MNRAS, v. 354, pp.10-24 (2004): Comparing the temperatures of galaxy clusters from hydrodynamical N-body simulations to Chandra and XMM-Newton observations (Mazzotta, Rasia, Moscardini, and Tormen)

    In particular, solid and dashed histograms refer to the fits for which the null hypothesis has a probability >5 percent (statistically acceptable fit) or <5 percent (statistically unacceptable fit), respectively. We also notice that the reduced chi square is always very close to unity, except in a few cases where the lower temperature components is at T~2keV, …

    The last statement obscures the interpretation even further: what does "statistically (un)acceptable fit" really mean? The notion of how well a model fits the data, and of how to test such a hypothesis, seems to differ between statistics and astronomy.

  • MNRAS, v.346(4), pp.1231-1241: X-ray and ultraviolet observations of the dwarf nova VW Hyi in quiescence (Pandel, Córdova, and Howell)

    As can be seen in the null hypothesis probabilities, the cemekl model is in very good agreement with the data.

    The null hypothesis probabilities computed from their Table 1 are 8.4, 25.7, 42.2, 1.6, 0.7*, and 13.1 percent (* is the result for the MKCFLOW model; the rest are for the CEMEKL model). Presumably the criterion is to reject a model when the p-value falls below 0.01, so that the CEMEKL model cannot be rejected but the MKCFLOW model can. A single MKCFLOW fit, which happened to produce a small p-value, is taken to show that MKCFLOW disagrees with the data while the alternative, the CEMEKL model, is a good model. This is too simplified a model selection/assessment procedure. I wonder why CEMEKL was tried with various settings while MKCFLOW was tried only once. I suppose there is an astrophysical reason for conducting such a model comparison, but statistically speaking it looks like comparing measurements of five different kinds of oranges and one apple with the same ruler (the null hypothesis probability from the chi-square fitting). From the experimental design viewpoint, this is not a well designed study.

  • MNRAS, 349, 1267 (2004): Predictions on the high-frequency polarization properties of extragalactic radio sources and implications for polarization measurements of the cosmic microwave background (Tucci et al.)

    The correlation is less clear in the samples at higher frequencies (r~ 0.2 and a null-hypothesis probability of >10^{-2}). However, these results are probably affected by the variability of sources, because we are comparing data taken at different epochs. A strong correlation (r>0.5 and a null-hypothesis probability of <10^{-4}) between 5 and 43 GHz is found for the VLA calibrators, the polarization of which is measured simultaneously at all frequencies.

    I wonder what test statistic was used to compute those p-values, and whether they truly meant p-value > 0.01; at that level, most tools report a more precise number, which would allow a more suitable statement. The p-value (the "null hypothesis probability") here is for testing whether r = 0 or not. Even when r is small, say 0.2, one can still reject the null hypothesis at the 0.05 threshold given a large enough sample (see the sketch after this list). Therefore ">10^{-2}" only adds ambiguity. I think point estimates are enough to report the existence of weak and rather strong correlations; otherwise, reporting both p-values and powers seems more appropriate.

  • A&A, 342, 502 (1999): X-ray spectroscopy of the active dM stars: AD Leo and EV Lac
    (S. Sciortino, A. Maggio, F. Favata and S. Orlando)

    This fit yields a value of chi square of 185.1 with 145 ν corresponding to a null-hypothesis probability of 1.4% to give an adequate description of the AD Leo coronal spectrum. In other words the adopted model does not give an acceptable description of available data. The analysis of the uncertainties of the best-fit parameters yields the 90% confidence intervals summarized in Table 5, together with the best-fit parameters. The confidence intervals show that we can only set a lower-limit to the value of the high-temperature. In order to obtain an acceptable fit we have added a third thermal MEKAL component and have repeated the fit leaving the metallicity free to vary. The resulting best-fit model is shown in Fig. 7. The fit formally converges with a value of chi square of 163.0 for 145 ν corresponding to a probability level of ~ 9.0%, but with the hotter component having a “best-fit” value of temperature extremely high (and unrealistic) and essentially unconstrained, as it is shown by the chi square contours in Fig. 8. In summary, the available data constrain the value of metallicity to be lower than solar, and they require the presence of a hot component whose temperature can only be stated to be higher than log (T) = 8.13. Available data do not allow us to discriminate between the (assumed) thermal and a non-thermal nature of this hot component.
    …The fit yields a value of χ2 of 95.2 (for 78 degrees of freedom) that corresponds to a null hypothesis probability of 2.9%, i.e. a marginally acceptable fit. The limited statistic of the available spectra does not allow us to attempt a fit with a more complex model.

    After adding the MEKAL component, why do the degrees of freedom remain the same? Also, what do they mean by "the limited statistic of the available spectra"?

  • MNRAS, 348, 529 (2004): Powerful, obscured active galactic nuclei among X-ray hard, optically dim serendipitous Chandra sources (Gandhi, Crawford, Fabian, Johnstone)

    …, but a low f-test probability for this suggests that we cannot constrain the width with the current data.
    While the rest frame equivalent width of the line is close to 1keV, its significance is marginal (f-test gives a null hypothesis probability of 0.1).

    Without a contingency table or a comparison of nested models, I was not sure how they carried out the F-test, and I could not find the two degrees of freedom the F-test requires. From XSPEC's account of the F-test (http://heasarc.gsfc.nasa.gov/docs/xanadu/xspec/manual/XSftest.html), we see two degrees of freedom; without them, no probability can be computed (see the sketch after this list). Their usage of the F-test seems unconventional: the conventional application of the F-test is comparing the effects of multiple treatments (different levels of drug dosage including a placebo, for example); otherwise one is back to a chi-square goodness-of-fit test or a t-test.

  • On another occasion I came across the interpretation of a null hypothesis probability of 0.99 as an indicator of a good fit; well, that is overfitting. Not only a very small null hypothesis probability but also one close to unity should raise a flag for caution, because the latter indicates you are overdoing it (too many free parameters, for example).
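Two small sketches related to the quotes above. First, for the correlation bullet (Tucci et al.): assuming the p-value comes from the usual t-test for a Pearson correlation coefficient (the paper does not say which statistic was used, and the sample sizes here are invented), the same weak r = 0.2 crosses the 0.05 threshold once the sample is large enough:

```python
import numpy as np
from scipy import stats

# p-value for testing r = 0 via t = r * sqrt((n-2) / (1-r^2));
# the same weak correlation becomes "significant" once n is large.
def corr_pvalue(r, n):
    t = r * np.sqrt((n - 2) / (1.0 - r**2))
    return 2 * stats.t.sf(abs(t), df=n - 2)   # two-sided

for n in (20, 100, 1000):                      # hypothetical sample sizes
    print(f"r=0.2, n={n:4d}: p = {corr_pvalue(0.2, n):.4f}")
# r = 0.2 is rejected at the 0.05 level for n >= ~100
```

Second, for the F-test bullet: my reading of the XSPEC F-test page linked above is that it compares two nested chi-square fits as below; the fit statistics here are invented for illustration:

```python
from scipy.stats import f

# XSPEC-style F-test for an added model component (nested models):
# F = [(chisq1 - chisq2) / (dof1 - dof2)] / (chisq2 / dof2)
def ftest_prob(chisq1, dof1, chisq2, dof2):
    F = ((chisq1 - chisq2) / (dof1 - dof2)) / (chisq2 / dof2)
    return f.sf(F, dof1 - dof2, dof2)  # null: the extra component is not needed

print(ftest_prob(520.0, 480, 500.0, 478))   # hypothetical fit statistics
```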

Some residual ambiguity remains after deducing the definition of the null hypothesis probability by playing with XSPEC and collecting cases of how it is used in the literature. Authors sometimes add creative comments to interpret the null hypothesis probability from their data analysis, which I cannot understand without statistical imagination. Most can perhaps be overlooked; or rather, they should be addressed by astronomers with statistical knowledge who can resolve my confusion about the null hypothesis probability. I would welcome comments on how to view these quotes with statistical rigor. The list is personal: there were more whose points I really didn't understand, but many articles were straightforward in using null hypothesis probabilities as p-values for statistical inference under a simple null hypothesis. I listed only those that gave me a strong first impression, most of which I couldn't turn into statistical caricatures. Eventually, I hope some astronomers will straighten out the meaning and usage of the null hypothesis probability without overruling the basics of statistics.

I only want to add a caution about using the reduced chi-square as a model selection criterion. A reduced chi^2 close to unity indicates a good fit only when the grouped data are independent, so that the usual formula for the degrees of freedom (roughly, the number of groups minus the number of free parameters) is valid. Personally, I doubt this holds in spectral fitting, where one cannot expect independence between neighboring bins: given a source model and the total counts, the counts in two neighboring groups are correlated. Grouping rules like counts>25 or S/N>3 do not guarantee the independence assumption of the chi-square goodness-of-fit test, although they may suffice for the Gaussian approximation. Statisticians have devised various penalty terms and regularization methods for model selection to suit different data types. One approach is to compute proper degrees of freedom, called effective degrees of freedom, instead of n-p, to reflect the correlation across groups induced by the chosen source model and the calibration information. With a large number of counts or a large number of groups, unless properly penalized, the chi-square fit is less likely to reject the null hypothesis than a statistic with smaller degrees of freedom, because of the curse of dimensionality; a small simulation below illustrates how correlated groups break the nominal calibration.
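To see why correlated groups matter, here is a toy simulation, not a model of real spectra: standardized Gaussian "residuals" with AR(1) neighbor correlation still have unit variance in each group, yet the sum of their squares is no longer χ2(n)-distributed, so the reported null hypothesis probability is miscalibrated:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, rho, nsim = 200, 0.5, 20000   # groups, neighbor correlation, simulations

# AR(1)-correlated standardized "residuals": each z_i is N(0,1) marginally,
# but neighbors are correlated, unlike the iid case assumed by chi2(n).
e = rng.standard_normal((nsim, n))
z = np.empty_like(e)
z[:, 0] = e[:, 0]
for i in range(1, n):
    z[:, i] = rho * z[:, i - 1] + np.sqrt(1.0 - rho**2) * e[:, i]

stat = (z**2).sum(axis=1)                      # the "chi-square" statistic
rate = np.mean(stat > chi2.ppf(0.95, df=n))    # rejections at nominal 5%
print(f"actual rejection rate at nominal 5%: {rate:.3f}")
# lands well away from 0.05: once groups are correlated, the chi2(n)
# reference, and hence the quoted null hypothesis probability, is wrong
```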

  1. Mann, H. B. and Wald, A. (1942), “On the Choice of the Number of Class Intervals in the Application of the Chi-square Test,” The Annals of Mathematical Statistics, vol. 13, pp. 306-317.
4754 d.f. (Tue, 17 Mar 2009, by hlee)
http://hea-www.harvard.edu/AstroStat/slog/2009/4754-df/

I couldn't believe my eyes when I saw 4754 degrees of freedom (d.f.) with a chi-square test statistic of 4859. I've seen large degrees of freedom often enough in astronomy journals, several hundred to a few thousand, but I never felt comfortable with such big numbers. Then, with a great shock, 4754 d.f. appeared. I had to find out why these huge degrees of freedom bother me so much.

When I was learning statistics, I never confronted such huge degrees of freedom. Given that only a small amount of class time is spent on the chi-square goodness-of-fit test, that the chi-square distribution is a special case of the gamma distribution, and that statisticians do not handle a hundred thousand photons from X-ray telescopes (there are many more low-count spectra; I'll explain later why I chose this big number), almost surely no statistician would confront such huge degrees of freedom.

Degrees of freedom in spectral fitting are the combined result of binning (grouping into n classes) and the number of free parameters (p), i.e. n-p-1. The parameters of interest, the targets to be optimized or solved for, come from physical source models, which are determined by the laws of physics. There is nothing to discuss about these source models from the statistical point of view, except on the model selection and assessment side, which seems an almost unexplored area. On the other hand, I'd like to know more about binning and the degrees of freedom that follow from it.

A few binning schemes I often see in spectral analysis are requiring more than 25 counts per bin (the same notion as the 30 in statistics for the CLT, or the last row in a t-table) or requiring the counts in each bin to satisfy a certain signal-to-noise (S/N) level. The latter is equivalent to requiring sqrt(expected counts) to exceed the given S/N level, since photon counts are Poisson distributed (a sketch of such grouping follows below). There are more sophisticated adaptive binning strategies, but I haven't found mathematical, statistical, or computational-algorithmic justifications for them. They look to me like empirical procedures discovered after many trials and errors on particular types of spectra (I often wonder whether I could reproduce the same goodness-of-fit results with the same ObsIDs as reported in those publications). The point is that, simple or complex, if someone has a data file with a large number of photons, n is generally larger than for observations with sparse photons. That is why I happen to see, in some papers, degrees of freedom inconceivable to a statistician, like 4754.
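A minimal sketch of the simple grouping rules above; the greedy scheme and the toy Poisson spectrum are mine, not from any analysis package:

```python
import numpy as np

def group_min_counts(counts, min_counts=25):
    """Greedy left-to-right grouping: merge adjacent channels until each
    group holds at least min_counts counts (the >25-counts rule above;
    an S/N >= 3 rule corresponds to min_counts = 9 for pure Poisson
    counts, since S/N = counts / sqrt(counts) = sqrt(counts))."""
    groups, acc = [], 0
    for c in counts:
        acc += c
        if acc >= min_counts:
            groups.append(acc)
            acc = 0
    if acc and groups:              # fold any leftover into the last group
        groups[-1] += acc
    return np.asarray(groups)

rng = np.random.default_rng(1)
channels = rng.poisson(lam=4.0, size=1024)   # toy low-count spectrum
grouped = group_min_counts(channels)
print(f"{channels.size} channels -> {grouped.size} groups, min = {grouped.min()}")
```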

First, the chi-square goodness-of-fit test was designed for agricultural data (or biology, considering Pearson's eugenics), where the sample size is not on a scale of scores of thousands. Please note that what astronomers call a bin is called a cell (class, interval, partition) in statistical papers and books presenting applications of the chi-square goodness-of-fit test.

I also like to point out that the chi-square goodness-of-fit test is different from chi-square minimization, even though they share the same equation: the former is for hypothesis testing and the latter is for optimization (the best-fit solution). Using the same data for optimization and testing introduces bias; that is one reason why, with a large number of data points, cross-validation techniques are employed in statistics and machine learning[1]. Since I consider binning a form of smoothing, the optimal number of bins and their sizes depend on the data quality and the source model properties, as in kernel density estimation, and the same goes for the various versions of chi-square tests and distance-based nonparametric tests (the K-S test, for example).

Although published many decades ago, you might want to check out this paper for a proper rule of thumb on the number of bins:
“On the choice of the number of class intervals in the application of the chi square test” (JSTOR link) by Mann and Wald, The Annals of Mathematical Statistics, Vol. 13, No. 3 (Sep., 1942), pp. 306-317. They showed that the number of classes should grow proportionally to N^(2/5) (the underlying idea of the chi-square goodness-of-fit test, a detailed derivation, and the exact equation for the number of classes are given there), and this is why I chose a spectrum of 10^5 photons at the beginning: ignoring the other factors in their equation, 10^5 counts yield roughly 100 bins. About 4000 bins would imply more than a billion photons, an unthinkable number in X-ray spectral analysis. Furthermore, many reports have said that Mann and Wald's criterion produces too many bins and a loss of power, so n should if anything be smaller than 100 for 10^5 photons.
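Keeping only the proportionality k ∝ N^(2/5) and normalizing so that 10^5 counts give about 100 bins, as in the text (the exact constant in Mann and Wald's formula is dropped here), the rule can be inverted to see what thousands of bins would demand:

```python
# Rule of thumb from Mann & Wald (1942): number of classes k ~ N^(2/5),
# normalized so that N = 1e5 -> k ~ 100 as in the text; the exact
# constant in their formula is ignored in this sketch.
def mann_wald_bins(n_counts):
    return n_counts ** 0.4

def counts_needed(k_bins):
    return k_bins ** 2.5           # inverse of the rule above

print(f"1e5 counts -> ~{mann_wald_bins(1e5):.0f} bins")
print(f"4754 bins  -> ~{counts_needed(4754):.2e} counts")  # ~1.6e9 photons
```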

The other issue with statistical analysis of X-ray spectra is that, although the photons in each channel/bin can be treated as an independent sample, the expected numbers of photons across bins are related via the physical source model, or a so-called link function, to borrow a term from generalized linear models. However, the well-studied link functions in statistics do not match source models in high-energy astrophysics. Typically, source models are not analytic: they are nonlinear, numerical, tabulated, or black-box types that are incompatible with the current link functions of the generalized linear model, a well developed, diverse, and robust subject in statistics for inference problems. Therefore, binning the data and minimizing chi-square seems to be the only strategy so far for statistical inference about the parameters of source models (for some "specific" statistical or physical models this is not true, but that is not the topic of this discussion). Moreover, Mann and Wald's method for the number of classes assumes equiprobable bins, whereas channel or bin probabilities in astronomy do not satisfy that condition: the probability vector of the multinomial distribution depends on the binning, the detector sensitivity, and the source model, instead of obeying the equiprobable constraint. It is hard to devise a purely statistically optimal binning/grouping method for X-ray spectral analysis.

Instead of group-by-group smoothing (S/N>3 grouping, for example), I nevertheless wish for binning/grouping schemes based on the total sample size N, particularly when N is large. I'm afraid that, with the chi-square test currently embedded in data analysis packages, the power of the chi-square statistic is so small that one will almost always obtain a good reduced chi-square value (astronomers' simple model assessment tool: the chi-square statistic divided by the degrees of freedom, whose expected value is one; if the reduced chi-square is close to one, the chosen source model and parameter solution are declared the best-fit model and values). The problem of a suitable number of bins is equivalent to the optimal bandwidth problem in kernel density estimation, whose objective is to accentuate information via smoothing; methodology developed for kernel density estimation may therefore suggest how to bin/group a spectrum while preserving most of the information and increasing efficiency (some sample-size-driven bin rules are sketched below). A modified strategy for binning, and for applying the chi-square statistic to assess model adequacy, should be conceived instead of reporting thousands of degrees of freedom.
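On the "binning as smoothing" analogy, numpy already ships sample-size-driven bin rules borrowed from the density estimation literature; this is only to show that such N-driven rules exist, not that they are appropriate for spectra folded through instrument responses:

```python
import numpy as np

rng = np.random.default_rng(2)
energies = rng.exponential(scale=2.0, size=100_000)   # toy "photon energies"

# Bin-count rules driven by the sample size and spread, not per-bin S/N.
for rule in ("sturges", "scott", "fd"):
    edges = np.histogram_bin_edges(energies, bins=rule)
    print(f"{rule:8s}: {edges.size - 1} bins")
```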

I think I must quit before getting too bored. I'd only like to mention some quite interesting papers that cite Mann and Wald (1942) and explore the chi-square goodness of fit, including Johnson's A Bayesian chi-square test for Goodness-of-Fit (linked to the arXiv PDF), which might appeal to astronomers who would like to modify their chi-square methods in a Bayesian way. The chapter "On the Use and Misuse of Chi-Square" (link to a Google Books excerpt) by K. L. Delucchi in A Handbook for Data Analysis in the Behavioral Sciences (1993) also reads quite intriguingly, although the discussion is a reminder aimed at behavioral scientists.

Lastly, I'm quite sure that astronomers have explored the properties of the chi-square statistic and chi-square-type tests with their own data sets. I admit I didn't mount an expedition for such works, since they are a few needles in a haystack. I'll be delighted to see an astronomers' version of "the use and misuse of chi-square," or a statistical account of whether the chi-square test with huge degrees of freedom retains enough power; any advice on the matter will be much appreciated.

  1. A rough sketch of cross-validation: split the data into a training set and a test set; obtain the best fit from the training set and evaluate the goodness of fit of that best fit on the test set; then swap the training and test sets and repeat (a minimal version is sketched below). wiki:cross_validation
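A minimal two-fold version of this sketch, with a toy linear model standing in for a source model (everything here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 200)
y = 2.5 * x + 1.0 + rng.normal(scale=2.0, size=x.size)   # toy data

# Two-fold cross-validation: fit on one half, score on the other, then swap.
idx = rng.permutation(x.size)
halves = (idx[: x.size // 2], idx[x.size // 2 :])
scores = []
for train, test in (halves, halves[::-1]):
    coef = np.polyfit(x[train], y[train], deg=1)     # "best fit" on training set
    resid = y[test] - np.polyval(coef, x[test])      # evaluate on held-out set
    scores.append(np.mean(resid**2))
print("held-out mean squared errors:", scores)
```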
[ArXiv] Data-Driven Goodness-of-Fit Tests, Aug. 1, 2007 (Fri, 17 Aug 2007, by hlee)
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-data-driven-goodness-of-fit-tests/

From arxiv/math.st:0708.0169v1
Data-Driven Goodness-of-Fit Tests by L. Mikhail

Goodness-of-fit tests have been essential in astronomy for validating a chosen physical model against observed data, whereas the limitations of these tests have not been considered carefully when the same observed data were also used to estimate the model parameters. I therefore thought this paper would be helpful for reflecting on the different points of view between astronomers' practice of goodness-of-fit tests and statisticians' construction of them. (Warning: the paper is abstract and theoretical.)

The paper begins by presenting two approaches to constructing test statistics: 1. some measure of distance between the theoretical and empirical distributions, like the Cramér-von Mises and Kolmogorov-Smirnov statistics, and 2. score test statistics, constructed so that the test is asymptotically normal. Since the second approach is preferred, the author confines his study to generalizing the theory of score tests. The notion of the Neyman-type (NT) test is introduced under very minimal assumptions on the statistics.

The author discusses the statistical inverse problems, or deconvolution problems, of physics, seismology, optics, and imaging, where noisy signals and measurements occur. These inverse problems induce Neyman-type statistics under appropriate regularity assumptions.

Other types of NT tests, in terms of score functions and their consistency, are presented in an abstract fashion.
