Curious Cases of the Null Hypothesis Probability

Although I traced astronomers’ casual use of the “null hypothesis probability” to the habit of reporting output from their data analysis package of choice, there were still some curious cases of the null hypothesis probability that I couldn’t resolve. They are quite mysterious to me. Sometimes too much creativity harms the original intention. Here are some examples.

A full-text search in ADS for “null hypothesis probability” yields 77 related articles (link removed; search-result URLs are probably not stable). Many of them contain the phrase “null hypothesis probability” as is. The rest use it in the context of “given the null hypothesis, the probability of …”. I’m not sure whether this ADS search covers occurrences in tables and captions, so more than 77 may exist. The majority of articles with the “null hypothesis probability” just report numbers from the screen output of the chosen data analysis system. Discussions and interpretations of these numbers focus more on a reduced χ² close to ONE, astronomers’ most favored model selection criterion. Sometimes I got confused about the goal of their fitting analysis, because the driving force is to “make the reduced chi-square close to one and make residuals look good”. Instead of being used for statistical inference and as a measure, the statistic works as an objective function. The numerical (chi-square) and pictorial (residuals) criteria overshadow the fundamentals: one observes a relatively low number of photons under a Poisson distribution, and those photons are convolved with complicated instruments. Even when the reduced chi-square is off from unity, possibly because the statistical uncertainty is underestimated, one can still say, based on robust statistics, that the model is a good fit.

Rather than discussing the chi-square method itself, one thing I want to point out from this “null hypothesis probability” investigation is a marked difference in presentation style and field between papers that use “the null hypothesis probability” (spectral model fitting) and papers that use “given the null hypothesis, the probability of …” (cosmology). Beyond this casual, personal finding about style, the following quotes left me in despair because I couldn’t find answers in statistics.

  • MNRAS, v.340, pp.1261-1268 (2003): The temperature and distribution of gas in CL 0016+16 measured with XMM-Newton (Worrall and Birkinshaw)

    With reduced chi square of 1.09 (chi-sq=859 for 786 d.o.f) the null hypothesis probability is 4 percent but this is likely to result from the high statistical precision of the data coupled with small remaining systematic calibration uncertainties

    I couldn’t understand why a p-value of 0.04 is associated with “high statistical precision of the data coupled with small remaining systematic calibration uncertainties”. Is it a polite way of saying that the chi-square method is wrong because of systematic uncertainty? Or does it mean the statistical uncertainty is underestimated because of its correlation with the systematic uncertainty? Or does the null hypothesis probability have some philosophical meaning beyond that of a p-value? Or … I could go on with strange questions because of the statistical ambiguity of the statement. I’d appreciate any explanation of how the p-value (the null hypothesis probability) is associated with the subsequent interpretation.

    Another miscellaneous question: if the number (the null hypothesis probability) from a software package is unfavorable or uninterpretable, can we attribute such ambiguity to systematic error?

  • MNRAS, v. 345(2),pp.423-428 (2003): Iron K features in the hard X-ray XMM-Newton spectrum of NGC 4151 (Schurch, Warwick, Griffiths, and Sembay)
    The result of these modifications was a significantly improved fit (chi-sq=4859 for 4754 d.o.f). The model fit to the data is shown in Fig. 3 and the best-fitting parameter values for this revised model are listed as Model 2 in Table 1. The null hypothesis probability of this latter model (0.14) indicates that this is a reasonable representation of the spectral data to within the limits of the instrument calibration.

    What is the rule-of-thumb interpretation of p-values, or of this null hypothesis probability, in astronomy? How does one know that the fit is reasonable, as the authors state? How does one know the limits of the instrument calibration and compare against them quantitatively? And how about the degrees of freedom? Some thousands! So large. Even with a million photons, according to the guideline for the number of bins[1], I suspect that using the chi-square goodness-of-fit test for data with such large degrees of freedom makes the test too conservative. Also, there should be a distinction between the chi-square minimization tactic and the chi-square goodness-of-fit test. Using the same data for both procedures will introduce bias.

  • MNRAS, v. 354, pp.10-24 (2004): Comparing the temperatures of galaxy clusters from hydrodynamical N-body simulations to Chandra and XMM-Newton observations (Mazzotta, Rasia, Moscardini, and Tormen)

    In particular, solid and dashed histograms refer to the fits for which the null hypothesis has a probability >5 percent (statistically acceptable fit) or <5 percent (statistically unacceptable fit), respectively. We also notice that the reduced chi square is always very close to unity, except in a few cases where the lower temperature components is at T~2keV, …

    The last statement further obscures what “statistically (un)acceptable fit” really means. The notion of how well a model fits data, and of how to test such a hypothesis, seems to differ between statistics and astronomy.

  • MNRAS, v.346(4),pp.1231-1241: X-ray and ultraviolet observations of the dwarf nova VW Hyi in quiescence (Pandel, Córdova, and Howell)

    As can be seen in the null hypothesis probabilities, the cemekl model is in very good agreement with the data.

    The null hypothesis probabilities computed in their Table 1 are 8.4, 25.7, 42.2, 1.6, 0.7*, and 13.1 percent (* is the result for the MKCFLOW model; the rest are for the CEMEKL model). Probably the criterion for rejecting a fit is a p-value below 0.01, so that the CEMEKL model cannot be rejected but the MKCFLOW model can. A single MKCFLOW fit, which happened to give a small p-value, is used to say that MKCFLOW is not in agreement with the data while the other choice, CEMEKL, is a good model. This is too simplified a model selection/assessment procedure. I wonder why CEMEKL was tried with various settings but MKCFLOW only once. I guess there is an astrophysical reason for carrying out the model comparison this way, but statistically speaking it looks like comparing measurements of five different kinds of oranges and one apple taken with the same ruler (the null hypothesis probability from the chi-square fitting). From the experimental design viewpoint, this is not a well-designed study.

  • MNRAS, 349, 1267 (2004): Predictions on the high-frequency polarization properties of extragalactic radio sources and implications for polarization measurements of the cosmic microwave background (Tucci et al.)

    The correlation is less clear in the samples at higher frequencies (r~ 0.2 and a null-hypothesis probability of >10^{-2}). However, these results are probably affected by the variability of sources, because we are comparing data taken at different epochs. A strong correlation (r>0.5 and a null-hypothesis probability of <10^{-4}) between 5 and 43 GHz is found for the VLA calibrators, the polarization of which is measured simultaneously at all frequencies.

    I wonder what test statistic was used to compute those p-values, and whether they truly meant a p-value > 0.01. At this level, most tools report a more precise number, which would allow a proper statement. The p-value (or the “null hypothesis probability”) is for testing whether r=0 or not. Even if r is small, say 0.2, one can still reject the null hypothesis if the threshold is 0.05. Therefore, “>10^{-2}” only adds ambiguity. I think point estimates are enough to report the existence of weak and rather strong correlations. Otherwise, reporting both p-values and powers seems more appropriate.

  • A&A, 342, 502 (1999): X-ray spectroscopy of the active dM stars: AD Leo and EV Lac
    (S. Sciortino, A. Maggio, F. Favata and S. Orlando)

    This fit yields a value of chi square of 185.1 with 145 ν corresponding to a null-hypothesis probability of 1.4% to give an adequate description of the AD Leo coronal spectrum. In other words the adopted model does not give an acceptable description of available data. The analysis of the uncertainties of the best-fit parameters yields the 90% confidence intervals summarized in Table 5, together with the best-fit parameters. The confidence intervals show that we can only set a lower-limit to the value of the high-temperature. In order to obtain an acceptable fit we have added a third thermal MEKAL component and have repeated the fit leaving the metallicity free to vary. The resulting best-fit model is shown in Fig. 7. The fit formally converges with a value of chi square of 163.0 for 145 ν corresponding to a probability level of ~ 9.0%, but with the hotter component having a “best-fit” value of temperature extremely high (and unrealistic) and essentially unconstrained, as it is shown by the chi square contours in Fig. 8. In summary, the available data constrain the value of metallicity to be lower than solar, and they require the presence of a hot component whose temperature can only be stated to be higher than log (T) = 8.13. Available data do not allow us to discriminate between the (assumed) thermal and a non-thermal nature of this hot component.
    …The fit yields a value of [FORMULA] of 95.2 (for 78 degree of freedom) that corresponds to a null hypothesis probability of 2.9%, i.e. a marginally acceptable fit. The limited statistic of the available spectra does not allow us to attempt a fit with a more complex model.

    After adding a MEKAL component, why do the degrees of freedom remain the same? Also, what do they mean by “the limited statistic of the available spectra”?

  • MNRAS, 348, 529 (2004): Powerful, obscured active galactic nuclei among X-ray hard, optically dim serendipitous Chandra sources (Gandhi, Crawford, Fabian, Johnstone)

    …, but a low f-test probability for this suggests that we cannot constrain the width with the current data.
    While the rest frame equivalent width of the line is close to 1keV, its significance is marginal (f-test gives a null hypothesis probability of 0.1).

    Without a contingency table or a comparison of models, I was not sure how they carried out the F-test. I could not find the two degrees of freedom for the F-test. XSPEC’s account of the F-test involves two degrees of freedom; without them, no probability can be computed. Their usage of the F-test seems unconventional. The conventional application of the F-test is for comparing the effects of multiple treatments (different levels of drug dosage, including placebo); otherwise, it’s just a chi-square goodness-of-fit test or a t-test.

  • Another case I came across is interpreting a null hypothesis probability of 0.99 as an indicator of a good fit; well, it’s overfitting. Not only a null hypothesis probability that is too small but also one that is close to one should raise a flag, because the latter indicates that you are overdoing it (too many free parameters, for example).
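
As a cross-check on the quotes above, the “null hypothesis probability” can be recomputed from each quoted chi-square and number of degrees of freedom. The following is only a minimal sketch: it assumes the number is the upper-tail probability of a chi-square distribution (the p-value of the goodness-of-fit test), it relies on SciPy’s `chi2.sf`, and the function name and fit labels are hypothetical, not taken from any of the papers.

```python
from scipy.stats import chi2


def null_hypothesis_probability(chisq, dof):
    """Upper-tail probability P(X >= chisq) for X ~ chi-square(dof)."""
    return chi2.sf(chisq, dof)


# (label, chi-square, degrees of freedom) as quoted in the excerpts above.
# The labels are only shorthand for this sketch.
quoted_fits = [
    ("CL 0016+16", 859.0, 786),
    ("NGC 4151, Model 2", 4859.0, 4754),
    ("AD Leo, first fit", 185.1, 145),
    ("AD Leo, third component", 163.0, 145),
    ("A&A 342, 502, 78 d.o.f. fit", 95.2, 78),
]

for label, chisq, dof in quoted_fits:
    p = null_hypothesis_probability(chisq, dof)
    # Print the reduced chi-square next to the p-value for comparison.
    print(f"{label:30s} chi2/dof = {chisq / dof:5.2f}   p = {p:.3f}")
```

Comparing the printed values with the probabilities reported in the excerpts is a quick way to see whether a quoted chi-square, number of degrees of freedom, and probability are mutually consistent.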

Some residual ambiguity remains after deducing the definition of the null hypothesis probability by playing with XSPEC and finding cases of how it is used in the literature. Authors sometimes add creative comments to interpret the null hypothesis probability from their data analysis, which I cannot understand without statistical imagination. Most can be overlooked, perhaps. Or rather, they should be addressed to astronomers with statistical knowledge, who could resolve my confusion about the null hypothesis probability. I hope for comments from astronomers on how to view these quotes with statistical rigor. The list is personal. There are more whose points I really didn’t understand, but many papers were straightforward in using null hypothesis probabilities as p-values in statistical inference under a simple null hypothesis. I listed only some, to show my first impression of these quotes, from most of which I couldn’t draw a statistical caricature. Eventually, I hope some astronomers will straighten out the meaning and usage of the null hypothesis probability without overruling the basics of statistics.

I only want to add a caution about using the reduced chi-square as a model selection criterion. A reduced chi^2 close to unity indicates a good fit only when the grouped data are independent, so that the formula for the degrees of freedom (roughly, the number of groups minus the number of free parameters) is valid. Personally, I doubt that this rule applies in spectral fitting, where one cannot expect independence between two neighboring bins. In other words, given a source model and given total counts, two neighboring observations (counts in two groups) are correlated. Grouping rules like >25 or S/N>3 do not guarantee the independence assumption of the chi-square goodness-of-fit test, although they may be sufficient for the Gaussian approximation. Statisticians have devised various penalty terms and regularization methods for model selection that suit different data types. One approach is to compute proper degrees of freedom, called effective degrees of freedom, instead of n-p, to reflect the correlation across groups induced by the chosen source model and calibration information. With a large number of counts or a large number of groups, unless properly penalized, the chi-square fit is likely to find it harder to reject the null hypothesis than a statistic with smaller degrees of freedom, because of the curse of dimensionality.
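
The role of the independence assumption can be probed with a toy simulation. The sketch below is not a spectral-fitting simulation: it assumes unit-variance Gaussian residuals with no fitted parameters, and it induces correlation between neighboring bins by averaging adjacent noise draws, which is an arbitrary choice made only for illustration. All names and defaults are hypothetical.

```python
import numpy as np
from scipy.stats import chi2


def rejection_rate(n_bins=100, n_sims=20000, correlated=True,
                   alpha=0.05, seed=0):
    """Fraction of simulated data sets whose nominal GoF p-value is < alpha."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_sims, n_bins + 1))
    if correlated:
        # Average adjacent draws; dividing by sqrt(2) keeps unit variance
        # per bin while making neighboring bins correlated.
        resid = (noise[:, :-1] + noise[:, 1:]) / np.sqrt(2.0)
    else:
        resid = noise[:, :-1]
    chisq = np.sum(resid**2, axis=1)
    # Nominal degrees of freedom: no parameters are fitted, so dof = n_bins.
    p_values = chi2.sf(chisq, n_bins)
    return float(np.mean(p_values < alpha))


print("independent bins:", rejection_rate(correlated=False))
print("correlated bins: ", rejection_rate(correlated=True))
```

If the nominal degrees of freedom were appropriate, both rates would sit near the chosen `alpha`; a rate that drifts away from it in the correlated case indicates that the nominal p-values are no longer calibrated.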

  1. Mann and Wald (1942), “On the Choice of the Number of Class Intervals in the Application of the Chi-square Test”, Annals of Math. Stat. vol. 13, pp.306-7.
  1. Simon Vaughan:

    As usual, it is possible I have missed the point of this post, but I can try to clear up a few misunderstandings.
    The phrase “null hypothesis probability” is used in XSPEC to describe one of its main outputs: the p-value from a chi-square goodness of fit test. I suspect that this slightly misleading phrase is then copied blindly into papers by people who pay little attention to statistical terminology. Is this your conclusion too?

    * “high statistical precision of the data coupled with small remaining systematic calibration uncertainties” This seems fairly straightforward to this astronomer. They got a low p-value in the GoF test, but suspect there may be small systematic errors (miscalibration) in the response, which will leave systematic residuals and hence a larger chi-square and smaller p-value (compared to zero systematic errors). It is fairly standard in X-ray astronomy to expect systematic errors at the <5% level. When the data are low quality (few counts) the ‘statistical’ error swamps the ‘systematic’ error (i.e. variance of sampling distribution dominates over bias). So, with reasonably well calibrated detectors, one can only use systematic errors to account for a ‘poor’ fit when the data quality is high. (Still, I would have thought p=0.04 was pretty good and would not have worried much about it).

    * “What is the rule of thumb interpretation of p-values or this null hypothesis probability” I guess it’s the same as p-values. These days people seem to prefer presenting the p-value from the GoF test rather than simply “accept”/“reject” from a hypothesis test with a fixed size (e.g. alpha = 0.05). I would agree that this is preferable. I’m not sure anyone uses a fixed p-value threshold for testing, but most people would be happy with 0.14 (from your example) and I suspect few of us would unquestioningly accept a fit that gave p=0.0001. But if the p-value is presented the reader is free to make their own assessment.

    * “there should be distinction between the chi square minimization tactic and the chi square goodness of fit test. Using same data for both procedures will introduce bias.” I don’t follow this. Surely the chi-square GoF test is well calibrated — in the sense of returning uniformly distributed p-values when the null hypothesis is true — when the degrees of freedom is N-M (data bins – free parameters). The extra -M term accounts for the bias that would be introduced by minimising the fit statistic with respect to the M parameters. I don’t see why there will be a bias.

    * “the reduced chi square is always very close to unity…” I agree this is a bit confusing. Many people use the rule-of-thumb that they should get “red. chi-square approx. 1” for a ‘good’ fit. This might be traced back at least as far as Bevington (e.g. p68, 3rd ed.), which is a standard book used by physicists/astronomers. Ok, it is a reasonable approximate guideline, but to do the job properly one needs the (absolute) chi-square and DOF value, and then performs the GoF test. With this done there’s nothing to be gained from employing the approximate guideline.

    * “a low f-test probability for this suggests that we cannot constrain the width” Again, this seems quite clear to this astronomer. They apparently fitted the data using two models: (1) including an emission line of fixed width (“simple narrow redshifted Gaussian line fit”) and (2) leaving the line width as a free parameter. They then used the F-test as a Likelihood Ratio Test (LRT) to assess whether there is any justification for free or fixed width. (If the data have N bins and the models have M_1 and M_2 free parameters, the two degrees of freedom are M_2-M_1 and N-M for the F-test, although this depends on the way that F is formulated.) From this I see no obvious objection; without more analysis it seems to satisfy the conditions outlined in Protassov et al. (2002). They could very well have simply measured the change in chi-square fit statistic between model (1) and (2), and then calculated a p-value from the tail area in the chi-square distribution with M_2-M_1 DOF. This would have avoided using F and been slightly closer to the LRT.
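
The two procedures described in the last point can be sketched as follows. This is only an illustration under stated assumptions: it uses one common formulation of the F statistic for nested models, with N-M_2 as the denominator degrees of freedom, and the chi-square values, bin count, and parameter counts are made-up numbers, not taken from the paper. The function names are hypothetical.

```python
from scipy.stats import chi2, f


def f_test_nested(chisq_1, chisq_2, n_bins, m_1, m_2):
    """F-test for nested models: model 2 has m_2 > m_1 free parameters."""
    dof_num = m_2 - m_1        # extra free parameters in model 2
    dof_den = n_bins - m_2     # assumed formulation: N - M_2
    f_stat = ((chisq_1 - chisq_2) / dof_num) / (chisq_2 / dof_den)
    return f_stat, f.sf(f_stat, dof_num, dof_den)


def delta_chisq_test(chisq_1, chisq_2, m_1, m_2):
    """Tail area of the chi-square change with m_2 - m_1 degrees of freedom."""
    delta = chisq_1 - chisq_2
    return delta, chi2.sf(delta, m_2 - m_1)


# Made-up example: freeing one extra parameter (e.g. a line width)
# lowers chi-square from 52.7 to 50.0 for 54 bins.
f_stat, p_f = f_test_nested(52.7, 50.0, n_bins=54, m_1=3, m_2=4)
delta, p_delta = delta_chisq_test(52.7, 50.0, m_1=3, m_2=4)
print(f"F = {f_stat:.2f}, p = {p_f:.3f}")
print(f"delta chi2 = {delta:.2f}, p = {p_delta:.3f}")
```

Both functions need the two parameter counts, and the F-test also needs the number of bins, which is why a reader cannot reproduce the quoted probability from the paper’s text alone.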

    06-02-2009, 4:18 am
  2. hlee:

    My understanding of the null hypothesis probability is the p-value from the chi-square goodness of fit test.

    But if the p-value is presented the reader is free to make their own assessment.
    Personally, I prefer your statement to discussing controversial p-value thresholds. I was worried that there might be something more between the notion of a p-value and the null hypothesis probability, since the latter is not used in the statistical literature. I sometimes see the null hypothesis probability interpreted as the quantified chance of observing the best fits (Bayesian), whereas it is the frequency with which one would observe such best fits if the experiment were conducted many times under the null hypothesis (frequentist).

    About the bias from using the same data for finding the best fit and for hypothesis testing, I’m planning to write a post, since it’s a very serious issue in model selection. I’m still in the process of collecting introductory references. Please check the Slog post titled “An example of chi2 bias in fitting the x-ray spectra”, or Figure 9 in Mighell, K.J. (1999), “Parameter Estimation in Astronomy with Poisson-Distributed Data. I. The \chi_{\gamma}^2 Statistic”, ApJ, 518, pp.380-393.

    My opinion is that the bias in those plots is caused by using the same data for estimation (computing best fits) and hypothesis testing (computing the null hypothesis probability). Devising chi-square methods to correct such visible biases seems doable but hasn’t been attempted. I haven’t laid out the problem in a formal fashion.

    Lastly, I appreciate that you explained how an F-test is carried out. Probably an explicit description of the F-test procedure is not necessary in the astronomical literature because it’s standardized. Not knowing this, I was looking for the hypothesis to be tested and the degrees of freedom. I expected the F-test in the ANOVA (analysis of variance) setting. According to your description, M_2-M_1=1 and the F-test is equivalent to a t-test of whether the width parameter equals a fixed value. Circling between the F-test, LRT, and t-test reminds me of the holy trinity of LRT, Wald test, and Rao score test, an irrelevant topic.

    Thanks for your comment. Whenever I cannot comprehend the usage of the null hypothesis probability and the interpretation of p-values in astronomical literature, I’ll come back and read this comment.

    06-02-2009, 8:09 pm
  3. Simon Vaughan:

    Like so many problems discussed in the Slog, this one is mostly a result of the language barrier between statisticians and astronomers. I was hardly aware of this a few years ago, but can now see a huge chasm between the two! If you use “formal” statistical terms in astronomical papers, you risk being misunderstood, ignored, or treated as “just some statistical technicality”. But if you use astronomical language you risk being misinterpreted by those who do know the proper statistical terms, as well as finding it more difficult to connect to work in other fields. It strikes me as something we can all work on: when presenting a statistical method, result, etc., we can use the “formal” terms but make sure it is explained so any astronomer unfamiliar with the term can get it. Even such a common term as “p-value” is missed by many astronomers because they’re so used to calling it by other names.

    With regard to the bias, I still don’t see the issue. I know there can be biases involved in least-squares fitting. One source of bias arises when the data are not Gaussian but Poissonian; however, this should decrease as the count/bin increases. Another source of bias is the choice of “sigma” term in the formula for chi-square: do you use the “data” errors, or the “model” errors, etc.? This is just a choice of which way to approximate the log[likelihood] function, but again the bias should be small for large datasets (but maybe the caveats to this need pointing out?). These are discussed elsewhere in Slog posts.

    But I don’t see that there’s an extra bias caused by using the same dataset for (i) estimation (finding best chi-square) and (ii) testing (p-value for GoF test). If one uses the Pearson GoF test with dof=N-M, this should account for the fact that M parameters were estimated in step (i) before the test of step (ii) was applied. (Just like using 1/(N-1) in the definition of sample variance to correct for the fact that the population mean is unknown and we had to use the sample mean as an estimate, i.e. one parameter had to be estimated from the data.) Maybe I should wait for your full post…
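
The dof=N-M argument can be checked numerically in a toy setting. The sketch below assumes Gaussian errors with known unit sigma and a straight-line model (M=2) fitted by least squares, a case where the result holds exactly; it says nothing about Poisson data or non-linear spectral models, where the result is only approximate. All names, the true line, and the simulation sizes are hypothetical choices.

```python
import numpy as np
from scipy.stats import chi2


def gof_pvalues(n_bins=50, n_sims=20000, seed=1):
    """Fit a straight line to each simulated data set, then run the GoF test.

    Returns p-values using dof = N - M and, for contrast, dof = N.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_bins)
    design = np.column_stack([np.ones(n_bins), x])    # M = 2 parameters
    truth = design @ np.array([10.0, 3.0])            # arbitrary true line
    data = truth + rng.standard_normal((n_sims, n_bins))  # sigma = 1
    # Step (i): least-squares estimates for every simulated data set.
    coeffs = data @ np.linalg.pinv(design).T
    resid = data - coeffs @ design.T
    chisq_min = np.sum(resid**2, axis=1)
    # Step (ii): GoF p-values from the same data.
    n_par = design.shape[1]
    return chi2.sf(chisq_min, n_bins - n_par), chi2.sf(chisq_min, n_bins)


p_corrected, p_uncorrected = gof_pvalues()
print("P(p < 0.05) with dof = N - M:", np.mean(p_corrected < 0.05))
print("P(p < 0.05) with dof = N:    ", np.mean(p_uncorrected < 0.05))
```

In this setting the first rate should sit near the nominal 0.05, while ignoring the M fitted parameters shifts it, which is the sense in which the -M term accounts for estimating from the same data.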

    06-09-2009, 3:00 am