Comments on: Curious Cases of the Null Hypothesis Probability
http://hea-www.harvard.edu/AstroStat/slog/2009/curious-cases-of-the-null-hypothesis-probability/

By: Simon Vaughan (Tue, 09 Jun 2009 08:00:56 +0000)
http://hea-www.harvard.edu/AstroStat/slog/2009/curious-cases-of-the-null-hypothesis-probability/comment-page-1/#comment-886

Like so many problems discussed in the Slog, this one is mostly a result of the language barrier between statisticians and astronomers. I was hardly aware of this a few years ago, but can now see a huge chasm between the two! If you use “formal” statistical terms in astronomical papers, you risk being misunderstood, ignored, or treated as “just some statistical technicality”. But if you use astronomical language you risk being misinterpreted by those who do know the proper statistical terms, as well as finding it more difficult to connect to work in other fields. It strikes me as something we can all work on: when presenting a statistical method, result, etc., we can use the “formal” terms but make sure it is explained so any astronomer unfamiliar with the term can get it.
Even such a common term as “p-value” is missed by many astronomers because they’re so used to calling it by other names.

With regard to the bias, I still don’t see the issue. I know there can be biases involved in least-squares fitting. One source of bias arises when the data are not Gaussian but Poissonian; however, this should decrease as the count/bin increases. Another source of bias is the choice of “sigma” term in the formula for chi-square: do you use the “data” errors, or the “model” errors, etc.? This is just a choice of which way to approximate the log[likelihood] function, but again the bias should be small for large datasets (but maybe the caveats to this need pointing out?). These are discussed elsewhere in Slog posts.
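
As a minimal simulation sketch of that first point (the constant-rate model, counts per bin, and seed below are illustrative choices of the editor, not anything from the original discussion), one can fit a constant level to simulated Poisson counts by minimizing chi-square with the usual “data” errors and watch the bias shrink relative to the mean as the counts per bin grow:

```python
# Illustrative sketch: bias of chi-square fitting with "data" errors on Poisson
# counts, shrinking (relative to the mean) as the counts per bin increase.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)

def chisq_fit_constant(counts):
    """Minimize sum((n_i - mu)^2 / sigma_i^2) with sigma_i^2 = max(n_i, 1)."""
    sigma2 = np.maximum(counts, 1.0)
    res = minimize_scalar(lambda mu: np.sum((counts - mu) ** 2 / sigma2),
                          bounds=(1e-3, counts.max() + 10.0), method="bounded")
    return res.x

nbins, nsim = 100, 2000
for true_mu in (2.0, 10.0, 50.0):
    fits = [chisq_fit_constant(rng.poisson(true_mu, nbins)) for _ in range(nsim)]
    bias = np.mean(fits) - true_mu
    print(f"true mu = {true_mu:5.1f}   mean fitted mu = {np.mean(fits):6.2f}   bias = {bias:+.2f}")
```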

But I don’t see that there’s an extra bias caused by using the same dataset for (i) estimation (finding best chi-square) and (ii) testing (p-value for GoF test). If one uses the Pearson GoF test with dof=N-M, this should account for the fact that M parameters were estimated in step (i) before the test of step (ii) was applied. (Just like using 1/(N-1) in the definition of sample variance to correct for the fact that the population mean is unknown and we had to use the sample mean as an estimate, i.e. one parameter had to be estimated from the data.) Maybe I should wait for your full post…
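
A tiny simulation makes the 1/(N-1) analogy concrete; the sample size, variance, and number of trials below are arbitrary choices added for illustration, not part of the original comment:

```python
# Dividing by N-1 corrects the sample variance for the one parameter (the mean)
# estimated from the same data, just as dof = N-M corrects the GoF test for M
# fitted parameters.
import numpy as np

rng = np.random.default_rng(0)
true_var, n, nsim = 4.0, 10, 100_000
x = rng.normal(0.0, np.sqrt(true_var), size=(nsim, n))

dev2 = np.sum((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)
print("mean of sum(dev^2)/N     :", np.mean(dev2 / n))        # biased low (~3.6)
print("mean of sum(dev^2)/(N-1) :", np.mean(dev2 / (n - 1)))  # ~4.0, unbiased
```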

By: hlee (Wed, 03 Jun 2009 01:09:00 +0000)
http://hea-www.harvard.edu/AstroStat/slog/2009/curious-cases-of-the-null-hypothesis-probability/comment-page-1/#comment-885

My understanding of the null hypothesis probability is the p-value from the chi-square goodness of fit test.

“But if the p-value is presented the reader is free to make their own assessment.”
Personally I prefer your statement to discussing controversial p-value thresholds. I was worried that there might be something more to the relation between the notion of a p-value and the null hypothesis probability, since the latter term is not used in the statistical literature. I sometimes see the null hypothesis probability interpreted either as the quantified chance of observing the best fit (a Bayesian reading) or as the frequency with which one would observe such a best fit if the experiment were conducted many times under the null hypothesis (a frequentist reading).

About the bias from using the same data both for finding the best fit and for hypothesis testing, I’m planning to write a post since it’s a pressing issue in the topic of model selection. I’m still in the process of collecting introductory references. Please check the slog post titled
An example of chi2 bias in fitting the x-ray spectra (http://hea-www.harvard.edu/AstroStat/slog/2007/an-example-of-chi2-bias-in-fitting-the-x-ray-spectra/) or Figure 9 in Mighell, K.J. (1999) Parameter Estimation in Astronomy with Poisson-Distributed Data. I. The \chi_{\gamma}^2 Statistic, ApJ, 518, pp. 380-393 (http://adsabs.harvard.edu/abs/1999ApJ...518..380M).

My opinion is that the bias in those plots is caused by using the same data for estimation (computing the best fits) and for hypothesis testing (computing the null hypothesis probability). Devising chi-square methods to correct such visible biases seems doable but has not been attempted. I haven’t laid out the problem in a formal fashion.

Lastly, I appreciate that you explained how an F-test is executed. Probably an explicit description of the F-test procedure is unnecessary in the astronomical literature because it is standardized. Not knowing it, I was looking for the hypothesis being tested and the degrees of freedom; I had expected the F-test in the ANOVA (analysis of variance) setting. According to your description, M_2-M_1=1 and the F-test is equivalent to a t-test of whether the free width parameter equals a fixed value or not. Circling among the F-test, the LRT, and the t-test reminds me of the holy trinity of the LRT, the Wald test, and the Rao score test, though that is an unrelated topic.
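
For the M_2-M_1=1 case, the equivalence can be checked numerically: the square of a t variate with nu degrees of freedom follows an F(1, nu) distribution, so the two tests give identical p-values. The degrees of freedom and t values below are arbitrary illustrations added by the editor:

```python
# Check that an F-test with one numerator degree of freedom matches a two-sided
# t-test: P(F_{1,nu} > t^2) equals P(|T_nu| > |t|).
from scipy import stats

nu = 30          # denominator degrees of freedom, e.g. N - M_2 (illustrative)
for t in (1.0, 2.0, 3.0):
    p_f = stats.f.sf(t**2, 1, nu)
    p_t = 2 * stats.t.sf(t, nu)
    print(f"t = {t}:  F-test p = {p_f:.5f}   two-sided t-test p = {p_t:.5f}")
```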

Thanks for your comment. Whenever I cannot comprehend the usage of the null hypothesis probability and the interpretation of p-values in astronomical literature, I’ll come back and read this comment.

By: Simon Vaughan (Tue, 02 Jun 2009 09:18:15 +0000)
http://hea-www.harvard.edu/AstroStat/slog/2009/curious-cases-of-the-null-hypothesis-probability/comment-page-1/#comment-884

As usual, it is possible I have missed the point of this post, but I can try to clear up a few misunderstandings.
The phrase “null hypothesis probability” is used in XSPEC to describe one of its main outputs: the p-value from a chi-square goodness of fit test. I suspect that this slightly misleading phrase is then copied blindly into papers by people who pay little attention to statistical terminology. Is this your conclusion too?

* “high statistical precision of the data coupled with small remaining systematic calibration uncertainties” This seems fairly straightforward to this astronomer. They got a low p-value in the GoF test, but suspect there may be small systematic errors (miscalibration) in the response, which will leave systematic residuals and hence a larger chi-square and smaller p-value (compared to zero systematic errors). It is fairly standard in X-ray astronomy to expect systematic errors at the <5% level. When the data are low quality (few counts) the ‘statistical’ error swamps the ‘systematic’ error (i.e. the variance of the sampling distribution dominates over the bias). So, with reasonably well calibrated detectors, one can only use systematic errors to account for a ‘poor’ fit when the data quality is high. (Still, I would have thought p=0.04 was pretty good and would not have worried much about it.)
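
A toy calculation of that trade-off, with made-up numbers rather than anything from the paper in question: adding a few per cent systematic error in quadrature to the statistical errors rescales chi-square and can turn a marginal p-value into an unremarkable one, but only when the statistical errors are comparably small.

```python
# Illustrative only: effect of a fractional systematic error added in quadrature,
# assuming the best-fit residuals are roughly unchanged by the re-weighting.
import numpy as np
from scipy import stats

chisq = 1100.0          # hypothetical best-fit chi-square
dof = 1000              # hypothetical degrees of freedom
print("no systematics:   p =", stats.chi2.sf(chisq, dof))

stat_frac, sys_frac = 0.05, 0.03                       # 5% statistical, 3% systematic
scale = stat_frac**2 / (stat_frac**2 + sys_frac**2)    # chi-square shrinks by this factor
print("with systematics: p =", stats.chi2.sf(chisq * scale, dof))
```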

* “What is the rule of thumb interpretation of p-values or this null hypothesis probability” I guess it’s the same as p-values. These days people seem to prefer presenting the p-value from the GoF test rather than simply “accept”/“reject” from a hypothesis test with a fixed size (e.g. alpha = 0.05). I would agree that this is preferable. I’m not sure anyone uses a fixed p-value threshold for testing, but most people would be happy with 0.14 (from your example) and I suspect few of us would unquestioningly accept a fit that gave p=0.0001. But if the p-value is presented the reader is free to make their own assessment.

* “there should be distinction between the chi square minimization tactic and the chi square goodness of fit test. Using same data for both procedures will introduce bias.” I don’t follow this. Surely the chi-square GoF test is well calibrated — in the sense of returning uniformly distributed p-values when the null hypothesis is true — when the degrees of freedom is N-M (data bins minus free parameters). The extra -M term accounts for the bias that would be introduced by minimising the fit statistic with respect to the M parameters. I don’t see why there will be a bias.
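
That calibration claim is easy to check by simulation; the straight-line model, sample size, and noise level below are arbitrary choices added purely for illustration:

```python
# Fit a straight line (M = 2 parameters) to Gaussian data drawn from the null
# model, compute the GoF p-value with dof = N - M, and confirm the p-values come
# out roughly uniform.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, M, sigma, nsim = 50, 2, 1.0, 5000
x = np.linspace(0.0, 1.0, N)
true_y = 2.0 + 3.0 * x

pvals = []
for _ in range(nsim):
    y = true_y + rng.normal(0.0, sigma, N)
    coeffs = np.polyfit(x, y, 1)                    # least-squares (chi-square) fit
    chisq = np.sum((y - np.polyval(coeffs, x)) ** 2 / sigma**2)
    pvals.append(stats.chi2.sf(chisq, N - M))

pvals = np.array(pvals)
print("mean p-value:", pvals.mean())                # ~0.5 if well calibrated
print("KS test vs uniform:", stats.kstest(pvals, "uniform").pvalue)
```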

* “the reduced chi square is always very close to unity…” I agree this is a bit confusing. Many people use the rule of thumb that they should get “red. chi-square approx. 1” for a ‘good’ fit. This might be traced back at least as far as Bevington (e.g. p. 68, 3rd ed.), which is a standard book used by physicists/astronomers. OK, it is a reasonable approximate guideline, but to do the job properly one needs the (absolute) chi-square and DOF values, and then to perform the GoF test. With that done there’s nothing to be gained from employing the approximate guideline.
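
The difference is easy to see numerically: the same reduced chi-square corresponds to very different GoF p-values depending on the number of degrees of freedom (the values below are illustrative, not from any particular fit):

```python
# Same reduced chi-square, very different goodness-of-fit p-values.
from scipy import stats

for dof in (10, 100, 1000):
    chisq = 1.2 * dof
    print(f"dof = {dof:4d}  red. chi2 = 1.2  p-value = {stats.chi2.sf(chisq, dof):.5f}")
```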

* “a low f-test probability for this suggests that we cannot constrain the width” Again, this seems quite clear to this astronomer. They apparently fitted the data using two models: (1) including an emission line of fixed width (“simple narrow redshifted Gaussian line fit”) and (2) leaving the line width as a free parameter. They then used the F-test as a Likelihood Ratio Test (LRT) to assess whether there is any justification for a free rather than fixed width. (If the data have N bins and the models have M_1 and M_2 free parameters, the two degrees of freedom are M_2-M_1 and N-M_2 for the F-test, although this depends on the way that F is formulated.) From this I see no obvious objection; without more analysis it seems to satisfy the conditions outlined in Protassov et al. (2002). They could very well have simply measured the change in chi-square fit statistic between models (1) and (2), and then calculated a p-value from the tail area in the chi-square distribution with M_2-M_1 DOF. This would have avoided using F and been slightly closer to the LRT.
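
For concreteness, here is one common way the two recipes are computed for nested chi-square fits; the bin count, parameter counts, and fit statistics below are invented for illustration and do not come from the paper being discussed.

```python
# F-test for nested models (1) and (2) versus the simpler delta-chi-square test
# with M_2 - M_1 degrees of freedom.  All numbers are made up for illustration.
from scipy import stats

N = 200                        # hypothetical number of spectral bins
M1, M2 = 4, 5                  # free parameters: fixed-width vs free-width line model
chisq1, chisq2 = 215.0, 209.5  # hypothetical best-fit chi-square values

# One common formulation of the F statistic for nested chi-square fits
F = ((chisq1 - chisq2) / (M2 - M1)) / (chisq2 / (N - M2))
p_f = stats.f.sf(F, M2 - M1, N - M2)
print("F =", round(F, 3), "  p(F-test) =", round(p_f, 4))

# Delta-chi-square version: chisq1 - chisq2 ~ chi2 with M_2 - M_1 dof under the null
p_dchi = stats.chi2.sf(chisq1 - chisq2, M2 - M1)
print("delta chi2 =", chisq1 - chisq2, "  p(delta-chi2) =", round(p_dchi, 4))
```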
