#### Goodness-of-fit tests

When it comes to measuring goodness-of-fit, the Pearson χ^{2} test is the dominant player in the race, and the Kolmogorov-Smirnov test trails far behind. Although they seem almost invisible in this race, there are many other nonparametric statistics for testing goodness-of-fit and for comparing the distribution of a sample to a reference distribution, all legitimate participants trained by many statisticians. Listing their names is probably useful to astronomers who find that the underlying assumptions of the χ^{2} test do not match their data. Some astronomers may also want to try nonparametric tests other than the K-S test. I’ve seen other test statistics in astronomical journals from time to time. Depending on the data and their statistical properties, one test statistic can work better than another; therefore, it’s worthwhile to keep in mind that there are other goodness-of-fit tests beyond the χ^{2} test.

This is the list I can think of at the moment; each test is linked to Wikipedia for more details.

- Wilcoxon rank-sum test (also called the Mann–Whitney U, Mann–Whitney–Wilcoxon (MWW), or Wilcoxon–Mann–Whitney test)
- Wilcoxon signed-rank test
- Anderson-Darling test
- Cramér-von Mises test
- Shapiro-Wilk test
- Siegel-Tukey test
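
To make the difference between some of these tests concrete, here is a minimal sketch of the three one-sample statistics in this family that are built on the empirical distribution function: Kolmogorov-Smirnov, Cramér-von Mises, and Anderson-Darling. It assumes a fully specified reference distribution (no parameters estimated from the data) and uses only the Python standard library. The function name `edf_statistics` and the simulated sample are illustrative assumptions, not part of any package, and the sketch returns the statistics only, not p-values.

```python
import math
import random
from statistics import NormalDist


def edf_statistics(sample, cdf):
    """Return (D, W2, A2): the K-S, Cramer-von Mises and Anderson-Darling
    statistics of `sample` against a fully specified reference `cdf`.

    Assumes every cdf value lies strictly between 0 and 1; otherwise the
    logarithms in the Anderson-Darling term are undefined.
    """
    # The CDF is monotone, so sorting the CDF values is the same as
    # evaluating the CDF at the sorted sample.
    u = sorted(cdf(x) for x in sample)
    n = len(u)

    # K-S: largest gap between the empirical and the reference CDF.
    d = max(max((i + 1) / n - ui, ui - i / n) for i, ui in enumerate(u))

    # Cramer-von Mises: squared gaps summed over the whole sample.
    w2 = 1 / (12 * n) + sum(
        (ui - (2 * i + 1) / (2 * n)) ** 2 for i, ui in enumerate(u)
    )

    # Anderson-Darling: like Cramer-von Mises, but the logarithms give
    # extra weight to discrepancies in the tails.
    a2 = -n - sum(
        (2 * i + 1) * (math.log(u[i]) + math.log(1 - u[n - 1 - i]))
        for i in range(n)
    ) / n

    return d, w2, a2


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the illustration is repeatable
    data = [rng.gauss(0.0, 1.0) for _ in range(200)]  # made-up sample
    d, w2, a2 = edf_statistics(data, NormalDist(0.0, 1.0).cdf)
    print(f"K-S D = {d:.4f}, CvM W^2 = {w2:.4f}, A-D A^2 = {a2:.4f}")
```

The K-S statistic reacts only to the single largest gap, whereas the other two accumulate discrepancies over the whole sample, which is one reason they can behave differently on the same data. The rank-based tests in the list (Wilcoxon, Siegel-Tukey) compare two samples and are not covered by this sketch.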

Before I update this list, I welcome comments that help it grow. I’d also appreciate it if your comment explained when the nonparametric test you recommend works better, along with a short description of your data’s characteristics. And don’t forget to look at a Q-Q plot before discussing the implications of p-values from these test statistics.
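
As a rough illustration of the Q-Q plot step, the sketch below computes the point pairs for a normal Q-Q plot using only the Python standard library. The function name `normal_qq_points` is hypothetical, and the plotting positions (i - 0.5)/n are just one common convention; other choices exist and give slightly different points.

```python
from statistics import NormalDist


def normal_qq_points(sample):
    """Return (theoretical quantile, observed value) pairs for a normal
    Q-Q plot, using plotting positions (i - 0.5) / n for i = 1..n."""
    xs = sorted(sample)
    n = len(xs)
    ref = NormalDist()  # standard normal reference distribution
    # With a 0-based index i, the position (i + 0.5) / n matches
    # (i - 0.5) / n in the usual 1-based notation.
    return [(ref.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]


if __name__ == "__main__":
    # Made-up values, for illustration only.
    for q, x in normal_qq_points([2.1, -0.3, 0.8, 1.5, -1.2, 0.1]):
        print(f"{q:+.3f}  {x:+.3f}")
```

Plotting the second column against the first should give points close to a straight line when the sample is roughly normal; systematic curvature or stray points in the tails are worth a look before reading much into a p-value.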

## vlk:

I think the reason people hesitate to use these other tests is that over the years statisticians have rather strongly suggested that these tests not be used for parameter estimation. If one doesn’t use something to estimate the best-fit parameters, one would tend not to use it to verify goodness of fit either, especially since it can’t be used to derive error bars. The K-S test also has the rep of not being very powerful; even the improved versions like Anderson-Darling don’t do as well as chisq. (This is not a mathematical theorem, it’s just my impression developed over many years of using it.)

10-06-2009, 3:25 pm