The AstroStat Slog » exponential family
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

Cross-validation for model selection
Mon, 20 Aug 2007, by hlee
http://hea-www.harvard.edu/AstroStat/slog/2007/cross-validation-for-model-selection/

One of the most frequently cited papers in model selection is "An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike's Criterion" by M. Stone, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1 (1977), pp. 44-47.
(Akaike's 1974 paper, which introduced the Akaike Information Criterion (AIC), is the most frequently cited paper in the field of model selection.)

The popularity of AIC comes from its simplicity: by penalizing the maximum log likelihood with the number of model parameters (p), one can choose the model that best describes/generates the data. Nonetheless, we know that AIC has its shortcomings: all candidate models must be nested within each other and come from the same parametric family. For an exponential family, the trace of the product of the score covariance and the inverse Fisher information reduces to the number of parameters, which naturally raises the question, "what happens when this trace cannot be obtained analytically?"
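
For concreteness, in standard notation (not tied to any particular paper's conventions),

$$ \mathrm{AIC} = -2\log L(\hat\theta) + 2p, $$

where $L(\hat\theta)$ is the maximized likelihood; the candidate model with the smallest AIC is selected.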

The general form of AIC is TIC (Takeuchi's information criterion, Takeuchi, 1976), in which the penalty term is written as the trace of the product of the score covariance and the inverse Fisher information. Still, I haven't answered the question above.
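
Written out, in one common notation (which may differ from Takeuchi's),

$$ \mathrm{TIC} = -2\log L(\hat\theta) + 2\,\operatorname{tr}\!\bigl(\hat{J}^{-1}\hat{I}\bigr), $$

where $\hat{J}$ is the Fisher information (minus the expected Hessian of the log likelihood) and $\hat{I}$ is the covariance of the score. When the model is correctly specified, $\hat{I}=\hat{J}$, the trace collapses to $p$, and TIC reduces to AIC.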

I personally think that the trick for avoiding this dilemma is the key contribution of Stone (1977): cross-validation. Stone proved that model choice by cross-validated log likelihood is asymptotically equivalent to choice by AIC, without computing the score function and Fisher information or obtaining an exact estimate of the number of parameters. Cross-validation makes it possible to obtain penalized maximum log likelihoods across models (the penalty is necessary because the parameters are estimated), so that comparison among models becomes feasible while the worry about getting the proper number of parameters (the penalization) is alleviated.
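
As a rough illustration of the idea, here is a minimal sketch of leave-one-out cross-validated log likelihood for comparing two non-nested candidate models. This is not Stone's procedure verbatim; the candidate models and simulated data are made up for the example, and numpy/scipy are assumed to be available.

```python
import numpy as np
from scipy import stats

def loo_cv_loglik(data, fit, logpdf):
    """Leave-one-out cross-validated log likelihood: refit the model
    without the i-th point, then score that point under the fit."""
    total = 0.0
    for i in range(len(data)):
        train = np.delete(data, i)
        params = fit(train)                 # MLE on the n-1 remaining points
        total += logpdf(data[i], *params)   # predictive log density of the held-out point
    return total

# Two hypothetical candidate models for positive-valued data.
fit_gamma   = lambda x: stats.gamma.fit(x, floc=0)     # returns (shape, loc, scale)
fit_lognorm = lambda x: stats.lognorm.fit(x, floc=0)   # returns (s, loc, scale)

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.5, size=50)           # simulated data, illustration only

print("gamma   :", loo_cv_loglik(x, fit_gamma,   stats.gamma.logpdf))
print("lognorm :", loo_cv_loglik(x, fit_lognorm, stats.lognorm.logpdf))
# The model with the larger cross-validated log likelihood is preferred,
# playing the same role as the model with the smaller AIC.
```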

Numerous tactics are available for model selection. Although variable selection (where the candidate models are generally nested) is a very hot topic in statistics these days and tons of publications can be found, there are not many works applying resampling methods to model selection. As Stone proved, cross-validation relieves the difficulty of calculating the score function and Fisher information of a model. Until last year I was working with Prof. Babu and Prof. Rao at Penn State on non-nested model selection (selecting the best model from different parametric families) with the jackknife (the paper hasn't been submitted yet), based on the finding that the jackknife yields an unbiased estimate of the maximum likelihood. Despite its high computational cost compared to cross-validation and the jackknife, the bootstrap has also occasionally appeared in model selection.

I'm not sure whether cross-validation or the jackknife is a feasible approach to implement in astronomical software packages when they compute statistics. It certainly has advantages when it comes to calculating likelihoods, such as the Cash statistic.
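
(As a reminder, in its common form the Cash statistic is just minus twice the Poisson log likelihood of binned counts $n_i$ under model predictions $e_i$, up to an additive term that does not depend on the model,

$$ C = 2\sum_i \left(e_i - n_i \ln e_i\right), $$

so any likelihood-based resampling scheme applies to it directly.)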

Coverage issues in exponential families
Thu, 16 Aug 2007, by hlee
http://hea-www.harvard.edu/AstroStat/slog/2007/interval-estimation-in-exponential-families/

I've heard a great deal about coverage problems from astrostat/phystat groups, without knowing the fundamental reasons (most likely physics). This paper might be of interest to those readers: Interval Estimation in Exponential Families by Brown, Cai, and DasGupta; Statistica Sinica (2003), 13, pp. 19-49.

Abstract summary:
The authors investigate issues in interval estimation of the mean in the exponential family, covering the binomial, Poisson, negative binomial, normal, gamma, and NEF-GHS distributions. The poor performance of the Wald interval is known not only for discrete cases but also for nonnormal continuous cases, where it shows a significant negative bias. Their computations suggest that the equal-tailed Jeffreys interval and the likelihood ratio interval are the best alternatives to the Wald interval.

Brief summary of the paper without equations:
The objective of this paper is interval estimation of the mean in the natural exponential family (NEF) with quadratic variance functions (QVF), with particular focus given to the discrete NEF-QVF families consisting of the binomial, negative binomial, and Poisson distributions. It is well known that the Wald interval for a binomial proportion suffers from a systematic negative bias and from oscillation in its coverage probability, even for large n and p near 0.5, which seems to arise from the lattice nature and the skewness of the binomial distribution. They exemplified this systematic bias and oscillation with Poisson cases to illustrate the poor and erratic behavior of the Wald interval in lattice problems. They derived bias expressions for the three discrete NEF-QVF distributions and added a disconcerting graphical illustration of this negative bias.
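
For reference, the textbook definitions (my notation, not necessarily the paper's): for $X \sim \mathrm{Bin}(n,p)$ with $\hat p = X/n$, the Wald interval is

$$ \hat p \pm z_{\alpha/2}\sqrt{\hat p(1-\hat p)/n}, $$

while the equal-tailed Jeffreys interval takes the $\alpha/2$ and $1-\alpha/2$ quantiles of the $\mathrm{Beta}(X+\tfrac12,\, n-X+\tfrac12)$ posterior that follows from the Jeffreys prior $\mathrm{Beta}(\tfrac12,\tfrac12)$.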

Interested readers should check Figure 4, where the performances of the Wald, score, likelihood ratio (LR), and Jeffreys intervals are compared. Figure 5 illustrates the limits of those four intervals: the LR and Jeffreys intervals are indistinguishable. The authors derived the coverage probabilities of the four intervals via Edgeworth expansions and studied the nonoscillating O(n^-1) terms of those expansions to compare the coverage properties of the intervals. Figure 6 shows that the Wald interval has a serious negative bias, whereas the nonoscillating term of the score interval is positive for all three distributions: binomial, negative binomial, and Poisson. The negative bias of the Wald interval is also found in continuous distributions such as the normal, gamma, and NEF-GHS distributions (Figure 7).
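
The oscillation is easy to see for oneself: because X is discrete, the exact coverage of any binomial interval is a finite sum of binomial probabilities over the x values whose interval contains p. A minimal sketch (assuming numpy and scipy are available; the plain equal-tailed Jeffreys interval below ignores the boundary modifications at x = 0 and x = n discussed in the paper):

```python
import numpy as np
from scipy import stats

def wald_covers(x, n, p, z=1.959964):
    """True where the Wald interval built from count x covers p."""
    phat = x / n
    half = z * np.sqrt(phat * (1.0 - phat) / n)
    return (phat - half <= p) & (p <= phat + half)

def jeffreys_covers(x, n, p, alpha=0.05):
    """True where the equal-tailed Jeffreys interval covers p."""
    lo = stats.beta.ppf(alpha / 2, x + 0.5, n - x + 0.5)
    hi = stats.beta.ppf(1 - alpha / 2, x + 0.5, n - x + 0.5)
    return (lo <= p) & (p <= hi)

def exact_coverage(covers, n, p):
    """Exact coverage probability: sum the binomial pmf over covering x."""
    x = np.arange(n + 1)
    return np.sum(stats.binom.pmf(x, n, p) * covers(x, n, p))

p = 0.2
for n in (20, 30, 40, 50, 98, 100):
    print(n, round(exact_coverage(wald_covers, n, p), 4),
             round(exact_coverage(jeffreys_covers, n, p), 4))
# Compare each column against the nominal 0.95: the paper's point is that
# the Wald interval undercovers erratically, while the Jeffreys interval
# tracks the nominal level much more closely.
```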

In conclusion, they reconfirm their finding that the LR and Jeffreys intervals are the best alternatives to the Wald interval in terms of both negative coverage bias and interval length. The Rao score interval has the merit of a simple presentation, but its performance is inferior to the LR and Jeffreys intervals, although still better than the Wald interval. Even so, the authors leave room for the reader: choosing among these intervals is ultimately a personal choice.

[Addendum] I wonder whether the statistical properties of Gehrels' confidence limits have been studied since that publication. I'll try to post findings about the statistics of Gehrels' confidence limits shortly (hopefully).
