The AstroStat Slog » cross-validation
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

[ArXiv] classifying spectra (Fri, 23 Oct 2009, by hlee)

[arXiv:stat.ME:0910.2585]
Variable Selection and Updating In Model-Based Discriminant Analysis for High Dimensional Data with Food Authenticity Applications
by Murphy, Dean, and Raftery

Classifying or clustering spectra (or applying semi-supervised learning to them) is a very challenging problem, from collecting statistical-analysis-ready data to reducing the dimensionality without sacrificing the complex information in each spectrum. It is hard not only to estimate spiky (non-differentiable) curves via statistically well defined estimating-equation procedures, but also to transform the data so that they satisfy the regularity conditions assumed in statistics.

Another reason that classifying and clustering spectroscopic data is harder in astrophysics is that the observed lines, and their intensities and FWHMs on top of the continuum, are tied to atomic databases and to latent variables or hyperparameters (distance, rotation, absorption, column density, temperature, metallicity, object type, system properties, etc.). Separating lines from one another and from the continuum frequently becomes a very challenging mixture problem (boundary and identifiability issues). This complexity appears only in astronomical spectroscopic data because we get only indirect or uncontrolled data governed by physics, as opposed to the meat-species spectra in the paper. Spectroscopic data outside astronomy are rather smooth, are observed over a controlled wavelength range, and carry no worries about correcting for recession/radial velocity/redshift/extinction/lensing/etc.

Although the part most relevant to astronomers, spectroscopic data processing, is not discussed in this paper, the most important part, the application of statistical learning to complex curves (spectral data), is well described. Astronomers with appropriate data may want to try the variable selection strategy and check out the classification methods from statistics. If it works out, it could save space for storing spectral data and time for collecting high resolution spectra. Keep in mind that using the same variable selection strategy is not necessary. Astronomers can create better working versions for classification and clustering purposes, like hardness ratios, which are often used to reduce the dimensionality of spectral data since low total count spectra are not informative over the full energy (wavelength) range. Curse of dimensionality!
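As a toy illustration of that last point, here is a minimal sketch that collapses simulated low-count spectra into two hardness ratios and hands them to an off-the-shelf classifier. The energy bands, the simulated spectra, and the use of scikit-learn's linear discriminant analysis are all assumptions made for this sketch, not anything taken from the paper above.

# Toy sketch: hardness ratios as a 2-D summary of 1024-channel counts spectra.
# Band boundaries, simulated spectra, and the classifier are illustrative assumptions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_chan = 1024
soft, med, hard = slice(0, 340), slice(340, 680), slice(680, 1024)  # hypothetical bands

def hardness_ratios(spectrum):
    """Collapse a counts spectrum into two hardness ratios."""
    s, m, h = spectrum[soft].sum(), spectrum[med].sum(), spectrum[hard].sum()
    tot = s + m + h + 1e-12
    return np.array([(h - s) / tot, (m - s) / tot])

def simulate(n, slope):
    """Poisson spectra (~200 counts each) with an exponential spectral shape."""
    shape = np.exp(slope * np.linspace(0, 1, n_chan))
    return rng.poisson(200 * shape / shape.sum(), size=(n, n_chan))

X = np.vstack([simulate(100, -3.0), simulate(100, +1.0)])   # "soft" vs "hard" sources
y = np.repeat([0, 1], 100)

features = np.apply_along_axis(hardness_ratios, 1, X)       # 1024 channels -> 2 numbers
clf = LinearDiscriminantAnalysis().fit(features, y)
print("training accuracy using only two hardness ratios:", clf.score(features, y))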

[ArXiv] Cross Validation (Wed, 12 Aug 2009, by hlee)

Statistical resampling methods are rather unfamiliar among astronomers. Bootstrapping may be an exception, but I feel it is still underrepresented. Having seen a recent review paper on cross validation on [arXiv], which describes the basic notions in theoretical statistics, I could not resist mentioning it here. Cross validation has been used in various statistical fields such as classification, density estimation, model selection, and regression, to name a few.

[arXiv:math.ST:0907.4728]
A survey of cross validation procedures for model selection by Sylvain Arlot

Nonetheless, I will not review the paper itself beyond a few quotes:

- CV is a popular strategy for model selection and algorithm selection.
- Compared to the resubstitution error, CV avoids overfitting because the training sample is independent from the validation sample.
- As noticed in the early 30s by Larson (1931), training an algorithm and evaluating its statistical performance on the same data yields overoptimistic results.
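The last two quotes are easy to see numerically. Here is a minimal numpy sketch (my own made-up data and models, not from the survey): polynomials of increasing degree are fit to noisy data, and the resubstitution error keeps shrinking while the cross-validated error exposes the overfitting.

# Resubstitution error vs. 5-fold cross-validated error for polynomial fits.
# Synthetic data and the chosen degrees are purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-1, 1, 60))
y = np.sin(np.pi * x) + rng.normal(0, 0.3, x.size)

def cv_mse(x, y, degree, k=5):
    """K-fold cross-validated mean squared error of a polynomial fit."""
    folds = np.array_split(rng.permutation(x.size), k)
    errs = []
    for test in folds:
        train = np.setdiff1d(np.arange(x.size), test)
        coef = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coef, x[test]) - y[test]) ** 2))
    return np.mean(errs)

for degree in (1, 3, 7, 15):
    coef = np.polyfit(x, y, degree)
    resub = np.mean((np.polyval(coef, x) - y) ** 2)   # fit and evaluate on the same data
    print(f"degree {degree:2d}: resubstitution MSE {resub:.3f}, 5-fold CV MSE {cv_mse(x, y, degree):.3f}")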

There are books on statistical resampling methods that cover more general topics, not limited to model selection. Instead of reviewing those, I decided to do a little search on how CV is used in astronomy. These are the ADS search results: more publications than I expected.

One can easily see that many of these works adopted CV in a machine learning context. The application of CV, and of bootstrapping, is not limited to machine learning. As Arlot's title indicates, CV is used for model selection. When it comes to model selection in high energy astrophysics, however, the standard procedure is not CV but reduced chi^2 measures and eyeballing the fitted curve. Hopefully a renovated model selection procedure via CV, or another statistically robust strategy, will soon challenge the reduced chi^2 and the eyeballing. On the other hand, I doubt that it will come soon. Remember, eyes are the best classifier, so it will not be an easy task.
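To make the contrast concrete, here is a sketch of both criteria applied to the same toy problem: deciding how many emission lines to keep in a spectral model. The simulated spectrum, the fixed candidate line centers, and the Gaussian errors are all assumptions for illustration; this is not a recipe from any fitting package.

# Reduced chi^2 vs. 5-fold CV chi^2 for choosing the number of emission lines.
# Everything here (spectrum, line centers, errors) is simulated for illustration.
import numpy as np

rng = np.random.default_rng(2)
energy = np.linspace(1.0, 10.0, 300)
sigma_line = 0.1
truth = 50 - 3 * energy                                   # linear continuum
for c in (3.0, 6.5):                                      # the "true" spectrum has 2 lines
    truth += 40 * np.exp(-0.5 * ((energy - c) / sigma_line) ** 2)
err = np.full_like(energy, 3.0)
counts = truth + rng.normal(0, err)

candidate_centres = [3.0, 6.5, 4.2, 8.1]                  # real and spurious candidates

def design(k):
    """Design matrix: linear continuum plus the first k candidate lines."""
    cols = [np.ones_like(energy), energy]
    cols += [np.exp(-0.5 * ((energy - c) / sigma_line) ** 2) for c in candidate_centres[:k]]
    return np.column_stack(cols)

def wls(A, y, w):
    """Weighted linear least squares."""
    beta, *_ = np.linalg.lstsq(A * w[:, None], y * w, rcond=None)
    return beta

for k in range(5):
    A, w = design(k), 1.0 / err
    beta = wls(A, counts, w)
    red_chi2 = np.sum(((counts - A @ beta) / err) ** 2) / (energy.size - A.shape[1])

    folds = np.array_split(rng.permutation(energy.size), 5)
    cv = 0.0
    for test in folds:                                    # refit without the held-out channels
        train = np.setdiff1d(np.arange(energy.size), test)
        b = wls(A[train], counts[train], w[train])
        cv += np.sum(((counts[test] - A[test] @ b) / err[test]) ** 2)
    print(f"{k} lines: reduced chi^2 = {red_chi2:.2f}, CV chi^2 = {cv:.1f}")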

Cross-validation for model selection (Mon, 20 Aug 2007, by hlee)

One of the most frequently cited papers in model selection is "An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike's Criterion" by M. Stone, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1 (1977), pp. 44-47.
(Akaike's 1974 paper, which introduced the Akaike Information Criterion (AIC), is the most often cited paper in the subject of model selection.)

The popularity of AIC comes from its simplicity: by penalizing the maximized log likelihood with the number of model parameters (p), one can choose the model that best describes/generates the data. Nonetheless, we know that AIC has its shortcomings: the candidate models must all be nested within each other and come from the same parametric family. For an exponential family, the trace of the product of the score covariance and the inverse Fisher information reduces to the number of parameters, which immediately raises the question, "what happens when the trace cannot be obtained analytically?"

The general form of AIC is called TIC (Takeuchi's information criterion; Takeuchi, 1976), in which the penalty term is written as the trace of the product of the score covariance and the inverse Fisher information. Still, I have not answered the question above.
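For reference, the two criteria take the following standard textbook forms (my notation, not quoted from the papers above):

\mathrm{AIC} = -2\log L(\hat\theta) + 2p, \qquad
\mathrm{TIC} = -2\log L(\hat\theta) + 2\,\operatorname{tr}\!\bigl(\hat{J}\,\hat{I}^{-1}\bigr),

where \hat{J} is the empirical covariance of the score and \hat{I} is the observed Fisher information; when the model is correctly specified, J = I and the trace collapses to p.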

I personally think that the trick to avoid this dilemma is the key content of Stone (1977): cross-validation. Stone proved that model choice by the cross-validated log likelihood is asymptotically equivalent to model choice by AIC, without computing the score function and Fisher information or obtaining an exact count of the number of parameters. Cross-validation yields penalized maximum log likelihoods across models (penalization is necessary because the parameters are estimated), so that comparison among models for selection becomes feasible while it relieves the worry of getting the proper number of parameters (the penalty).
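A rough numerical illustration of this equivalence (my own toy setup, not Stone's): for nested Gaussian regression models, rank the candidates both by AIC and by the leave-one-out cross-validated log likelihood, and the two orderings tend to agree.

# Toy check of the AIC / leave-one-out CV log-likelihood correspondence.
# Data, the known noise level, and the polynomial models are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.4                                    # noise standard deviation, assumed known
x = np.linspace(-1, 1, 80)
y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(0, sigma, x.size)   # true model is quadratic

def gauss_loglik(resid):
    """Gaussian log likelihood of residuals with known sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * (resid / sigma) ** 2)

for degree in range(1, 6):
    p = degree + 1
    coef = np.polyfit(x, y, degree)
    aic = -2 * gauss_loglik(y - np.polyval(coef, x)) + 2 * p

    loo = 0.0                                  # leave-one-out CV log likelihood
    for i in range(x.size):
        mask = np.arange(x.size) != i
        c = np.polyfit(x[mask], y[mask], degree)
        loo += gauss_loglik(y[i] - np.polyval(c, x[i]))
    print(f"degree {degree}: AIC = {aic:7.2f}, LOO-CV log likelihood = {loo:7.2f}")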

Numerous tactics are available for model selection. Although variable selection (where the candidate models are generally nested) is a very hot topic in statistics these days and tons of publications can be found, not many works apply resampling methods to model selection. As Stone proved, cross-validation relieves the difficulty of calculating a model's score function and Fisher information. Until last year I was working on non-nested model selection (selecting the best model from different parametric families) via the jackknife, with Prof. Babu and Prof. Rao at Penn State (the paper has not been submitted yet), based on the finding that the jackknife yields an unbiased maximum likelihood. Despite its high computational cost compared to cross-validation and the jackknife, the bootstrap has also occasionally appeared in model selection.

I am not sure whether cross-validation or the jackknife is a feasible approach to implement in astronomical software packages when they compute statistics. It certainly has advantages when it comes to calculating likelihoods, such as the Cash statistic.
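For context, the Cash statistic mentioned above is just minus twice the Poisson log likelihood of binned counts (standard form from Cash 1979, written in my notation):

C = 2 \sum_i \bigl( m_i - d_i \ln m_i \bigr) + \mathrm{const},

where d_i are the observed counts and m_i the model-predicted counts in bin i, so a cross-validated version would simply evaluate this sum on held-out bins using m_i fitted from the remaining bins.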

[ArXiv] Kernel Regression, June 20, 2007 (Mon, 25 Jun 2007, by hlee)

One of the papers from arxiv/astro-ph discusses kernel regression and model selection for determining photometric redshifts (astro-ph/0706.2704). The paper presents the authors' studies on choosing the kernel bandwidth via 10-fold cross-validation, choosing appropriate models from various combinations of input parameters by estimating the root mean square error and AIC, and evaluating their kernel regression against other regression and classification methods using root mean square errors from a literature survey. They conclude that kernel regression is flexible, particularly for data at high z.
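Since the paper's own machinery is not reproduced here, below is a generic sketch of Nadaraya-Watson kernel regression with the bandwidth chosen by 10-fold cross-validation; the synthetic "colour versus redshift" data are invented purely to show the mechanics, not to mimic the paper's catalog.

# Nadaraya-Watson kernel regression with the bandwidth chosen by 10-fold CV.
# The synthetic (colour, redshift) pairs and the bandwidth grid are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n = 400
colour = rng.uniform(0, 2, n)                                  # fake photometric colour
z = 0.3 + 0.5 * np.sin(2 * colour) + rng.normal(0, 0.05, n)    # fake redshift

def nw_predict(x_train, y_train, x_new, h):
    """Gaussian-kernel Nadaraya-Watson estimate at x_new with bandwidth h."""
    w = np.exp(-0.5 * ((x_new[:, None] - x_train[None, :]) / h) ** 2)
    return (w @ y_train) / w.sum(axis=1)

def cv_rmse(h, k=10):
    """Root mean square prediction error under k-fold cross-validation."""
    folds = np.array_split(rng.permutation(n), k)
    sq = []
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        sq.append((nw_predict(colour[train], z[train], colour[test], h) - z[test]) ** 2)
    return np.sqrt(np.mean(np.concatenate(sq)))

bandwidths = np.logspace(-2, 0, 15)
scores = [cv_rmse(h) for h in bandwidths]
best = bandwidths[int(np.argmin(scores))]
print(f"10-fold CV selects bandwidth h = {best:.3f} (RMSE = {min(scores):.4f})")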

Off topic, but worth noting:
1. They used AIC for model comparison. In spite of the many advocates for BIC, choosing AIC would do a better job for analyzing catalog data (399,929 galaxies), since with such a huge sample the penalty term in BIC leads to selecting the most parsimonious model (a short numerical note follows this list).

2. Although a more detailed discussion has not been posted, I would like to point out that photometric redshift studies are more or less regression problems. Whether they use sophisticated and up-to-date classification schemes such as support vector machines (SVM), artificial neural networks (ANN), or classical regression methods, the goal of a photometric redshift study is to find the predictors for correct classification and the model built from those predictors. I wish there were some studies using quantile regression, which has received a lot of attention recently in economics.

3. Adaptive kernels were mentioned, and results from adaptive kernel regression are highly anticipated.

4. Comparing root mean square errors of various classification and regression models based on Sloan Digital Sky Survey (SDSS) data ranging from the EDR (Early Data Release) to DR5 (Data Release 5) might mislead the choice of the best regression/classification method because of the different sample sizes from EDR to DR5. Further formulation, especially of the asymptotic properties of these root mean square errors, would be very useful for making a legitimate comparison among different regression/classification strategies.
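The numerical note promised in item 1 (standard definitions, my arithmetic): AIC penalizes with 2p while BIC penalizes with p \log n, i.e.

\mathrm{AIC} = -2\log L + 2p, \qquad \mathrm{BIC} = -2\log L + p\log n;

with n = 399,929 galaxies, \log n \approx 12.9, so BIC charges each extra parameter roughly 6.5 times more than AIC does, pushing the selection toward the most parsimonious model.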
