The AstroStat Slog » BIC

[ArXiv] classifying spectra

hlee — Fri, 23 Oct 2009 00:08:07 +0000

[arXiv:stat.ME:0910.2585]
Variable Selection and Updating In Model-Based Discriminant Analysis for High Dimensional Data with Food Authenticity Applications
by Murphy, Dean, and Raftery

Classifying or clustering spectra (or applying semi-supervised learning to them) is a very challenging problem, from collecting statistical-analysis-ready data to reducing the dimensionality without sacrificing the complex information in each spectrum. The challenge lies not only in estimating spiky (non-differentiable) curves via statistically well-defined procedures such as estimating equations, but also in transforming the data so that they satisfy the regularity conditions assumed in statistics.

Another reason that classifying and clustering astrophysical spectroscopic data is more difficult is that the observed lines, and their intensities and FWHMs on top of the continuum, are related to atomic databases and to latent variables/hyperparameters (distance, rotation, absorption, column density, temperature, metallicity, types, system properties, etc.). Frequently, separating lines from one another and from the continuum becomes a very challenging mixture problem (with boundary and identifiability issues). This complexity appears only in astronomical spectroscopic data because we only get indirect or uncontrolled data ruled by physics, as opposed to the meat species spectra in the paper. Spectroscopic data outside astronomy are rather smooth, are observed in a controlled wavelength range, and need no correction for recession/radial velocity/redshift/extinction/lensing/etc.

Although the part most relevant to astronomers, i.e. spectroscopic data processing, is not discussed in this paper, the most important part, the application of statistical learning to complex curves (spectral data), is well described. Astronomers with appropriate data may want to try the variable selection strategy and check out the classification methods from statistics. If it works out, it might save space for storing spectral data and time for collecting high-resolution spectra. Please keep in mind that it is not necessary to use the same variable selection strategy. Astronomers can create better-working versions for classification and clustering purposes, like hardness ratios, which are often used to reduce the dimensionality of spectral data since low total count spectra are not informative over the full energy (wavelength) range. Curse of dimensionality!
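As a rough illustration of how a hardness ratio compresses a spectrum into a single number, here is a minimal sketch. It assumes one common convention, HR = (H - S)/(H + S), with S and H the counts in a soft and a hard band; the post does not define the ratio, and the function name and the counts below are made up for illustration.

```python
import numpy as np

def hardness_ratio(soft_counts, hard_counts):
    """Return (H - S) / (H + S); NaN where both bands have zero counts.

    Assumed convention: the post itself does not define the ratio.
    """
    s = np.asarray(soft_counts, dtype=float)
    h = np.asarray(hard_counts, dtype=float)
    total = h + s
    out = np.full_like(total, np.nan)
    # Divide only where there are counts, leaving NaN elsewhere.
    np.divide(h - s, total, out=out, where=total > 0)
    return out

# Hypothetical counts for three sources in a soft and a hard band.
soft = np.array([30, 5, 0])
hard = np.array([10, 15, 0])
hr = hardness_ratio(soft, hard)  # -0.5, 0.5, nan
```

Each source's full spectrum is reduced to one value between -1 and 1, which is the kind of drastic dimensionality reduction mentioned above. A source with no counts in either band carries no information and is left as NaN.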

[ArXiv] 3rd week, Jan. 2008

hlee — Fri, 18 Jan 2008 18:24:23 +0000

Seven preprints were chosen this week and two mentioned model selection.

[astro-ph:0801.2186] Extrasolar planet detection by binary stellar eclipse timing: evidence for a third body around CM Draconis H.J.Deeg (it discusses model selection in section 4.4)
[astro-ph:0801.2156] Modeling a Maunder Minimum A. Brandenburg & E. A. Spiegel (it could be useful for those who do sunspot cycle modeling)
[astro-ph:0801.1914] A closer look at the indications of q-generalized Central Limit Theorem behavior in quasi-stationary states of the HMF model A. Pluchino, A. Rapisarda, & C. Tsallis
[astro-ph:0801.2383] Observational Constraints on the Dependence of Radio-Quiet Quasar X-ray Emission on Black Hole Mass and Accretion Rate B.C. Kelly et al.
[astro-ph:0801.2410] Finding Galaxy Groups In Photometric Redshift Space: the Probability Friends-of-Friends (pFoF) Algorithm I. Li & H. K.C. Yee
[astro-ph:0801.2591] Characterizing the Orbital Eccentricities of Transiting Extrasolar Planets with Photometric Observations E. B. Ford, S. N. Quinn, & D. Veras
[astro-ph:0801.2598] Is the anti-correlation between the X-ray variability amplitude and black hole mass of AGNs intrinsic? Y. Liu & S. N. Zhang

[ArXiv] 2nd week, Jan. 2008

hlee — Fri, 11 Jan 2008 19:44:44 +0000

It is notable that there is an astronomy paper with AIC, BIC, and Bayesian evidence in its title. The topic of the paper, unsurprisingly, is cosmology, like other astronomy papers that have discussed these (statistical) information criteria. (I have found only a couple of papers on model selection applied to astronomical data analysis that do not involve the CMB. Note that I exclude the Bayes factor used for model selection.)

The paper and other interesting ones are listed below:

[astro-ph:0801.0638]
AIC, BIC, Bayesian evidence and a notion on simplicity of cosmological model M Szydlowski & A. Kurek
[astro-ph:0801.0642]
Correlation of CMB with large-scale structure: I. ISW Tomography and Cosmological Implications S. Ho et al.
[astro-ph:0801.0780]
The Distance of GRB is Independent from the Redshift F. Song
[astro-ph:0801.1081]
A robust statistical estimation of the basic parameters of single stellar populations. I. Method X. Hernandez and D. Valls–Gabaud
[astro-ph:0801.1106]
A Catalog of Local E+A(post-starburst) Galaxies selected from the Sloan Digital Sky Survey Data Release 5 T. Goto (Carefully built catalogs are wonderful sources for classification/supervised learning, or semi-supervised learning)
[astro-ph:0801.1358]
A test of the Poincare dodecahedral space topology hypothesis with the WMAP CMB data B.S. Lew & B.F. Roukema

In cosmology, the few candidate models to choose from are generally nested: a larger model usually has extra terms compared with the smaller ones. How the penalty for the extra terms is defined leads to different model selection criteria. However, astronomy papers in general never discuss the consistency or statistical optimality of these selection criteria; most likely they rely on Monte Carlo simulations and extensive comparisons across the criteria. Nonetheless, my personal thought is that astronomers should be encouraged to study model selection, to prevent the fallacy of blindly fitting models that might be irrelevant to the information the data set contains. Physics tells us a correct model, but data do the same.
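To make the role of the penalty concrete, here is a minimal sketch comparing nested models with AIC and BIC. It is not taken from any of the listed papers: the nested polynomial models, the simulated data, and the Gaussian error assumption are all illustrative choices, and the function name is hypothetical.

```python
import numpy as np

def gaussian_aic_bic(y, y_fit, n_coef):
    """AIC and BIC for a least-squares fit, assuming Gaussian errors
    with unknown variance (an illustrative assumption)."""
    n = y.size
    rss = np.sum((y - y_fit) ** 2)
    sigma2 = rss / n                       # maximum-likelihood variance
    log_like = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = n_coef + 1                         # coefficients plus the variance
    aic = 2 * k - 2 * log_like
    bic = k * np.log(n) - 2 * log_like
    return aic, bic

# Simulated data from a quadratic; the true model is degree 2.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=x.size)

# Nested candidates: polynomials of degree 1 through 5.
results = {}
for degree in range(1, 6):
    coef = np.polyfit(x, y, degree)
    results[degree] = gaussian_aic_bic(y, np.polyval(coef, x), degree + 1)

best_aic = min(results, key=lambda d: results[d][0])
best_bic = min(results, key=lambda d: results[d][1])
```

The two criteria share the same likelihood term and differ only in the penalty per parameter (2 for AIC, ln n for BIC). Among nested candidates, BIC therefore never picks a larger model than AIC once ln n exceeds 2.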

[ArXiv] Kernel Regression, June 20, 2007

hlee — Mon, 25 Jun 2007 17:27:54 +0000

One of the papers from arxiv/astro-ph, astro-ph/0706.2704, discusses kernel regression and model selection for determining photometric redshifts. The paper presents the authors’ studies on choosing the kernel bandwidth via 10-fold cross-validation, choosing appropriate models from various combinations of input parameters by estimating the root mean square error and AIC, and comparing their kernel regression with other regression and classification methods using root mean square errors taken from a literature survey. They conclude that kernel regression is flexible, particularly for data at high z.
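The bandwidth selection step can be sketched as follows. This is a toy version under stated assumptions, not the paper's implementation: it uses a one-dimensional Nadaraya-Watson estimator with a Gaussian kernel, simulated data, and a hand-picked bandwidth grid, whereas the paper works with multi-dimensional photometric inputs. All function names are hypothetical.

```python
import numpy as np

def nw_predict(x_train, y_train, x_query, bandwidth):
    """Nadaraya-Watson estimate with a Gaussian kernel (1-D inputs)."""
    diff = (x_query[:, None] - x_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * diff**2)
    denom = weights.sum(axis=1)
    # Guard against division by zero when all weights underflow.
    denom = np.where(denom > 0, denom, np.finfo(float).tiny)
    return (weights @ y_train) / denom

def cv_rmse(x, y, bandwidth, n_folds=10, seed=0):
    """Root mean square prediction error from k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(x.size), n_folds)
    sq_err = 0.0
    for test_idx in folds:
        train_mask = np.ones(x.size, dtype=bool)
        train_mask[test_idx] = False
        pred = nw_predict(x[train_mask], y[train_mask], x[test_idx], bandwidth)
        sq_err += np.sum((y[test_idx] - pred) ** 2)
    return np.sqrt(sq_err / x.size)

# Simulated stand-in for (predictor, redshift) pairs.
rng = np.random.default_rng(1)
x = rng.uniform(0, 3, 500)
y = np.sin(2 * x) + rng.normal(scale=0.2, size=x.size)

# Pick the bandwidth with the smallest 10-fold cross-validated RMSE.
bandwidths = np.array([0.02, 0.05, 0.1, 0.2, 0.5, 1.0])
scores = np.array([cv_rmse(x, y, h) for h in bandwidths])
best_h = bandwidths[np.argmin(scores)]
```

Every bandwidth is scored on the same fold assignment (fixed `seed`), so differences in RMSE reflect the bandwidth rather than the random split.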

Off topic but worth noting:
1. They used AIC for model comparison. Although BIC has many advocates, AIC would do a better job for analyzing catalog data (399,929 galaxies), since with such a huge sample the penalty term in BIC will tend to select the most parsimonious model.

2. Although a more detailed discussion has not been posted yet, I’d like to point out that photometric redshift studies are more or less regression problems. Whether they use sophisticated and up-to-date classification schemes such as support vector machines (SVM) and artificial neural networks (ANN), or classical regression methods, the goal of a photometric redshift study is to find predictors that give the right classification and a model built from those predictors. I wish there were some studies on quantile regression, which has recently received much attention in economics.

3. Adaptive kernels were mentioned, and the results of adaptive kernel regression are eagerly awaited.

4. Comparing root mean square errors from various classification and regression models based on Sloan Digital Sky Survey (SDSS) data from EDR (Early Data Release) to DR5 (Data Release 5) might lead to a misleading conclusion about the best regression/classification method, because the sample sizes differ from EDR to DR5. Further formulation, especially of the asymptotic properties of these root mean square errors, would be very useful for making a legitimate comparison among different regression/classification strategies.
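The remark in item 1 about the BIC penalty can be made concrete with a small worked calculation. The definitions below are the standard ones, stated here as an assumption since the post does not write them out: $\hat{L}$ is the maximized likelihood, $k$ the number of free parameters, and $n$ the sample size.

```latex
\begin{align*}
\mathrm{AIC} &= -2\ln\hat{L} + 2k, \\
\mathrm{BIC} &= -2\ln\hat{L} + k\ln n, \\
n = 399{,}929 \;&\Rightarrow\; \ln n \approx 12.90 .
\end{align*}
```

For this catalog, each additional parameter costs about 12.9 units under BIC against 2 under AIC. An extra parameter must therefore lower $-2\ln\hat{L}$ by more than about 12.9 before BIC prefers the larger model.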