The AstroStat Slog » Model Selection
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

[ArXiv] Cross Validation
http://hea-www.harvard.edu/AstroStat/slog/2009/arxiv-cross-validation/
Wed, 12 Aug 2009, by hlee

Statistical resampling methods are rather unfamiliar among astronomers. Bootstrapping may be an exception, but I feel it is still underrepresented. Having seen a recent review paper on cross validation on arXiv, which describes the basic notions in theoretical statistics, I couldn't resist mentioning it here. Cross validation has been used in various statistical fields such as classification, density estimation, model selection, and regression, to name a few.

[arXiv:math.ST:0907.4728]
A survey of cross validation procedures for model selection by Sylvain Arlot

Nonetheless, I will not review the paper itself here, beyond a few quotes:

-CV is a popular strategy for model selection and algorithm selection.
-Compared to the resubstitution error, CV avoids overfitting because the training sample is independent of the validation sample.
-As noticed in the early 30s by Larson (1931), training an algorithm and evaluating its statistical performance on the same data yields overoptimistic results.

There are books on statistical resampling methods covering more general topics, not limited to model selection. Instead, I decided to do a little search on how CV is used in astronomy. These are the ADS search results: more publications than I expected.

One can easily grasp that many adopted CV in the machine learning context. The application of CV and bootstrapping, however, is not limited to machine learning. As Arlot's title indicates, CV is used for model selection. When it comes to model selection in high energy astrophysics, the standard procedures are not CV but reduced chi^2 measures and eyeballing fitted curves. Hopefully, a renovated model selection procedure via CV or another statistically robust strategy will soon challenge the reduced chi^2 and the eyeballing. On the other hand, I doubt it will come soon. Remember, eyes are the best classifier, so it won't be an easy task.
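As a toy illustration of CV-based model selection (nothing below comes from Arlot's paper; the data, noise level, and fold count are all made up), one can pick a polynomial degree by 5-fold cross validation:

```python
import numpy as np

# Synthetic example: noisy quadratic data.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 60)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.1, x.size)

def cv_error(degree, k=5):
    """Mean squared prediction error of a polynomial fit, estimated by k-fold CV."""
    idx = rng.permutation(x.size)              # shuffle before splitting into folds
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        coef = np.polyfit(x[train], y[train], degree)   # fit on the training folds only
        resid = y[fold] - np.polyval(coef, x[fold])     # validate on the held-out fold
        errs.append(np.mean(resid**2))
    return float(np.mean(errs))

scores = {d: cv_error(d) for d in range(1, 6)}
best_degree = min(scores, key=scores.get)      # degree with the smallest CV error
```

Unlike the resubstitution error, the held-out error does not keep decreasing as the degree grows, which is what makes the comparison across model complexities meaningful.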

Curious Cases of the Null Hypothesis Probability
http://hea-www.harvard.edu/AstroStat/slog/2009/curious-cases-of-the-null-hypothesis-probability/
Tue, 02 Jun 2009, by hlee

Even though I traced astronomers' casual usage of the null hypothesis probability, reported straight from the output of their chosen data analysis packages, there were still some curious cases of the null hypothesis probability that I couldn't solve. They are quite mysterious to me. Sometimes too much creativity harms the original intention. Here are some examples.

A full text search in ADS for "null hypothesis probability" yields 77 related articles (link removed; ADS search result URLs do not appear to be stable). Many of them contain the phrase "null hypothesis probability" verbatim. The rest use it in the context of "given the null hypothesis, the probability of …". I'm not sure whether this ADS search covers the phrase when it appears in tables and captions, so more than 77 articles could exist. The majority of the articles with "null hypothesis probability" simply report numbers from the screen output of their chosen data analysis system. Discussion and interpretation of these numbers focus on a reduced χ2 close to ONE, astronomers' most favored model selection criterion. Sometimes I get confused about the goal of the fitting analysis, because the driving force is to "make the reduced chi-square close to one and make the residuals look good". Instead of serving statistical inference, the statistic works as an objective function, and the numerical (chi-square) or pictorial (residuals) criterion overshadows the fundamentals: relatively few photons are observed under a Poisson distribution, and those photons are convolved with complicated instruments. The statistical uncertainty may be underestimated and the reduced chi-square off from unity, yet robust statistics could still call the model a good fit.

Setting aside the business of the chi-square method, one thing I want to point out from this "null hypothesis probability" investigation is a clear distinction, in presentation style and in field, between papers using "null hypothesis probability" (spectral model fitting) and papers using "given the null hypothesis, the probability of …" (cosmology). Beyond this casual and personal observation about the style difference, the following quotes left me in despair because I couldn't find the answers in statistics.

  • MNRAS, v.340, pp.1261-1268 (2003): The temperature and distribution of gas in CL 0016+16 measured with XMM-Newton (Worrall and Birkinshaw)

    With reduced chi square of 1.09 (chi-sq=859 for 786 d.o.f) the null hypothesis probability is 4 percent but this is likely to result from the high statistical precision of the data coupled with small remaining systematic calibration uncertainties

    I couldn't understand why p-value = 0.04 is attributed to the high statistical precision of the data coupled with small remaining systematic calibration uncertainties. Is it a polite way of saying that the chi-square method is wrong because of systematic uncertainty? Or does it mean the statistical uncertainty is underestimated because of its correlation with the systematic uncertainty? Or, beyond being a p-value, does the null hypothesis probability carry some philosophical meaning? I could go on with strange questions, thanks to the statistical ambiguity of the statement. I'd appreciate any explanation of how this p-value (the null hypothesis probability) supports the subsequent interpretation.

    A related miscellaneous question: if the number (the null hypothesis probability) from a software package is unfavorable or uninterpretable, can we attribute such ambiguity to systematic error?
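For reference, the "null hypothesis probability" reported by fitting packages is just the upper-tail probability of the χ2 distribution evaluated at the fitted statistic. A minimal sketch that reproduces the quoted number, using the Wilson-Hilferty cube-root normal approximation (adequate at hundreds of degrees of freedom; a statistics library would give the exact tail):

```python
import math

def null_hypothesis_probability(chisq, dof):
    """P(X >= chisq) for X ~ chi-square(dof), via the Wilson-Hilferty
    cube-root normal approximation (accurate for large dof)."""
    z = ((chisq / dof) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * dof))) / math.sqrt(2.0 / (9.0 * dof))
    return 0.5 * math.erfc(z / math.sqrt(2.0))   # upper tail of the standard normal

# The fit quoted above: chi-sq = 859 for 786 d.o.f.
p = null_hypothesis_probability(859.0, 786.0)    # comparable to the quoted 4 per cent
```

Note that nothing in this number knows about systematic calibration uncertainty; it is purely the tail probability of the fit statistic under the assumed model.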

  • MNRAS, v. 345(2),pp.423-428 (2003): Iron K features in the hard X-ray XMM-Newton spectrum of NGC 4151 (Schurch, Warwick, Griffiths, and Sembay)
    The result of these modifications was a significantly improved fit (chi-sq=4859 for 4754 d.o.f). The model fit to the data is shown in Fig. 3 and the best-fitting parameter values for this revised model are listed as Model 2 in Table 1. The null hypothesis probability of this latter model (0.14) indicates that this is a reasonable representation of the spectral data to within the limits of the instrument calibration.

    What is the rule-of-thumb interpretation of p-values, or of this null hypothesis probability, in astronomy? How does one know that the fit is "reasonable", as the authors state? How does one know the limits of the instrument calibration, and compare against them quantitatively? And how about the degrees of freedom? Some thousands! So large. Even with a million photons, given the guideline for the number of bins[1], I doubt that the chi-square goodness of fit test is trustworthy at such a large number of degrees of freedom; the test may well become too conservative. Also, a distinction should be made between the chi-square minimization tactic and the chi-square goodness of fit test: using the same data for both procedures introduces bias.

  • MNRAS, v. 354, pp.10-24 (2004): Comparing the temperatures of galaxy clusters from hydrodynamical N-body simulations to Chandra and XMM-Newton observations (Mazzotta, Rasia, Moscardini, and Tormen)

    In particular, solid and dashed histograms refer to the fits for which the null hypothesis has a probability >5 percent (statistically acceptable fit) or <5 percent (statistically unacceptable fit), respectively. We also notice that the reduced chi square is always very close to unity, except in a few cases where the lower temperature components is at T~2keV, …

    The last statement obscures even further what "statistically (un)acceptable fit" really means. The notion of how well a model fits the data, and of how to test such a hypothesis, seems to differ between statistics and astronomy.

  • MNRAS, v.346(4),pp.1231-1241: X-ray and ultraviolet observations of the dwarf nova VW Hyi in quiescence (Pandel, Córdova, and Howell)

    As can be seen in the null hypothesis probabilities, the cemekl model is in very good agreement with the data.

    The null hypothesis probabilities computed from their Table 1 are 8.4, 25.7, 42.2, 1.6, 0.7*, and 13.1 percent (* is the result of the MKCFLOW model; the rest are the CEMEKL model). Presumably the criterion is to reject a model whose p-value falls below 0.01, so that the CEMEKL model cannot be rejected while the MKCFLOW model can. A single MKCFLOW fit, which happened to yield a small p-value, is used to declare that MKCFLOW is not in agreement while the alternative, the CEMEKL model, is a good model. That is too simplified a model selection/assessment procedure. I wonder why CEMEKL was tried with various settings but MKCFLOW only once. I guess there is an astrophysical reason for setting up the model comparison this way, but statistically speaking, it looks like comparing measurements of five different kinds of oranges and one apple, all measured with the same ruler (the null hypothesis probability from the chi-square fitting). From the experimental design viewpoint, this is not a well established study.

  • MNRAS, 349, 1267 (2004): Predictions on the high-frequency polarization properties of extragalactic radio sources and implications for polarization measurements of the cosmic microwave background (Tucci et al.)

    The correlation is less clear in the samples at higher frequencies (r~ 0.2 and a null-hypothesis probability of >10^{-2}). However, these results are probably affected by the variability of sources, because we are comparing data taken at different epochs. A strong correlation (r>0.5 and a null-hypothesis probability of <10^{-4}) between 5 and 43 GHz is found for the VLA calibrators, the polarization of which is measured simultaneously at all frequencies.

    I wonder what test statistic was used to compute those p-values, and whether they truly meant p-value > 10^{-2}; at that level most tools report a more precise number, which would allow a more suitable statement. The p-value (or the "null hypothesis probability") here tests whether r = 0 or not. Even when r is small, say 0.2, one can still reject the null hypothesis if the p-value falls below a threshold of 0.05, so reporting only >10^{-2} adds ambiguity. I think point estimates are enough to report the existence of weak and rather strong correlations; otherwise, reporting both p-values and powers seems more appropriate.
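Since the paper does not quote a sample size, here is a hedged sketch of how such p-values relate to r and n, using the Fisher z-transform for the usual test of H0: rho = 0; the sample sizes below are hypothetical, purely for illustration:

```python
import math

def corr_pvalue(r, n):
    """Two-sided p-value for H0: rho = 0, via the Fisher z-transform:
    atanh(r) is approximately normal with standard deviation 1/sqrt(n - 3)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2.0))   # twice the upper-tail normal probability

# Hypothetical sample sizes (n is not quoted in the paper):
p_weak = corr_pvalue(0.2, 100)    # r = 0.2 can already be significant at the 0.05 level
p_strong = corr_pvalue(0.5, 100)  # r = 0.5 falls far below 10^-4
```

The point is that significance depends on n as much as on r: the same r = 0.2 with only 20 points would not come close to rejection, which is why a bare ">10^{-2}" is hard to interpret.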

  • A&A, 342, 502 (1999): X-ray spectroscopy of the active dM stars: AD Leo and EV Lac
    (S. Sciortino, A. Maggio, F. Favata and S. Orlando)

    This fit yields a value of chi square of 185.1 with 145 ν corresponding to a null-hypothesis probability of 1.4% to give an adequate description of the AD Leo coronal spectrum. In other words the adopted model does not give an acceptable description of available data. The analysis of the uncertainties of the best-fit parameters yields the 90% confidence intervals summarized in Table 5, together with the best-fit parameters. The confidence intervals show that we can only set a lower-limit to the value of the high-temperature. In order to obtain an acceptable fit we have added a third thermal MEKAL component and have repeated the fit leaving the metallicity free to vary. The resulting best-fit model is shown in Fig. 7. The fit formally converges with a value of chi square of 163.0 for 145 ν corresponding to a probability level of ~ 9.0%, but with the hotter component having a “best-fit” value of temperature extremely high (and unrealistic) and essentially unconstrained, as it is shown by the chi square contours in Fig. 8. In summary, the available data constrain the value of metallicity to be lower than solar, and they require the presence of a hot component whose temperature can only be stated to be higher than log (T) = 8.13. Available data do not allow us to discriminate between the (assumed) thermal and a non-thermal nature of this hot component.
    …The fit yields a value of [FORMULA] of 95.2 (for 78 degree of freedom) that corresponds to a null hypothesis probability of 2.9%, i.e. a marginally acceptable fit. The limited statistic of the available spectra does not allow us to attempt a fit with a more complex model.

    After adding the third MEKAL component, why do the degrees of freedom remain the same? Also, what do they mean by "the limited statistic of the available spectra"?

  • MNRAS, 348, 529 (2004): Powerful, obscured active galactic nuclei among X-ray hard, optically dim serendipitous Chandra sources (Gandhi, Crawford, Fabian, Johnstone)

    …, but a low f-test probability for this suggests that we cannot constrain the width with the current data.
    While the rest frame equivalent width of the line is close to 1keV, its significance is marginal (f-test gives a null hypothesis probability of 0.1).

    Without a contingency table or an explicit pair of compared models, I was not sure how they carried out the F-test, and I could not find the two degrees of freedom it requires. From XSPEC's account of the F-test (http://heasarc.gsfc.nasa.gov/docs/xanadu/xspec/manual/XSftest.html), two degrees of freedom are needed; without them, no probability can be computed. Their usage of the F-test seems unconventional. The conventional application of the F-test is to compare the effects of multiple treatments (different levels of drug dosage, including a placebo); otherwise, it is just a chi-square goodness of fit test or a t-test.
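For comparison, the nested-model F-test described on the XSPEC page needs a chi-square value and a degrees-of-freedom count from each of the two fits. A sketch with made-up numbers (none of these are from the paper), using scipy for the F tail probability:

```python
from scipy.stats import f

# Hypothetical nested fits (invented for illustration):
chisq0, dof0 = 120.0, 100   # model without the extra line component
chisq1, dof1 = 110.0, 98    # model with it (2 extra free parameters)

# F statistic: chi-square drop per added parameter, scaled by the reduced
# chi-square of the larger model (as in XSPEC's ftest).
F = ((chisq0 - chisq1) / (dof0 - dof1)) / (chisq1 / dof1)
p = f.sf(F, dof0 - dof1, dof1)   # the "null hypothesis probability" of the F-test
```

So all four numbers (two chi-squares, two dof) must be reported for the probability to be reproducible, which is why a bare "f-test gives a null hypothesis probability of 0.1" is hard to check.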

  • Another occasion I came across is a null hypothesis probability of 0.99 interpreted as an indicator of a good fit; that is overfitting. Not only a very small null hypothesis probability but also one close to one should raise a flag for caution, because the latter indicates you are overdoing it (too many free parameters, for example).

Some residual ambiguity remains after deducing the definition of the null hypothesis probability by playing with XSPEC and tracking how it is used in the literature. Authors sometimes add creative comments to interpret the null hypothesis probability from their data analysis, which I cannot understand without statistical imagination. Perhaps most can be overlooked; or instead, they should be put to astronomers with statistical knowledge who could resolve my confusion about the null hypothesis probability. I would welcome comments from astronomers on how to view these quotes with statistical rigor. The list is personal; there are more whose point I really didn't understand, but many papers were straightforward in using null hypothesis probabilities as p-values for statistical inference under a simple null hypothesis. I listed only some, to convey my first impression of quotes from which I mostly couldn't draw statistical caricatures. Eventually, I hope some astronomers will straighten out the meaning and usage of the null hypothesis probability without overruling the basics of statistics.

I only want to add a caution about using the reduced chi-square as a model selection criterion. A reduced chi^2 close to unity indicates a good fit only when the grouped data are independent, so that the usual formula for the degrees of freedom, roughly the number of groups minus the number of free parameters, is valid. Personally, I doubt this holds in spectral fitting, where one cannot expect independence between neighboring bins: given a source model and the total counts, the counts in two neighboring groups are correlated. Grouping rules such as counts > 25 or S/N > 3 do not guarantee the independence assumption of the chi-square goodness of fit test, although they may suffice for the Gaussian approximation. Statisticians have devised various penalty terms and regularization methods for model selection suited to different data types. One approach is to compute proper degrees of freedom, the effective degrees of freedom, instead of n-p, to reflect the correlation across groups induced by the chosen source model and the calibration information. With a large number of counts or a large number of groups, unless properly penalized, the chi-square fit is less likely to reject the null hypothesis than a statistic with smaller degrees of freedom, a consequence of the curse of dimensionality.
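To see how much the inference rides on the degrees of freedom: the rule of thumb "reduced chi-square near one" hides a strong dependence on the (effective) dof, so if correlation between groups makes the effective dof differ from n-p, the quoted tail probability can be far off. A sketch of how sharply the tail probability changes with dof at a fixed reduced chi-square, again using the Wilson-Hilferty approximation to the χ2 tail:

```python
import math

def chisq_tail(chisq, dof):
    """Approximate P(X >= chisq) for X ~ chi-square(dof) (Wilson-Hilferty)."""
    z = ((chisq / dof) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * dof))) / math.sqrt(2.0 / (9.0 * dof))
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# The same "acceptable-looking" reduced chi-square of 1.05:
p_small = chisq_tail(1.05 * 50, 50)      # 50 groups: comfortably acceptable
p_large = chisq_tail(1.05 * 5000, 5000)  # 5000 groups: rejected at the 1 per cent level
```

The identical reduced chi-square is an unremarkable fluctuation at 50 degrees of freedom but a firm rejection at 5000, so miscounting the degrees of freedom changes the conclusion, not just the decimal places.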

  1. Mann and Wald (1942), "On the Choice of the Number of Class Intervals in the Application of the Chi-square Test," Annals of Mathematical Statistics, vol. 13, pp. 306-317.
[ArXiv] 4th week, May 2008
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-4th-week-may-2008/
Sun, 01 Jun 2008, by hlee

Eight astro-ph papers and two statistics papers are listed this week. One statistics paper discusses detecting filaments; the other, maximum likelihood estimation from satellite images (clouds).

  • [astro-ph:0805.3532] Balan and Lahav
    ExoFit: Orbital Parameters of Extra-solar Planets from Radial Velocities (MCMC)

  • [astro-ph:0805.3983] R. G. Carlberg et al.
    Clustering of supernova Ia host galaxies (Jackknife method is utilized).

  • [astro-ph:0805.4005] Kurek, Hrycyna, & Szydlowski
    From model dynamics to oscillating dark energy parametrisation (Bayes factor)

  • [astro-ph:0805.4136] C. Genovese et al.
    Inference for the Dark Energy Equation of State Using Type Ia Supernova Data

  • [math.ST:0805.4141] C. Genovese et al.
    On the path density of a gradient field (detecting filaments via kernel density estimation, KDE)

  • [astro-ph:0805.4342] C. Espaillat et al.
    Wavelet Analysis of AGN X-Ray Time Series: A QPO in 3C 273?

  • [astro-ph:0805.4414] Tegmark and Zaldarriaga
    The Fast Fourier Transform Telescope

  • [astro-ph:0805.4417] A. Georgakakis et al.
    A new method for determining the sensitivity of X-ray imaging observations and the X-ray number counts

  • [stat.AP:0805.4598] E. Anderes et al.
    Maximum Likelihood Estimation of Cloud Height from Multi-Angle Satellite Imagery
[ArXiv] 2nd week, Mar. 2008
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-2nd-week-mar-2007/
Fri, 14 Mar 2008, by hlee

Warning! The list is long this week, but diverse. Some are of obvious interest to CHASC.

  • [astro-ph:0803.0997] V. Smolcic et al.
       A new method to separate star forming from AGN galaxies at intermediate redshift: The submillijansky radio population in the VLA-COSMOS survey
  • [astro-ph:0803.1048] T.A. Carroll and M. Kopf
       Zeeman-Tomography of the Solar Photosphere — 3-Dimensional Surface Structures Retrieved from Hinode Observations
  • [astro-ph:0803.1066] M. Beasley et al.
       A 2dF spectroscopic study of globular clusters in NGC 5128: Probing the formation history of the nearest giant Elliptical
  • [astro-ph:0803.1098] Z. Lorenzo
       A new luminosity function for galaxies as given by the mass-luminosity relationship
  • [astro-ph:0803.1199] D. Coe et al.
       LensPerfect: Gravitational Lens Massmap Reconstructions Yielding Exact Reproduction of All Multiple Images (could it be related to GREAT08 Challenge?)
  • [astro-ph:0803.1213] H. Y. Wang et al.
       Reconstructing the cosmic density field with the distribution of dark matter halos
  • [astro-ph:0803.1420] E. Lantz et al.
       Multi-imaging and Bayesian estimation for photon counting with EMCCD’s
  • [astro-ph:0803.1491] Wu, Rozo, & Wechsler
       The Effect of Halo Assembly Bias on Self Calibration in Galaxy Cluster Surveys
  • [astro-ph:0803.1616] P. Mukherjee et al.
       Planck priors for dark energy surveys (some CHASCians would like to check!)
  • [astro-ph:0803.1738] P. Mukherjee and A. R. Liddle
       Planck and reionization history: a model selection view
  • [astro-ph:0803.1814] J. Cardoso et al.
       Component separation with flexible models. Application to the separation of astrophysical emissions
  • [astro-ph:0803.1851] A. R. Marble et al.
        The Flux Auto- and Cross-Correlation of the Lyman-alpha Forest. I. Spectroscopy of QSO Pairs with Arcminute Separations and Similar Redshifts
  • [astro-ph:0803.1857] R. Marble et al.
        The Flux Auto- and Cross-Correlation of the Lyman-alpha Forest. II. Modelling Anisotropies with Cosmological Hydrodynamic Simulations
[ArXiv] 3rd week, Feb. 2008
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-3rd-week-feb-2008/
Mon, 25 Feb 2008, by hlee

It seems I omit papers deserving attention from time to time. If you find one, please leave a message; even better, leave a summary for a separate posting.

Wavelet papers:

  • [astro-ph:0802.2377] J. M. Lilly & S. C. Olhede
       On the Design of Optimal Analytic Wavelets

  • [math.ST:0802.2424] Autin, Le Pennec & Tribouley
       Thresholding methods to estimate the copula density

A statistics paper and astro-ph papers adopted statistical tools:

  • [stat.ME:0802.2155] Guellil & Kernane
       A New Approach of Point Estimation and its Application to Truncated Data Situations

  • [astro-ph:0802.2105] N. Padmanabhan et al.
       The real-space clustering of luminous red galaxies around z<0.6 quasars in the Sloan Digital Sky Survey

  • [astro-ph:0802.2446] Banerjee & Ghosh
       Evolution of Compact-Binary Populations in Globular Clusters: A Boltzmann Study II. Introducing Stochasticity

  • [astro-ph:0802.2944] E. W. Rosolowsky et al.
       Structural Analysis of Molecular Clouds: Dendrograms

  • [astro-ph:0802.3185] G. Efstathiou
       Limitations of Bayesian Evidence Applied to Cosmology

  • [astro-ph:0802.3199] A. A. Mahabal et al.
       Automated Probabilistic Classification of Transients and Variables
[Quote] When all the models are wrong
http://hea-www.harvard.edu/AstroStat/slog/2008/quote-when-all-the-models-are-wrong/
Mon, 18 Feb 2008, by hlee

From page 103 of Bayesian Model Selection and Model Averaging by L. Wasserman (2000), Journal of Mathematical Psychology, 44, pp. 92-107:

… So does it make sense to compare any finite list of models when we do not literally believe any of them? The answer is: sometimes.
  First, we would hope that, while none of the models is exactly correct, at least one is approximately correct. It behooves the data analyst to do common sense exploratory analysis — checking residuals, for example — to make sure that not all of the models are heinously wrong. [---snip---]
  Second, even when all models are wrong, it is useful to consider the relative merits of two models. Newtonian physics and general relativity are both wrong. Yet it makes sense to compare the relative evidence in favor of one or the other. …

Any debate between Newtonian physics and general relativity has been a very grave subject for physicists and philosophers. For today's statisticians, it's just another example.

[ArXiv] 1st week, Feb. 2008
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-1st-week-feb-2008/
Sun, 10 Feb 2008, by hlee

Review papers on Bayesian hierarchical modeling and LAR (least angle regression) appeared in this week's stat arXiv, in addition to interesting astro-ph papers.

A review paper on LASSO and LAR: [stat.ME:0801.0964] T. Hesterberg et al.
   Least Angle and L1 Regression: A Review
Model checking for Bayesian hierarchical modeling: [stat.ME:0802.0743] M. J. Bayarri, M. E. Castellanos
   Bayesian Checking of the Second Levels of Hierarchical Models

  • [astro-ph:0802.0042] Y. Kubo
    Statistical Models for Solar Flare Interval Distribution in Individual Active Regions (it discusses AIC)

  • [astro-ph:0802.0131] J.Bobin, J-L Starck and R. Ottensamer
    Compressed Sensing in Astronomy

  • [astro-ph:0802.0387] J. Gaite
    Geometry and scaling of cosmic voids

  • [astro-ph:0802.0400] R. Vio & P. Andreani
    A Modified ICA Approach for Signal Separation in CMB Maps

  • [astro-ph:0802.0498] V. Balasubramanian, K. Larjo and R. Sheth
    Experimental design and model selection: The example of exoplanet detection

  • [astro-ph:0802.0537] G. Dan, Z. Yanxia, & Z. Yongheng
    Support Vector Machines and Kd-tree for Separating Quasars from Large Survey Databases

[ArXiv] 3rd week, Jan. 2008
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-3rd-week-jan-2008/
Fri, 18 Jan 2008, by hlee

Seven preprints were chosen this week; two mention model selection.

  • [astro-ph:0801.2186] Extrasolar planet detection by binary stellar eclipse timing: evidence for a third body around CM Draconis H.J.Deeg (it discusses model selection in section 4.4)
  • [astro-ph:0801.2156] Modeling a Maunder Minimum A. Brandenburg & E. A. Spiegel (it could be useful for those who do sunspot cycle modeling)
  • [astro-ph:0801.1914] A closer look at the indications of q-generalized Central Limit Theorem behavior in quasi-stationary states of the HMF model A. Pluchino, A. Rapisarda, & C. Tsallis
  • [astro-ph:0801.2383] Observational Constraints on the Dependence of Radio-Quiet Quasar X-ray Emission on Black Hole Mass and Accretion Rate B. C. Kelly et al.
  • [astro-ph:0801.2410] Finding Galaxy Groups In Photometric Redshift Space: the Probability Friends-of-Friends (pFoF) Algorithm I. Li & H. K.C. Yee
  • [astro-ph:0801.2591] Characterizing the Orbital Eccentricities of Transiting Extrasolar Planets with Photometric Observations E. B. Ford, S. N. Quinn, & D. Veras
  • [astro-ph:0801.2598] Is the anti-correlation between the X-ray variability amplitude and black hole mass of AGNs intrinsic? Y. Liu & S. N. Zhang
[ArXiv] 2nd week, Jan. 2008
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-2nd-week-jan-2007/
Fri, 11 Jan 2008, by hlee

It is notable that there is an astronomy paper with AIC, BIC, and Bayesian evidence in its title. The topic of the paper, unexceptionally, is cosmology, like the other astronomy papers that have discussed these (statistical) information criteria. (I found only a couple of papers on model selection applied to astronomical data analysis that do not articulate CMB matters; note that I exclude Bayes factors used for model selection.)

The paper, along with other interesting ones from the week:

  • [astro-ph:0801.0638]
    AIC, BIC, Bayesian evidence and a notion on simplicity of cosmological model M Szydlowski & A. Kurek

  • [astro-ph:0801.0642]
    Correlation of CMB with large-scale structure: I. ISW Tomography and Cosmological Implications S. Ho et al.

  • [astro-ph:0801.0780]
    The Distance of GRB is Independent from the Redshift F. Song

  • [astro-ph:0801.1081]
    A robust statistical estimation of the basic parameters of single stellar populations. I. Method X. Hernandez and D. Valls–Gabaud

  • [astro-ph:0801.1106]
    A Catalog of Local E+A(post-starburst) Galaxies selected from the Sloan Digital Sky Survey Data Release 5 T. Goto (Carefully built catalogs are wonderful sources for classification/supervised learning, or semi-supervised learning)

  • [astro-ph:0801.1358]
    A test of the Poincare dodecahedral space topology hypothesis with the WMAP CMB data B.S. Lew & B.F. Roukema

In cosmology, the few candidate models to be compared are generally nested: a larger model usually has extra terms relative to the smaller ones. How one defines the penalty for those extra terms leads to different model selection criteria. However, astronomy papers in general never discuss the consistency or statistical optimality of these selection criteria; at most they offer Monte Carlo simulations and extensive comparisons across criteria. Nonetheless, my personal thought is that the field of model selection should be promoted among astronomers, to prevent the fallacy of blindly fitting models that may be irrelevant to the information the data set contains. Physics suggests a correct model, but so do the data.
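As a toy illustration of how the choice of penalty decides the outcome (all chi-square values below are invented), AIC and BIC can disagree on the same pair of nested fits:

```python
import math

# For a Gaussian likelihood, chi^2 plays the role of -2 log L up to an additive constant.
def aic(chisq, k):
    return chisq + 2 * k               # penalty: 2 per free parameter

def bic(chisq, k, n):
    return chisq + k * math.log(n)     # penalty: log(n) per free parameter

# Hypothetical nested fits: 2 extra parameters buy a chi^2 drop of 8.
n = 1000
chisq_small, k_small = 1060.0, 3
chisq_large, k_large = 1052.0, 5

aic_prefers_large = aic(chisq_large, k_large) < aic(chisq_small, k_small)
bic_prefers_small = bic(chisq_small, k_small, n) < bic(chisq_large, k_large, n)
```

Here the chi-square drop (8) exceeds the AIC penalty (2 per parameter) but not the BIC penalty (log 1000 ≈ 6.9 per parameter), so the two criteria pick different models; this is exactly the gray zone where the penalty definition, not the data, settles the choice.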

[ArXiv] Post Model Selection, Nov. 7, 2007
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-post-model-selection-nov-7-2007/
Wed, 07 Nov 2007, by hlee

Today's arxiv-stat email included papers by Poetscher and Leeb, who have been working on post model selection inference. Model selection is sometimes mistaken for a part of statistical inference; more simply, it can be considered a step prior to inference. How do you know whether your data come from a chi-square distribution or a gamma distribution? (This is a model selection problem with nested models.) Should I estimate the degrees of freedom k of the chi-square, or α and β of the gamma, to obtain the mean and its error? Will the errors of the mean be the same under both distributions?

Prior to estimating the means and errors of parameters, one wishes to choose a model in which the parameters of interest are properly embedded. The problem is that the same data are used both to choose a model (e.g., the model with the largest likelihood value or Bayes factor) and to perform statistical inference (estimating parameters, calculating confidence intervals, and testing hypotheses), which inevitably introduces bias. Such bias has generally been neglected (a priori knowledge is assumed to dictate the model: e.g., the 2nd order polynomial is the absolute truth and the residuals are realizations of the error term; by the way, how can one be sure that the error follows a normal distribution?). Asymptotically the bias is O(n^m) with m smaller than zero, and estimating it has been popular since Akaike introduced AIC (one of the best known model selection criteria). Numerous works are found in the field of robust penalized likelihood, and variable selection has been a very hot topic in recent decades. Beyond my knowledge, there have been more approaches to coping with this bias so that it does not contaminate the inference results.
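A minimal simulation sketch of this bias (all numbers hypothetical): choose between "mu = 0" and a free mean by an AIC-style rule, then compute a naive 95% interval as if no selection had happened. The post-selection interval undercovers, while the same interval without selection is fine:

```python
import math
import random

random.seed(1)
MU, N, REPS = 0.5, 30, 5000        # true mean, sample size, number of simulated datasets
HALF = 1.96 / math.sqrt(N)         # half-width of the naive 95% interval (sigma = 1, known)
CUT = math.sqrt(2.0 / N)           # AIC keeps the free-mean model when |xbar| exceeds this

naive = post = 0
for _ in range(REPS):
    xbar = sum(random.gauss(MU, 1.0) for _ in range(N)) / N
    if abs(xbar - MU) <= HALF:               # interval from always fitting the free mean
        naive += 1
    est = xbar if abs(xbar) > CUT else 0.0   # AIC-style choice between mu = 0 and free mu
    if abs(est - MU) <= HALF:                # same naive interval, centered on the chosen estimate
        post += 1

naive_coverage = naive / REPS    # close to the nominal 0.95
post_coverage = post / REPS      # visibly below nominal: the selection step biases the inference
```

Whenever the data happen to favor the smaller model, the reported estimate snaps to zero and the naive interval misses the true mean, which is exactly the kind of distortion Poetscher and Leeb study.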

The works of Professors Poetscher and Leeb look unique to me in this line of resolving the intrinsic bias that arises from inference after model selection. Instead of appearing in my weekly arxiv lists, their arxiv papers deserve a separate posting. I have also included some more general references.

The list of paper from today’s arxiv:

  • [stat.TH:0702703] Can one estimate the conditional distribution of post-model-selection estimators? by H. Leeb and B. M. Pötscher
  • [stat.TH:0702781] The distribution of model averaging estimators and an impossibility result regarding its estimation by B. M. Pötscher
  • [stat.TH:0704.1466] Sparse Estimators and the Oracle Property, or the Return of Hodges’ Estimator by H. Leeb and B. M. Poetscher
  • [stat.TH:0711.0660] On the Distribution of Penalized Maximum Likelihood Estimators: The LASSO, SCAD, and Thresholding by B. M. Poetscher, and H. Leeb
  • [stat.TH:0701781] Learning Trigonometric Polynomials from Random Samples and Exponential Inequalities for Eigenvalues of Random Matrices by K. Groechenig, B.M. Poetscher, and H. Rauhut

Other resources:

[Added on Nov.8th] There were a few more relevant papers from arxiv.

  • [stat.AP:0711.0993] Upper bounds on the minimum coverage probability of confidence intervals in regression after variable selection by P. Kabaila and K. Giri
  • [stat.ME:0710.1036] Confidence Sets Based on Sparse Estimators Are Necessarily Large by B. M. Pötscher