The AstroStat Slog » bias
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

[MADS] Law of Total Variance (hlee, Fri, 29 May 2009)

This simple law, despite my attempt at a full-text search, does not show up in ADS. As discussed in the post on systematic errors, astronomers, like physicists, report their error components in two additive terms: statistical error + systematic error. To explain such a decomposition and to make error analysis statistically rigorous, the law of total variance (LTV) seems indispensable.

V[X] = V[E[X|Y]] + E[V[X|Y]]

(X and Y are random variables, and X denotes the observed data. V and E stand for variance and expectation, respectively. Instead of X, a statistic f(X_1,…,X_n) can be plugged in to represent a best fit; in other words, a best fit is a solution of the chi-square minimization, which is a function of the data.) For Bayesians, the uncertainty of theta, the parameter of interest, is

V[theta]=V[E[theta|Y]] + E[V[theta|Y]]

Suppose Y is related to the systematics. E[theta|Y] is a function of Y, so V[E[theta|Y]] indicates the systematic error. V[theta|Y] is the statistical error given Y, which reflects the fact that unless the parameter of interest and the systematics are independent, the statistical error cannot be quantified as a single factor attached to a best fit. If the parameter of interest theta is independent of Y and Y is fixed, then the uncertainty in theta comes solely from statistical uncertainty (let’s not consider “model uncertainty” for the time being).
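
As a quick sanity check, here is a minimal IDL sketch (all values invented for illustration) that verifies the LTV numerically for a toy model in which the systematics Y shift the observation X:

; Monte Carlo check of the law of total variance (toy model):
; Y ~ N(0,1) acts as a systematic shift, and X|Y ~ N(Y, sigma^2)
nsim = 200000L
sigma = 2.0                            ; statistical scatter given Y
y = randomn(seed, nsim)                ; "systematics" variable
x = y + sigma*randomn(seed, nsim)      ; observed data X given Y
; here E[X|Y]=Y and V[X|Y]=sigma^2, so V[E[X|Y]]=V[Y]=1 and
; E[V[X|Y]]=sigma^2; both sides should agree to MC accuracy
print, 'V[X]                = ', variance(x)
print, 'V[E[X|Y]]+E[V[X|Y]] = ', variance(y) + sigma^2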

In conjunction with astronomers’ decomposition into systematic and statistical errors, or with representing uncertainties in quadrature (error_total^2 = error_stat^2 + error_sys^2), statisticians use the mean squared error (MSE) as the total error, in which the variance matches the statistical error and the squared bias matches the systematic error:

MSE = Variance + Bias^2
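
For completeness, the standard one-line derivation of this decomposition, obtained by adding and subtracting E[\hat\theta] inside the square (the cross term vanishes because E[\hat\theta - E\hat\theta] = 0):

MSE(\hat\theta) = E[(\hat\theta - \theta)^2] = E[(\hat\theta - E\hat\theta)^2] + (E\hat\theta - \theta)^2 = Var(\hat\theta) + Bias(\hat\theta)^2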

Now the question arises: is systematic error bias? Methods based on quadrature, or on parameterizing the systematics for marginalization, treat systematic error as bias, although no account says so explicitly. According to the law of total variance, unless the components are orthogonal/independent, quadrature is not a proper way to handle the systematic uncertainties prevailing in all instruments. Generally, parameters (data) and systematics are nonlinearly correlated and hard to factorize (instrument-specific empirical studies exist that offer correction factors for systematics; however, such factors work only in specific cases, and the process of defining correction factors is hard to generalize). Because of the varying nature of systematics over the parameter space, instead of the MSE,

MISE = E \int [\hat f(x) - f(x)]^2 dx

or the mean integrated squared error, might be of use. The estimate \hat f(x) of f(x) can be obtained either parametrically or nonparametrically while incorporating the systematics and their correlation structure with the statistical errors as functions of a certain domain x. MISE can be viewed as a robust version of the chi-square methods, but the details have not been explored to account for the following identity:

MISE = \int [E\hat f(x) - f(x)]^2 dx + \int Var[\hat f(x)] dx

This equation may or may not look simple. Perhaps the expansion of the above identity explains the error decomposition better:

MISE(\hat f) = E \int [\hat f(x) - f(x)]^2 dx = \int E[\hat f(x) - f(x)]^2 dx = \int MSE_x(\hat f) dx
= \int [E\hat f(x) - f(x)]^2 dx + \int Var[\hat f(x)] dx
= integrated squared bias + integrated variance (overall systematic error + overall statistical error)

Furthermore, it robustly characterizes uncertainties from systematics, i.e., calibration uncertainties, in data analysis. Note that estimating f(x) via \hat f(x) can reflect complex structures in the uncertainty analysis, whereas chi-square minimization estimates f(x) via piecewise constant (horizontal) lines, assumes homogeneous error within each piece (bin), forces statistical and systematic errors to be orthogonal, and, as a consequence, inflates the size of the error or produces biased best fits.

Whether one uses the LTV, the MSE, or the MISE, we do not know the true model f(x); and if it is unknown, assessing statistical results such as confidence levels/intervals may not be feasible. The reason that chi-square methods offer best fits and N-sigma error bars is that they assume the true model is Gaussian, i.e., N(f(x), \sigma^2), or E(Y|X) = f(X) + \epsilon with V(\epsilon) = \sigma^2, where f(x) is a source model. On the other hand, Monte Carlo simulations, resampling methods like the bootstrap, and posterior predictive probability (ppp) allow one to infer the truth nonparametrically, from which one can evaluate a p-value indicating one’s confidence in the result of the fitting analysis. Setting up proper models for \hat f(x) or theta|Y would help assess the total error more realistically than chi-square minimization, additive errors, errors added in quadrature, or subjective expertise on systematics. The underlying notions and related theoretical statistics methodologies of the LTV, MSE, and MISE could clarify questions like how to quantify systematic errors and how systematic uncertainties relate to statistical uncertainties. Nothing would make me and astronomers happier than if those errors were independent and additive, and happier still if the systematic uncertainty could be factorized.

Poisson vs Gaussian (vlk, Thu, 09 Apr 2009)

We astronomers are rather fond of approximating our counting statistics with Gaussian error distributions, and a lot of ink has been spilled justifying and/or denigrating this habit. But just how bad is the approximation anyway?

I ran a simple Monte Carlo based test to compute the expected bias between a Poisson sample and the “equivalent” Gaussian sample. The result is shown in the plot below.

The jagged red line is the fractional expected bias relative to the true intensity. The typical recommendation in high-energy astronomy is to bin up events until there are about 25 or so counts per bin. This leads to an average bias of about 2% in the estimate of the true intensity. The bias drops below 1% for counts >50. The smooth blue line is the reciprocal of the square-root of the intensity, reflecting the width of the Poisson distribution relative to the true intensity, and is given here only for illustrative purposes.

[Figure: Poisson-Gaussian bias]

Exemplar IDL code that can be used to generate this kind of plot is appended below:

nlam=100L & nsim=20000L
lam=indgen(nlam)+1                        ; true intensities 1..100
sct=intarr(nlam,nsim) & scg=sct           ; Poisson and Gaussian counts
dct=fltarr(nlam)                          ; fractional bias per intensity
; Poisson samples at each intensity
for i=0L,nlam-1L do sct[i,*]=randomu(seed,nsim,poisson=lam[i])
; "equivalent" Gaussian samples, N(lam,lam); note that storing them
; in an integer array truncates them to whole counts
for i=0L,nlam-1L do scg[i,*]=randomn(seed,nsim)*sqrt(lam[i])+lam[i]
; average difference between the two samples, relative to the truth
for i=0L,nlam-1L do dct[i]=mean(sct[i,*]-scg[i,*])/(lam[i])
plot,lam,dct,/yl,yticklen=1,ygrid=1       ; the jagged bias curve
oplot,lam,1./sqrt(lam)                    ; smooth 1/sqrt(lam) reference

Lost in Translation: Measurement Error (vlk, Sat, 03 Jan 2009)

You would think that something like “measurement error” is a well-defined concept and that everyone knows what it means. Not so. I have so far counted at least three different interpretations of what it means.

Suppose you have measurements X={Xi, i=1..N} of a quantity whose true value is, say, X0. One can then compute the mean and standard deviation of the measurements, E(X) and σX. One can also infer the value of a parameter θ(X), derive the posterior probability density p(θ|X), and obtain confidence intervals on it.

So here are the different interpretations:

  1. Measurement error is σX, or the spread in the measurements. Astronomers tend to use the term in this manner.
  2. Measurement error is X0-E(X), or the “error made when you make the measurement”, essentially what is left over beyond mere statistical variations. This is how statisticians seem to use it, essentially the bias term. To quote David van Dyk:

    For us it is just English. If your measurement is different from the real value. So this is not the Poisson variability of the source for effects or ARF, RMF, etc. It would disappear if you had a perfect measuring device (e.g., telescope).

  3. Measurement error is the width of p(θ|X), i.e., the measurement error of the first type propagated through the analysis. Astronomers use this too to refer to measurement error.
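
A toy IDL sketch may make the contrast between interpretations 1 and 2 concrete (all numbers invented for illustration): a device with both random scatter and a miscalibration offset.

; interpretation 1 is the scatter (sigma); interpretation 2 is the
; leftover offset X0 - E(X), which a perfect device would eliminate
nmeas = 10000L
x0  = 5.0                              ; true value
b   = 0.7                              ; miscalibration offset
sig = 2.0                              ; random measurement scatter
x = x0 + b + sig*randomn(seed, nmeas)
print, 'interpretation 1 (spread): ', stddev(x)
print, 'interpretation 2 (bias)  : ', x0 - mean(x)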

Who am I to say which is right? But be aware of who you may be speaking with and be sure to clarify what you mean when you use the term!

my first AAS. V. measurement error and EM (hlee, Fri, 20 Jun 2008)

While discussing different viewpoints on the term “clustering,” one of my conversation partners led me to his colleague’s poster. This poster (I don’t remember its title or abstract) was my favorite of all the posters at the meeting.

He rewrote the EM algorithm to include measurement errors in redshifts. Parameters indexed by redshift, along with the corresponding standard deviations (the measurement errors, treated as nuisance parameters), were included in the likelihood function; this corrected the bias and clearly manifested the bimodality in the luminosity functions at the different evolutionary stages.

I encouraged him to talk to statisticians to characterize and generalize his measurement-error likelihoods and to optimize his EM algorithm. Because of algebraic approximations and the many parameters arising from the redshift measurement errors, assumptions and constraints were imposed rather heavily, and I thought a collaboration with statisticians would help get around those constraints and generalize his measurement-error likelihood.

Eddington versus Malmquist (vlk, Thu, 13 Mar 2008)

During the runup to his recent talk on logN-logS, Andreas mentioned how people are sometimes confused about the variety of statistical biases that afflict surveys. They usually know what the biases are, but often tend to mislabel them, especially the Eddington and Malmquist types. Sort of like using “your” and “you’re” interchangeably, which to me is like nails on a blackboard. So here’s a brief summary:

Eddington Bias: What you get because of statistical fluctuations in the measurement (Eddington 1913). A set of sources with a single luminosity will, upon observation, be spread out due to measurement error. When you have two sets of sources with different luminosities, the observed distribution will overlap. If there are more objects of one luminosity than the other, you are in danger of misunderestimating the fraction in that set because more of those “scatter” into the other’s domain than the reverse. Another complication — if the statistical scatter bumps up against some kind of detection threshold, then the inferred luminosity based on only the detected sources will end up being an overestimate.
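
To make the threshold effect concrete, here is a toy IDL sketch (flux, scatter, and threshold values are invented for illustration): sources with a single true flux are scattered by measurement error, and the mean of only the detected sources overestimates the truth.

; one true flux, Gaussian measurement scatter, and a detection
; threshold; the mean of the detected sources is biased high
nsrc   = 100000L
strue  = 10.0                          ; true source flux
sigma  = 3.0                           ; measurement error
thresh = 9.0                           ; detection threshold
sobs = strue + sigma*randomn(seed, nsrc)
idet = where(sobs gt thresh, ndet)
print, 'true flux             = ', strue
print, 'mean of all measured  = ', mean(sobs)
print, 'mean of detected only = ', mean(sobs[idet])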

Malmquist Bias: What you get because you can see brighter sources out to larger distances. This means that if your survey is flux-limited (as most are), the intrinsically brighter sources will appear to be more numerous than they ought to be, because you are seeing them in a larger volume. This is the reason, for instance, that there are 10 times more A stars than M stars in the SAO catalog. It is a statistical effect only in the sense that a “true” dataset is filtered by a detectability threshold. Anyone working with volume-limited samples does not need to worry about this at all.
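
And a matching toy IDL sketch for Malmquist (numbers invented): two classes with equal true space density but different luminosities, placed uniformly in a unit sphere; a flux limit makes the bright class look far more numerous.

; half the sources have L=1, half have L=100, at equal space density
nsrc = 200000L
r = (randomu(seed, nsrc))^(1./3.)      ; uniform density in the sphere
lum = fltarr(nsrc) + 1.0
lum[nsrc/2:*] = 100.0                  ; faint half, bright half
flux = lum/(4.*!pi*r^2)
det = where(flux gt 10.0, ndet)        ; flux-limited "survey"
nbright = total(lum[det] eq 100.)
nfaint  = total(lum[det] eq 1.)
print, 'detected bright:faint ratio = ', nbright/(nfaint > 1.)
; Euclidean expectation: (100/1)^1.5 = 1000, despite equal densities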

[ArXiv] Post Model Selection, Nov. 7, 2007 (hlee, Wed, 07 Nov 2007)

Today’s arxiv-stat email included papers by Poetscher and Leeb, who have been working on post-model-selection inference. Model selection is sometimes mistaken for a part of statistical inference; more properly, it can be considered a step prior to inference. How do you know your data come from a chi-square distribution rather than a gamma distribution? (This is a model selection problem with nested models.) Should I estimate the degrees of freedom k of the chi-square, or α and β of the gamma, to obtain the mean and its error? Will the errors of the mean be the same for both distributions?

Prior to estimating the means and errors of parameters, one wishes to choose a model in which the parameters of interest are properly embedded. The problem is that one uses the same data to choose a model (e.g., choosing the model with the largest likelihood or Bayes factor) and to perform statistical inference (estimating parameters, calculating confidence intervals, and testing hypotheses), which inevitably introduces bias. Such bias has been neglected in general, as if a priori knowledge told us which model to choose (e.g., the 2nd-order polynomial is the absolute truth and the residuals are realizations of the error term; by the way, how can one be sure that the error follows a normal distribution?). Asymptotics makes this bias O(n^m) with m < 0. Estimating this bias has been popular since Akaike introduced the AIC (one of the best known model selection criteria), and numerous works are found in the field of robust penalized likelihood; variable selection has been a very hot topic in recent decades. Beyond my knowledge, there may be more approaches for coping with this bias so that it does not contaminate the inference results.
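
A toy IDL sketch of the effect (the setup is invented for illustration): pre-test the simpler “mu=0” model at roughly 2 sigma, then report the usual estimator from the selected model; the resulting estimator is biased when the true mean is small but nonzero.

nsim = 100000L
n = 25L
mu = 0.3                               ; true mean (unit variance data)
muhat = fltarr(nsim)                   ; post-selection estimator
for i=0L, nsim-1L do begin
  xbar = mean(mu + randomn(seed, n))
  ; keep the "mu=0" model unless it is rejected by the pre-test
  if abs(xbar)*sqrt(n) gt 2.0 then muhat[i] = xbar
endfor
print, 'true mu       = ', mu
print, 'mean of muhat = ', mean(muhat), '   (biased by the pre-test)'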

The works by Professors Poetscher and Leeb struck me as unique in this line of resolving the intrinsic bias arising from inference after model selection. Instead of being buried in my weekly arxiv lists, their arxiv papers deserve a separate posting. I have also included some more general references.

The list of papers from today’s arxiv:

  • [stat.TH:0702703] Can one estimate the conditional distribution of post-model-selection estimators? by H. Leeb and B. M. Pötscher
  • [stat.TH:0702781] The distribution of model averaging estimators and an impossibility result regarding its estimation by B. M. Pötscher
  • [stat.TH:0704.1466] Sparse Estimators and the Oracle Property, or the Return of Hodges’ Estimator by H. Leeb and B. M. Pötscher
  • [stat.TH:0711.0660] On the Distribution of Penalized Maximum Likelihood Estimators: The LASSO, SCAD, and Thresholding by B. M. Pötscher and H. Leeb
  • [stat.TH:0701781] Learning Trigonometric Polynomials from Random Samples and Exponential Inequalities for Eigenvalues of Random Matrices by K. Gröchenig, B. M. Pötscher, and H. Rauhut

Other resources:

[Added on Nov. 8th] There were a few more relevant papers from arxiv:

  • [stat.AP:0711.0993] Upper bounds on the minimum coverage probability of confidence intervals in regression after variable selection by P. Kabaila and K. Giri
  • [stat.ME:0710.1036] Confidence Sets Based on Sparse Estimators Are Necessarily Large by B. M. Pötscher
Coverage issues in exponential families (hlee, Thu, 16 Aug 2007)

I’ve heard so much about coverage problems from the astrostat/phystat groups, without knowing the fundamental reasons (most likely physics). This paper might be of interest to those: Interval Estimation in Exponential Families by Brown, Cai, and DasGupta; Statistica Sinica (2003), 13, pp. 19-49.

Abstract summary:
The authors investigated issues in interval estimation of the mean in the exponential family, covering the binomial, Poisson, negative binomial, normal, gamma, and NEF-GHS distributions. The poor performance of the Wald interval has been known not only for discrete cases but also for nonnormal continuous cases, with significant negative bias. Their computations suggest that the equal-tailed Jeffreys interval and the likelihood ratio interval are the best alternatives to the Wald interval.

Brief summary of the paper without equations:
The objective of the paper is interval estimation of the mean in the natural exponential family (NEF) with quadratic variance functions (QVF), with particular focus on the discrete NEF-QVF families: the binomial, negative binomial, and Poisson distributions. It is well known that the Wald interval for a binomial proportion suffers from a systematic negative bias and from oscillation in its coverage probability, even for large n and p near 0.5, which seems to arise from the lattice nature and the skewness of the binomial distribution. They exemplify this systematic bias and oscillation with Poisson cases to illustrate the poor and erratic behavior of the Wald interval in lattice problems. They derive bias expressions for the three discrete NEF-QVF distributions and add a disconcerting graphical illustration of this negative bias.

Interested readers should check figure 4, where the performances of the Wald, score, likelihood ratio (LR), and Jeffreys intervals are compared, and figure 5, which illustrates the limits of those four intervals (the LR and Jeffreys intervals are indistinguishable). The coverage probabilities of the four intervals are derived via Edgeworth expansions, and the nonoscillating O(n^-1) terms of the expansions are studied to compare the coverage properties. Figure 6 shows that the Wald interval has a serious negative bias, whereas the nonoscillating term of the score interval is positive for all three distributions: binomial, negative binomial, and Poisson. The negative bias of the Wald interval is also found in continuous distributions like the normal, gamma, and NEF-GHS distributions (figure 7).

In conclusion, they reconfirm that the LR and Jeffreys intervals are the best alternatives to the Wald interval in terms of both the negative bias in coverage and the length. The Rao score interval has the merit of easy presentation, but its performance is inferior to the LR and Jeffreys intervals, although it is better than the Wald interval. Still, the authors leave room for users: choosing among these intervals remains a personal choice.

[Addendum] I wonder if the statistical properties of Gehrels’ confidence limits have been studied since that publication. I’ll try to post findings about the statistics of the Gehrels confidence limits shortly (hopefully).

Astrostatistics: Goodness-of-Fit and All That! (hlee, Wed, 15 Aug 2007)

During the International X-ray Summer School, as a project presentation, I tried to explain the inadequate practice of χ^2 statistics in astronomy. If your best fit is biased (any misidentification of a model easily causes such bias), do not use χ^2 statistics to get a 1σ error with a 68% chance of capturing the true parameter.

Later, I decided to investigate the subject further, and this paper came along: Astrostatistics: Goodness-of-Fit and All That! by Babu and Feigelson.

First, the authors point out that the χ^2 method 1) is inappropriate when errors are non-Gaussian, 2) does not provide clear decision procedures between models with different numbers of parameters or between acceptable models, and 3) makes it difficult to obtain confidence intervals on parameters when complex correlations between the parameters are present. As a remedy to the χ^2 method, they introduce distribution-free tests, such as the Kolmogorov-Smirnov (K-S), Cramer-von Mises (C-vM), and Anderson-Darling (A-D) tests. Among these distribution-free tests, the K-S test is well known to astronomers, but it is often ignored that the results of these tests become unreliable when the data come from a multivariate distribution. Furthermore, the K-S test fails when the same data set is used both for parameter estimation and for computing the empirical distribution function.

The authors propose resampling schemes to overcome these shortcomings, presenting both parametric and nonparametric bootstrap methods, and advance to model comparison, particularly for models that are not nested. The best-fit model can then be chosen among the candidate models based on their distances (e.g., Kullback-Leibler distance) to the unknown hypothetical true model.
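
A hedged IDL sketch of the parametric bootstrap idea for a K-S style test when the model parameters are estimated from the same data (a Gaussian model with invented numbers; gaussint is the standard IDL normal CDF):

n = 200L & nboot = 1000L
x = 2.0 + 1.5*randomn(seed, n)         ; "observed" data
mu0 = mean(x) & sig0 = stddev(x)       ; fit the model to the data
i = findgen(n)
s = x[sort(x)]
F = gaussint((s - mu0)/sig0)
ks0 = max(abs((i+1.)/n - F) > abs(i/n - F))   ; observed K-S distance
ksb = fltarr(nboot)
for b=0L, nboot-1L do begin
  d = mu0 + sig0*randomn(seed, n)      ; simulate from the fitted model
  mu = mean(d) & sig = stddev(d)       ; refit on each simulated sample
  sd = d[sort(d)]
  F = gaussint((sd - mu)/sig)
  ksb[b] = max(abs((i+1.)/n - F) > abs(i/n - F))
endfor
print, 'bootstrap p-value = ', total(ksb ge ks0)/float(nboot)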

All your bias are belong to us (vlk, Mon, 04 Jun 2007)

Leccardi & Molendi (2007) have a paper in A&A (astro-ph/0705.4199) discussing the biases in parameter estimation when spectral fitting is confronted with low-counts data. Not surprisingly, they find that the bias is higher for lower counts, for standard chisq compared to C-stat, and for grouped data compared to ungrouped. Peter Freeman talked about something like this at the 2003 X-ray Astronomy School at Wallops Island (pdf1, pdf2), and no doubt part of the problem also has to do with the (un)reliability of the fitting process when the chisq surface gets complicated.

Anyway, they propose an empirical method to reduce the bias by computing the probability distribution functions (pdfs) for various simulations, and then averaging the pdfs in groups of 3. Seems to work, for reasons that escape me completely.

[Update: links to Peter's slides corrected]
