The AstroStat Slog » Fitting

Everybody needs crampons

vlk — Fri, 30 Apr 2010 16:12:36 +0000

Sherpa is a fitting environment in which Chandra data (and really, X-ray data from any observatory) can be analyzed. It has just undergone a major update and now runs on python. Or allows python to run. Something like that. It is a very powerful tool, but I can never remember how to use it, and I have an amazing knack for not finding what I need in the documentation. So here is a little cheat sheet (which I will keep updating ~~as and when~~ if I learn more):

2010-apr-30: Aneta has setup a blogspot site to deal with simple Sherpa techniques and tactics: http://pysherpa.blogspot.com/

On Help:

In general, to get help, use: ahelp "something" (note the quotes)
Even more useful, type: ? wildcard to get a list of all commands that include the wildcard
You can also do a form of autocomplete: type TAB after writing half a command to get a list of all possible completions.

Data I/O:

To read in your PHA file, use: load_pha()
Often for Chandra spectra, the background is included in that same file. In any case, to read it in separately, use: load_bkg()
- Q: should it be loaded in to the same dataID as the source?
- A: Yes.
- A: When the background counts are present in the same file, they can be read in separately and assigned to the background via set_bkg('src',get_data('bkg')), so counts from a different file can be assigned as background to the current spectrum.
To read in the corresponding ARF, use: load_arf()
- Q: load_bkg_arf() for the background — should it be done before or after load_bkg(), or does it matter?
- A: does not matter
To read in the corresponding RMF, use: load_rmf()
- Q: load_bkg_rmf() for the background, and same question as above
- A: same answer as above; does not matter.
To see the structure of the data, type: print(get_data()) and print(get_bkg())
To select a subset of channels to analyze, use: notice_id()
To subtract background from source data, use: subtract()
To not subtract, to undo the subtraction, etc., use: unsubtract()
To plot useful stuff, use: plot_data(), plot_bkg(), plot_arf(), plot_model(), plot_fit(), etc.
(Q: how in god’s name does one avoid plotting those damned error bars? I know error bars are necessary, but when I have a spectrum with 8192 bins, I don’t want it washed out with silly 1-sigma Poisson bars. And while we are asking questions, how do I change the units on the y-axis to counts/bin? A: rumors say that plot_data(1,yerr=0) should do the trick, but it appears to be still in the development version.)

Fitting:

To fit model to the data, command it to: fit()
To get error bars on the fit parameters, use: projection() (or covar(), but why deliberately use a function that is guaranteed to underestimate your error bars?)
Defining models appears to be much easier now. You can use syntax like: set_source(ModelName.ModelID+AnotherModel.ModelID2) (where you can distinguish between different instances of the same type of model using the ModelID — e.g., set_source(xsphabs.abs1*powlaw1d.psrc+powlaw1d.pbkg))
To see what the model parameter values are, type: print(get_model())
To change statistic, use: set_stat() (options are various chisq types, cstat, and cash)
To change the optimization method, use: set_method() (options are levmar, moncar, neldermead, simann, simplex)

Timestamps:
v1:2007-dec-18
v2:2008-feb-20
v3:2010-apr-30

From Terence’s stuff: You want proof?

hlee — Mon, 21 Dec 2009 00:27:30 +0000

Please, IMS Bulletin, v.38 (10) check p.11 of this pdf file for the whole article.

It is widely believed that under some fairly general conditions, MLEs are consistent, asymptotically normal, and efficient. Stephen Stigler has elegantly documented some of Fisher’s troubles when he wanted a proof. You want proof? Of course you can pile on assumptions so that the proof is easy. If checking your assumptions in any particular case is harder than checking the conclusion in that case, you will have joined a great tradition.
I used to think that efficiency was a thing for the theorists (I can live with inefficiency), that normality was a thing of the past (we can simulate), but that—in spite of Ralph Waldo Emerson—consistency is a thing we should demand of any statistical procedure. Not any more. These days we can simulate in and around the conditions of our data, and learn whether a novel procedure behaves as it should in that context. If it does, we might just believe the results of its application to our data. Other people’s data? That’s their simulation, their part of the parameter space, their problem. Maybe some theorist will take up the challenge, and study the procedure, and produce something useful. But if we’re still waiting for that with MLEs in general (canonical exponential families are in good shape), I wouldn’t hold my breath for this novel procedure. By the time a few people have tried the new procedure, each time checking its suitability by simulation in their context, we will have built up a proof by simulation. Shocking? Of course.
Some time into my career as a statistician, I noticed that I don’t check the conditions of a theorem before I use some model or method with a set of data. I think in statistics we need derivations, not proofs. That is, lines of reasoning from some assumptions to a formula, or a procedure, which may or may not have certain properties in a given context, but which, all going well, might provide some insight. The evidence that this might be the case can be mathematical, not necessarily with epsilon-delta rigour, simulation, or just verbal. Call this “a statistician’s proof ”. This is what I do these days. Should I be kicked out of the IMS?

After reading many astronomy literature, I develop a notion that astronomers like to use the maximum likelihood as a robust alternative to the chi-square minimization for fitting astrophysical models with parameters. I’m not sure it is truly robust because not many astronomy paper list assumptions and conditions for their MLEs.

Often I got confused with their target parameters. They are not parameters in statistical models. They are not necessarily satisfy the properties of probability theory. I often fail to find statistical properties of these parameters for the estimation. It is rare checking statistical modeling procedures with assumptions described by Prof. Speed. Even derivation is a bit short to be called “rigorous statistical analysis.” (At least I wish to see a sentence that “It is trivial to derive the estimator with this and that properties”).

Common phrases I confronted from astronomical literature is that authors’ strategy is statistically rigorous, superior, or powerful without showing why and how it is rigorous, superior, or powerful. I tried to convey these pitfalls and general restrictions in their employed statistical methods. Their strategy is not “statistically robust” nor “statistically powerful” nor “statistically rigorous.” Statisticians have own measures of “superiority” to discuss the improvement in their statistics, analysis strategies, and methodology.

It has not been easy since I never intend to case specific fault picking every time I see these statements. A method believed to be robust can be proven as not a robust method with your data and models. By simulations and derivations with the sufficient description of conditions, your excellent method can be presented with statistical rigors.

Within similar circumstances for statistical modeling and data analysis, there’s a trade off between robustness and conditions among statistical methodologies. Before stating a particular method adopted is robust or rigid, powerful or insensitive, efficient or inefficient, and so on; derivation, proof, or simulation studies are anticipated to be named the analysis and procedure is statistically excellent.

Before it gets too long, I’d like say that statistics have traditions for declaring working methods via proofs, simulations, or derivations. Each has their foundations: assumptions and conditions to be stated as “robust”, “efficient”, “powerful”, or “consistent.” When new statistics are introduced in astronomical literature, I hope to see some additional effort of matching statistical conditions to the properties of target data and some statistical rigor (derivations or simulations) prior to saying they are “robust”, “powerful”, or “superior.”

From Quantile Probability and Statistical Data Modeling

hlee — Sat, 21 Nov 2009 10:06:24 +0000

by Emanuel Parzen in Statistical Science 2004, Vol 19(4), pp.652-662 JSTOR

I teach that statistics (done the quantile way) can be simultaneously frequentist and Bayesian, confidence intervals and credible intervals, parametric and nonparametric, continuous and discrete data. My first step in data modeling is identification of parametric models; if they do not fit, we provide nonparametric models for fitting and simulating the data. The practice of statistics, and the modeling (mining) of data, can be elegant and provide intellectual and sensual pleasure. Fitting distributions to data is an important industry in which statisticians are not yet vendors. We believe that unifications of statistical methods can enable us to advertise, “What is your question? Statisticians have answers!”

I couldn’t help liking this paragraph because of its bitter-sweetness. I hope you appreciate it as much as I did.

The chance that A has nukes is p%

hlee — Fri, 23 Oct 2009 17:26:07 +0000

I watched a movie in which one of the characters said, “country A has nukes with 80% chance” (perhaps, not 80% but it was a high percentage). One of the statements in that episode is that people will not eat lettuce only if the 1% chance of e coli is reported, even lower. Therefore, with such a high percentage of having nukes, it is right to send troops to A. This episode immediately brought me a thought about astronomers’ null hypothesis probability and their ways of concluding chi-square goodness of fit tests, likelihood ratio tests, or F-tests.

First of all, I’d like to ask how you would like to estimate the chance of having nukes in a country? What this 80% implies here? But, before getting to the question, I’d like to discuss computing the chance of e coli infection, first.

From the frequentists perspective, computing the chance of e coli infection is investigating sample of lettuce and counts species that are infected: n is the number of infected species and N is the total sample size. 1% means one among 100. Such percentage reports and their uncertainties are very familiar scene during any election periods for everyone. From Bayesian perspective, Pr(p|D) ~ L(D|p) pi(p), properly choosing likelihoods and priors, one can estimate the chance of e coli infection and uncertainty. Understanding of sample species and a prior knowledge helps to determine likelihoods and priors.

How about the chance that country A has nukes? Do we have replicates of country A so that a committee investigate each country and count ones with nukes to compute the chance? We cannot do that. Traditional frequentist approach, based on counting, does not work here to compute the chance. Either using fiducial likelihood approach or Bayesian approach, i.e. carefully choosing a likelihood function adequately (priors are only for Bayesian) allows one to compuate such probability of interest. In other words, those computed chances highly depend on the choice of model and are very subjective.

So, here’s my concern. It seems like that astronomers want to know the chance of their spectral data being described by a model (A*B+C)*D (each letter stands for one of models such as listed in Sherpa Models). This is more like computing the chance of having nukes in country A, not counting frequencies of the event occurrence. On the other hand, p-value from goodness of fit tests, LRTs, or F-tests is a number from the traditional frequentists’ counting approach. In other words, p-value accounts for, under the null hypothesis (the (A*B+C)*D model is the right choice so that residuals are Gaussian), how many times one will observe the event (say, reduced chi^2 >1.2) if the experiments are done N times. The problem is that we only have one time experiment and that one spectrum to verify the (A*B+C)*D is true. Goodness of fit or LRT only tells the goodness or the badness of the model, not the statistically and objectively quantified chance.

In order to know the chance of the model (A*B+C)*D, like A has nuke with p%, one should not rely on p-values. If you have multiple models, one could compute pairwise relative chances i.e. odds ratios, or Bayes factors. However this does not provide the uncertainty of the chance (astronomers have the tendency of reporting uncertainties of any point estimates even if the procedure is statistically meaningless and that quantified uncertainty is not statistical uncertainty, as in using delta chi^2=1 to report 68% confidence intervals). There are various model selection criteria that cater various conditions embedded in data to make a right model choice among other candidate models. In addition, post-inference for astronomical models is yet a very difficult problem.

In order to report the righteous chance of (A*B+C)*D requires more elaborated statistical modeling, always brings some fierce discussions between frequentists and Bayesian because of priors and likelihoods. Although it can be very boring process, I want astronomers to leave the problem to statisticians instead of using inappropriate test statistics and making creative interpretation of statistics.

Please, keep this question in your mind when you report probability: what kind of chance are you computing? The chance of e coli infection? Or the chance that A has nukes? Make sure to understand that p-values from data analysis packages does not tell you that the chance the model (A*B+C)*D is (one minus p-value)%. You don’t want to report one minus p-value from a chi-square test statistic as the chance that A has nukes.

Scatter plots and ANCOVA

hlee — Thu, 15 Oct 2009 23:46:14 +0000

Astronomers rely on scatter plots to illustrate correlations and trends among many pairs of variables more than any scientists^[1]. Pages of scatter plots with regression lines are often found from which the slope of regression line and errors bars are indicators of degrees of correlation. Sometimes, too many of such scatter plots makes me think that, overall, resources for drawing nice scatter plots and papers where those plots are printed are wasted. Why not just compute correlation coefficients and its error and publicize the processed data for computing correlations, not the full data, so that others can verify the computation results for the sake of validation? A couple of scatter plots are fine but when I see dozens of them, I lost my focus. This is another cultural difference.

When having many pairs of variables that demands numerous scatter plots, one possibility is using parallel coordinates and a matrix of correlation coefficients. If Gaussian distribution is assumed, which seems to be almost all cases, particularly when parametrizing measurement errors or fitting models of physics, then error bars of these coefficients also can be reported in a matrix form. If one considers more complex relationships with multiple tiers of data sets, then one might want to check ANCOVA (ANalysis of COVAriance) to find out how statisticians structure observations and their uncertainties into a model to extract useful information.

I’m not saying those simple examples from wikipedia, wikiversity, or publicly available tutorials on ANCOVA are directly applicable to statistical modeling for astronomical data. Most likely not. Astrophysics generally handles complicated nonlinear models of physics. However, identifying dependent variables, independent variables, latent variables, covariates, response variables, predictors, to name some jargon in statistical model, and defining their relationships in a rather comprehensive way as used in ANCOVA, instead of pairing variables for scatter plots, would help to quantify relationships appropriately and to remove artificial correlations. Those spurious correlations appear frequently because of data projection. For example, datum points on a circle on the XY plane of the 3D space centered at zero, when seen horizontally, look like that they form a bar, not a circle, producing a perfect correlation.

As a matter of fact, astronomers are aware of removing these unnecessary correlations via some corrections. For example, fitting a straight line or a 2nd order polynomial for extinction correction. However, I rarely satisfy with such linear shifts of data with uncertainty because of changes in the uncertainty structure. Consider what happens when subtracting background leading negative values, a unrealistic consequence. Unless probabilistically guaranteed, linear operation requires lots of care. We do not know whether residuals y-E(Y|X=x) are perfectly normal only if μ and σs in the gaussian density function can be operated linearly (about Gaussian distribution, please see the post why Gaussianity? and the reference therein). An alternative to the subtraction is linear approximation or nonparametric model fitting as we saw through applications of principle component analysis (PCA). PCA is used for whitening and approximating nonlinear functional data (curves and images). Taking the sources of uncertainty and their hierarchical structure properly is not an easy problem both astronomically and statistically. Nevertheless, identifying properties of the observed both from physics and statistics and putting into a comprehensive and structured model could help to find out the causality^[2] and the significance of correlation, better than throwing numerous scatter plots with lines from simple regression analysis.

In order to understand why statisticians studied ANCOVA or, in general, ANOVA (ANalysis Of VAriance) in addition to the material in wiki:ANCOVA, you might want to check this page^[3] and to utilize your search engine with keywords of interest on top of ANCOVA to narrow down results.

From the linear model perspective, if a response is considered to be a function of redshift (z), then z becomes a covariate. The significance of this covariate in addition to other factors in the model, can be tested later when one fully fit/analyze the statistical model. If one wants to design a model, say rotation speed (indicator of dark matter occupation) as a function of redshift, the degrees of spirality, and the number of companions – this is a very hypothetical proposal due to my lack of knowledge in observational cosmology. I only want to point that the model fitting problem can be seen from statistical modeling like ANCOVA by identifying covariates and relationships – because the covariate z is continuous, and the degrees are fixed effect (0 to 7, 8 tuples), and the number of companions are random effect (cluster size is random), the comprehensive model could be described by ANCOVA. To my knowledge, scatter plots and simple linear regression are marginalizing all additional contributing factors and information which can be the main contributors of correlations, although it may seem Y and X are highly correlated in the scatter plot. At some points, we must marginalize over unknowns. Nonetheless, we still have some nuisance parameters and latent variables that can be factored into the model, different from ignoring them, to obtain advanced insights and knowledge from observations in many measures/dimensions.

Something, I think, can be done with a small/ergonomic chart/table via hypothesis testing, multivariate regression, model selection, variable selection, dimension reduction, projection pursuit, or names of some state of the art statistical methods, is done in astronomy with numerous scatter plots, with colors, symbols, and lines to account all possible relationships within pairs whose correlation can be artificial. I also feel that trees, electricity, or efforts can be saved from producing nice looking scatter plots. Fitting/Analyzing more comprehensive models put into a statistical fashion helps to identify independent variables or covariates causing strong correlation, to find collinear variables, and to drop redundant or uncorrelated predictors. Bayes factors or p-values can be assessed for comparing models, testing significance their variables, and computing error bars appropriately, not the way that the null hypothesis probability is interpreted.

Lastly, ANCOVA is a complete [MADS].

This is not an assuring absolute statement but a personal impression after reading articles of various fields in addition to astronomy. My readings of other fields tell that many rely on correlation statistics but less scatter plots by adding straight lines going through data sets for the purpose of imposing relationships within variable pairs
the way that chi-square fitting is done and the goodness-of-fit test is carried out is understood by the notion that X causes Y and by the practice that the objective function, the sum of (Y-E[Y|X])^2/σ^2 is minimized
It’s a website of Vassar college, that had a first female faculty in astronomy, Maria Mitchell. It is said that the first building constructed is the Vassar College Observatory, which is now a national historic landmark. This historical factor is the only reason of pointing this website to drag some astronomers attentions beyond statistics.

[MADS] logistic regression

hlee — Tue, 13 Oct 2009 20:15:08 +0000

Although a bit of time has elapsed since my post space weather, saying that logistic regression is used for prediction, it looks like still true that logistic regression is rarely used in astronomy. Otherwise, it could have been used for the similar purpose not under the same statistical jargon but under the Bayesian modeling procedures.

Maybe, some astronomers want to check out this versatile statistical method, wiki:logistic regression to see whether they can fit their data to this statistical method in order to model/predict observation rates, unobserved rates, undetected rates, detected rates, absorbed rates, and so on in terms of what are observed and additional/external observations, knowledge, and theories. I wonder what would it be like if the following is fit using logistic regression: detection limits, Eddington bias, incompleteness, absorption, differential emission measures, opacity, etc plus brute force Monte Carlo simulations emulating likely data to be fit. Then, responses are the probability of observed vs not observed as a function of redshift, magnitudes, counts, flux, wavelength/frequency, and other measurable variables or latent variables.

My simple reasoning that astronomers observe partially and they will never have complete sample, has imposed a prejudice that logistic regression would appear in astronomical literature rather frequently. Against my bet, it was [MADS]. All stat softwares have packages and modules for logistic regression; therefore, you have a data set, application is very straight forward.

———————————[added]
Although logistic regression models are given in many good tutorials, literature, or websites, it might be useful to have a simple but intuitive form of logistic regression for sloggers.

When you have binary responses, metal poor star (Y=1) vs. metal rich star (Y=2), and predictors, such as colors, distance, parallax, precision, and other columns in catalogs (X is a matrix comprised of these variables),
.
As astronomers fit a linear regression model to get the intercept and slope, the same approach is applied to get intercepts and coefficients of logistic regression models.

[Books] Bayesian Computations

hlee — Fri, 11 Sep 2009 20:40:23 +0000

A number of practical Bayesian data analysis books are available these days. Here, I’d like to introduce two that were relatively recently published. I like the fact that they are rather technical than theoretical. They have practical examples close to be related with astronomical data. They have R codes so that one can try algorithms on the fly instead of jamming probability theories.

Bayesian Computation with R
Author:Jim Albert
Publisher: Springer (2007)

As the title said, accompanying R package LearnBayes is available (clicking the name will direct you for downloading the package). Furthermore, the last chapter is about WinBUGS. (Please, check out resources listed in BUGS for other BUGS, Bayesian inference Using Gibbs Sampling) Overall, it is quite practical and instructional. If an young astronomer likes to enter the competition posted below because of sophisticated data requiring non traditional statistical modeling, this book can be a good starting. (Here, traditional methods include brute force Monte Carlo simulations, chi^2/weighted least square fitting, and test statistics with rigid underlying assumptions).

An interesting quote is filtered because of a comment from an astronomer, “Bayesian is robust but frequentist is not” that I couldn’t agree with at the instance.

A Bayesian analysis is said to be robust to the choice of prior if the inference is insensitive to different priors that match the user’s beliefs.

Since there’s no discussion of priors in frequentist methods, Bayesian robustness cannot be matched and compared with frequentist’s robustness. Similar to my discussion in Robust Statistics, I kept the notion that robust statistics is insensitive to outliers or iid Gaussian model assumption. Particularly, the latter is almost ways assumed in astronomical data analysis, unless other models and probability densities are explicitly stated, like Poisson counts and Pareto distribution. New Bayesian algorithms are invented to achieve robustness, not limited to the choice of prior but covering the topics from frequentists’ robust statistics.

The introduction of Bayesian computation focuses on analytical and simple parametric models and well known probability densities. These models and their Bayesian analysis produce interpretable results. Gibbs sampler, Metropolis-Hasting algorithms, and their few hybrids could handle scientific problems as long as scientific models and the uncertainties both in observations and parameters transcribed into well known probability density functions. I think astronomers like to check Chap 6 (MCMC) and Chap 9 (Regression Models). Often times, in order to prove strong correlation between two variables, astronomers adopt simple linear regression models and fit the data to these models. A priori knowledge enhances the flexibility of fitting analysis in which Bayesian computation works robustly different from straightforward chi-square methods. The book does not have sophisticated algorithms nor theories. It only offers very necessities and foundations for Bayesian computations to be accommodated into scientific needs.

The other book is

Bayesian Core: A Practical Approach to Computational Bayesian Statistics.
Author: J. Marin and C.P.Robert
Publisher: Springer (2007).

Although the book is written by statisticians, the very first real data example is CMBdata (cosmic microwave background data; instead of cosmic, the book used cosmological. I’m not sure which one is correct but I’m so used to CMB by cosmic microwave background). Surprisingly, CMB became a very easy topic in statistics in terms of testing normality and extreme values. Seeing the real astronomy data first from the book was the primary reason of introducing this book. Also, it’s a relatively small volume book (about 250 pages) compared other Bayesian textbooks with the broad coverage of topics in Bayesian computation. There are other practical real data sets to illustrate Bayesian computations in the book and these example data sets are found from the book website

The book begins with R, then normal models, regression and variable selection, generalized linear models, capture-recapture experiments, mixture models, dynamic models, and image analysis are covered.

I feel exuberant when I found the book describes the law of large numbers (LLN) that justifies the Monte Carlo methods. The LLN appears often when integration is approximated by summation, which astronomers use a lot without referring the name of this law. For more information, I rather give a wikipedia link to Law of Large Numbers.

Several MCMC algorithms can be mixed together within a single algorithm using either a circular or a random design. While this construction is often suboptimal (in that the inefficient algorithms in the mixture are still used on a regular basis), it almost always brings an improvement compared with its individual components. A special case where a mixed scenario is used is the Metropolis-within-Gibbs algorithm: When building a Gibbs sample, it may happen that it is difficult or impossible to simulate from some of the conditional distributions. In that case, a single Metropolis step associated with this conditional distribution (as its target) can be used instead.

Description in Sec. 4.2 Metropolis-Hasting Algorithms is expected to be more appreciated and comprehended by astronomers because of the historical origins of these topics, detailed balance equation and random walk.

Personal favorite is section 6 on mixture models. Astronomers handle data of multi populations (multiple epochs of star formations, single or multiple break power laws, linear or quadratic models, metalicities from merging or formation triggers, backgrounds+sources, environment dependent point spread functions, and so on) and discusses the difficulties of label switching problems (identifiability issue in codifying data into MCMC or EM algorithm)

A completely different approach to the interpretation and estimation of mixtures is the semiparametric perspective. To summarize this approach, consider that since very few phenomena obey probability laws corresponding to the most standard distributions, mixtures such as (*) can be seen as a good trade-off between fair represntation of the phenomenon and efficient estimation of the underlying distribution. If k is large enough, there is theoretical support for the argument that (*) provides a good approximation (in some functional sense) to most distributions. Hence, a mixture distribution can be perceived as a type of basis approximation of unknown distributions, in a spirit similar to wavelets and splines, but with a more intuitive flavor (for a statistician at least). This chapter mostly focuses on the “parametric” case, when the partition of the sample into subsamples with different distributions f_j does make sense form the dataset point view (even though the computational processing is the same in both cases).

We must point at this stage that mixture modeling is often used in image smoothing but not in feature recognition, which requires spatial coherence and thus more complicated models…

My patience ran out to comprehend every detail of the book but the section of reversible jump MCMC, hidden Markov model (HMM), and Markov random fields (MRF) would be very useful. Particularly, these topics appear often in image processing, which field astronomers have their own algorithms. Adaption and comparison across image analysis methods promises new directions of scientific imaging data analysis beyond subjective denoising, smoothing, and segmentation.

Readers considering more advanced Bayesian computation and rigorous treatment of MCMC methodology, I’d like to point a textbook, frequently mentioned by Marin and Robert.

Monte Carlo Statistical Methods Robert, C. and Casella, G. (2004)
Springer-Verlag, New York, 2nd Ed.

There are a few more practical and introductory Bayesian Analysis books recently published or soon to be published. Some readership would prefer these books of running ink. Perhaps, there is/will be Bayesian Computation with Python, IDL, Matlab, Java, or C/C++ for those who never intend to use R. By the way, for Mathematica users, you would like to check out Phil Gregory’s book which I introduced in [books] a boring title. My point is that applied statistics has become more friendly to non statisticians through these good introductory books and free online materials. I hope more astronomers apply statistical models in their data analysis without much trouble in executing Bayesian methods. Some might want to check BUGS, introduced [BUGS]. This posting contains resources of how to use BUGS and available packages under languages.

Wavelet-regularized image deconvolution

hlee — Fri, 12 Jun 2009 20:47:36 +0000

A Fast Thresholded Landweber Algorithm for Wavelet-Regularized Multidimensional Deconvolution
Vonesch and Unser (2008)
IEEE Trans. Image Proc. vol. 17(4), pp. 539-549

Quoting the authors, I also like to say that the recovery of the original image from the observed is an ill-posed problem. They traced the efforts of wavelet regularization in deconvolution back to a few relatively recent publications by astronomers. Therefore, I guess the topic and algorithm of this paper could drag some attentions from astronomers.

They explain the wavelet based reconstruction procedure in a simple term. The matrix-vector product w_x= Wx yields the coefficients of x in the wavelet basis, and W^TWx reconstructs the signal from these coefficients.

Their assumed model is

y=Hx_orig + b,

where y and x_{orig} are vectors containing uniform samples of the original and measured signals; b represents the measurement error. H is a square (block) circulant matrix that approximates the convolution with the PSF. Then, the problem of deconvolution is to find an estimate that maximizes the cost function

J(x) = J_data(x)+ λ J_reg(x)

They described that “this functional can also interpreted as a (negative) log-likelihood in a Bayesian statistical framework, and deconvolution can then be seen as a maximum a posteriori (MAP) estimation problem.” Also the description of the cost function is applicable to the frequently appearing topic in regression or classification problems such as ridge regression, quantile regression, LASSO, LAR, model/variable selection, state space models from time series and spatial statistics, etc.

The observed image is the d-dimensional covolution of an origianl image (the characteristic function of the object of interest) with the impulse response (or PSF). of the imaging system.

The notion of regularization or penalizing the likelihood seems not well received among astronomers based on my observation that often times the chi-square minimization (the simple least square method) without penalty is suggested and used in astronomical data analysis. Since image analysis with wavelets popular in astronomy, the fast algorithm for wavelet regularized variational deconvolution introduced in this paper could bring faster results to astronomers and could offer better insights of the underlying physical processes by separating noise and background more in a model according fashion, not simple background subtraction.

Curious Cases of the Null Hypothesis Probability

hlee — Tue, 02 Jun 2009 08:03:13 +0000

Even though I traced the astronomers’ casual usage of the null hypothesis probability in a fashion of reporting outputs from data analysis packages of their choice, there were still some curious cases of the null hypothesis probability that I couldn’t solve. They are quite mysterious to me. Sometimes too much creativity harms the original intention. Here are some examples.

Full text search in ADS with “null hypothesis probability” yield 77 related articles (link removed. Search results are floating urls, probably?). Many of them contained the phrase “null hypothesis probability” as it is. The rest were in the context of “given the null hypothesis, the probability of …” I’m not sure this ADS search result includes null hypothesis probability written in tables and captions. It’s possible more than 77 could exist. The majority of articles with the “null hypothesis probability” are just reporting numbers from screen outputs from the chosen data analysis system. Discussions and interpretations of these numbers are more focused toward reduced χ² close to ONE, astronomers’ most favored model selection criterion. Sometimes, I got confused with the goal of their fitting analysis because the driven force is that “make the reduced chi-square closed to one and make residuals look good“. Instead of being used for statistical inferences and measures, a statistic works as an objective function. Numerically (chi-square) or pictorially (residuals) is overshadowed the fundamentals that you observed relatively low number of photons under Poisson distribution and those photons are convolved with complicated instruments. It is possible to underestimated statistically, the reduced chi-sq is off from the unity but based on robust statistics, one still can say the model is a good fit.

Instead of talking about the business of the chi-square method, one thing I wanted to point out from this “null hypothesis probability” investigation is that there was a big presenting style and field distinction between papers of the null hypothesis probability (spectral model fitting) and of given the null hypothesis, the probability of (cosmology). Beyond this casual and personal finding about the style difference, the following quotes despaired me because I couldn’t find answers from statistics.

MNRAS, v.340, pp.1261-1268 (2003): The temperature and distribution of gas in CL 0016+16 measured with XMM-Newton (Worrall and Birkinshaw)

With reduced chi square of 1.09 (chi-sq=859 for 786 d.o.f) the null hypothesis probability is 4 percent but this is likely to result from the high statistical precision of the data coupled with small remaining systematic calibration uncertainties

I couldn’t understand why p-value=0.04 is associated with high statistical precision of the data coupled with small remaining systematic calibration uncertainties. Is it a polite way to say the chi-square method is wrong due to systematic uncertainty? Or does this mean the stat uncertainty is underestimated due the the correlation with sys uncertainty? Or other than p-value, does the null hypothesis probability has some philosophical meanings? Or … I may go on with strange questions due to the statistical ambiguity of the statement. I’d appreciate any explanation how the p-value (the null hypothesis probability) is associated with the subsequent interpretation.

Another miscellaneous question is that If the number (the null hypothesis probability) from software packages is unfavorable or uninterpretable, can we attribute such ambiguity to systematical error?
MNRAS, v. 345(2),pp.423-428 (2003): Iron K features in the hard X-ray XMM-Newton spectrum of NGC 4151 (Schurch, Warwick, Griffiths, and Sembay)

The result of these modifications was a significantly improved fit (chi-sq=4859 for 4754 d.o.f). The model fit to the data is shown in Fig. 3 and the best-fitting parameter values for this revised model are listed as Model 2 in Table 1. The null hypothesis probability of this latter model (0.14) indicates that this is a reasonable representation of the spectral data to within the limits of the instrument calibration.

What is the rule of thumb interpretation of p-values or this null hypothesis probability in astronomy? How one knows that it is reasonable as authors mentioned? How one knows the limits of the instrument calibration and compares quantitatively? How about the degrees of freedom? Some thousands! So large. Even with a million photons, according to the guideline for the number of bins^[1] I doubt that using chi-square goodness of fit for data with such large degree of freedom makes the test too conservative. Also, there should be distinction between the chi square minimization tactic and the chi square goodness of fit test. Using same data for both procedures will introduce bias.
MNRAS, v. 354, pp.10-24 (2004): Comparing the temperatures of galaxy clusters from hdrodynamical N-body simulations to Chandra and XMM-Newton observations (Mazzotta, Rasia, Moscardini, and Tormen)

In particular, solid and dashed histograms refer to the fits for which the null hypothesis has a probiliy >5 percent (statistically acceptable fit) or <5 percent (statistically unacceptable fit), respectively. We also notice that the reduced chi square is always very close to unity, except in a few cases where the lower temperature components is at T~2keV, …

The last statement obscures the interpretation even more to the statement related to what “statistically (un)acceptable fit” really means. The notion of how good a model fits to data and how to test such hypothesis from the statistics standpoint seems different from that of astronomy.
MNRAS, v.346(4),pp.1231-1241: X-ray and ultraviolet observations of the dwarf nova VW Hyi in quiescence (Pandel, Córdova, and Howell)

As can be seen in the null hypothesis probabilities, the cemekl model is in very good agreement with the data.

The computed null hypothesis probabilities from the table 1 are 8.4, 25.7, 42.2, 1.6, 0.7*, and 13.1 percents (* is the result of MKCFLOW model, the rest are CEMEKL model). Probably, the criterion to declare a good fit is a p-value below 0.01 so that CEMEKL model cannot be rejected but MKCFLOW model can be rejected. Only one MKCFLOW which by accident resulted in a small p-value to say that MKCFLOW is not in agreement but the other choice, CEMEKL model is a good model. Too simplified model selection/assessment procedure. I wonder why CEMEKL was tried with various settings but MKCFLOW was only once. I guess there’s is an astrophysical reason of executing such model comparison study but statistically speaking, it looks like comparing measurement of 5 different kinds of oranges and one apple measured by a same ruler (the null hypothesis probability from the chi-square fitting). From the experimental design viewpoint, this is not well established study.
MNRAS, 349, 1267 (2004): Predictions on the high-frequency polarization properties of extragalactic radio sources and implications for polarization measurements of the cosmic microwave background (Tucci et al.)

The correlation is less clear in the samples at higher frequencies (r~ 0.2 and a null-hypothesis probability of >10^{-2}). However, these results are probably affected by the variability of sources, because we are comparing data taken at different epochs. A strong correlation (r>0.5 and a null-hypothesis probability of <10^{-4}) between 5 and 43 GHz is found for the VLA calibrators, the polarization of which is measured simultaneously at all frequencies.

I wonder what test statistic has been used to compute those p-values. I wonder if they truly meant p-value>0.01. At this level, most tools offer more precise number so as to make a suitable statement. The p-value (or the “null hypothesis probability”) is for testing whether r=0 or not. Even r is small, 0.2, still one can reject the null hypothesis if the threshold is 0.05. Therefore, >10^{-2} only add ambiguity. I think point estimates are enough to report the existence of weak and rather strong correlations. Otherwise, reporting both p-values and powers seems more appropriate.
A&A, 342, 502 (1999): X-ray spectroscopy of the active dM stars: AD Leo and EV Lac
(S. Sciortino, A. Maggio, F. Favata and S. Orlando)

This fit yields a value of chi square of 185.1 with 145 υ corresponding to a null-hypothesis probability of 1.4% to give an adequate description of the AD Leo coronal spectrum. In other words the adopted model does not give an acceptable description of available data. The analysis of the uncertainties of the best-fit parameters yields the 90% confidence intervals summarized in Table 5, together with the best-fit parameters. The confidence intervals show that we can only set a lower-limit to the value of the high-temperature. In order to obtain an acceptable fit we have added a third thermal MEKAL component and have repeated the fit leaving the metallicity free to vary. The resulting best-fit model is shown in Fig. 7. The fit formally converges with a value of chi square of 163.0 for 145 υ corresponding to a probability level of ~ 9.0%, but with the hotter component having a “best-fit” value of temperature extremely high (and unrealistic) and essentially unconstrained, as it is shown by the chi square contours in Fig. 8. In summary, the available data constrain the value of metallicity to be lower than solar, and they require the presence of a hot component whose temperature can only be stated to be higher than log (T) = 8.13. Available data do not allow us to discriminate between the (assumed) thermal and a non-thermal nature of this hot component.
…The fit yields a value of [FORMULA] of 95.2 (for 78 degree of freedom) that corresponds to a null hypothesis probability of 2.9%, i.e. a marginally acceptable fit. The limited statistic of the available spectra does not allow us to attempt a fit with a more complex model.

After adding MEKAL, why the degree of freedom remains same? Also, what do they mean by the limited statistic of the available spectra?
MNRAS348, 529 (2004):Powerful, obscured active galactic nuclei among X-ray hard, optically dim serendipitous Chandra sources (Gandhi, Crawford, Fabian, Johnstone)

…, but a low f-test probability for this suggests that we cannot constrain the width with the current data.
While the rest frame equivalent width of the line is close to 1keV, its significance is marginal (f-test gives a null hypothesis probability of 0.1).

Without a contingency table, nor comparing models, I was not sure how they executed the F-test. I could not find two degrees of freedom for the F-test. From the XSPEC’s account for the F-test (http://heasarc.gsfc.nasa.gov/docs/xanadu/xspec/manual/XSftest.html), we see two degrees of freedom, without them, no probability can be computed. Their usage of the F-test seems unconventional. The conventional application of the F-test is for comparing effects of multiple treatments (different levels of drug dosage including placebo); otherwise, it’s just a chi square goodness of fit test or t-test.
Another occasion I came across is interpreting the null hypothesis probability of 0.99 as an indicator of a good fit; well, it’s overfitting. Not only too small null hypothesis probability but also close to one null hypothesis probability should raise a flag for cautions and warnings because the later indicating you are overdoing (too many free parameters for example).

There are some residuals of ambiguity after deducing the definition of the null hypothesis probability by playing with XSPEC and finding cases how this null hypothesis probability is used in literature. Authors sometimes added creative comments in order to interpret the null hypothesis probability from their data analysis, which I cannot understand without statistical imagination. Most can be overlooked, perhaps. Or instead, they are rather to be addressed to astronomers with statistical knowledge to resolve my confusion by the null hypothesis probability. I expect comments on how to view these quotes with statistical rigor from astronomers. The listed are personal. There are some more I really didn’t understand the points but many were straightforward in using the null hypothesis probabilities as p-values in statistical inference under the simple null hypothesis. I just listed some to display my first impression on these quotes most of which I couldn’t draw statistical caricatures out of them. Eventually, I hope some astronomers straighten the meaning and the usage of the null hypothesis probability without overruling basics in statistics.

I only want to add a caution when using the reduced chi-square as a model selection criteria. An indicator of a good-fit from a reduced chi^2 close to unity is only true when grouped data are independent so that the formula of degrees of freedom, roughly, the number of groups minus the number of free parameters, is valid. Personally I doubt this rule applied in spectral fitting that one cannot expect independence between two neighboring bins. In other words, given a source model and given total counts, two neighboring observations (counts in two groups) are correlated. The grouping rules like >25 or S/N>3 do not guarantee the independent assumption for the chi-square goodness of fit test although it may sufficient for Gaussian approximation. Statisticians devised various penalty terms and regularization methods for model selection that suits data types. One way to look is computing proper degrees of freedom, called effective degrees of freedom instead of n-p, to reflect the correlation across groups because of the chosen source model and calibration information. With a large number of counts or large number of groups, unless properly penalized, it is likely that the chi-square fit is hard to reject the null hypothesis than a statistic with smaller degrees of freedom because of the curse of dimensionality.

Mann and Wald (1942), “On the Choice of the Number of Class Intervals in the Application of the Chi-square Test” Annals of Math. Stat. vol. 13, pp.306-7.

4754 d.f.

hlee — Tue, 17 Mar 2009 19:37:44 +0000

I couldn’t believe my eyes when I saw 4754 degrees of freedom (d.f.) and chi-square test statistic 4859. I’ve often enough seen large degrees of freedom from journals in astronomy, several hundreds to a few thousands, but I never felt comfortable at these big numbers. Then with a great shock 4754 d.f. appeared. I must find out why I feel so bothered at these huge degrees of freedom.

When I was learning statistics, I never confronted such huge degrees of freedom. Well, given the facts that only a small amount of time is used for learning the chi-square goodness-of-fit test, that the chi-square distribution is a subset of gamma distribution, and that statisticians do not handle a hundred of thousands (there are more low count spectra but I’ll discuss why I chose this big number later) of photons from X-ray telescopes, almost surely no statistician would confront such huge degrees of freedom.

Degrees of freedom in spectral fitting are combined results of binning (or grouping into n classes) and the number of free parameters (p), i.e. n-p-1. Those parameters of interest, targets to be optimized or to be sought for solutions are from physical source models, which are determined by law of physics. Nothing to be discussed from the statistical point of view about these source models except the model selection and assessment side, which seems to be almost unexplored area. On the other hand, I’d like to know more about binning and subsequent degrees of freedom.

A few binning schemes in spectral analysis that I often see are each bin having more than 25 counts (the same notion of 30 in statistics for CLT or the last number in a t-table) or counts in each bin satisfying a certain signal to noise ratio S/N level. For the latter, it is equivalent that sqrt(expected counts) is larger than the given S/N level since photon counts are Poisson distributed. There are more sophisticated adaptive binning strategies but I haven’t found mathematical, statistical, nor computational algorithmic justifications for those. They look empirical procedures to me that are discovered after many trials and errors on particular types of spectra (I often become suspicious if I can reproduce the same goodness of fit results with the same ObsIDs as reported in those publications). The point is that either simple or complex, at the end, if someone has a data file with large number of photons, n is generally larger than observations with sparse photons. This is the reason I happen to see inconceivable d.f.s to a statistician from some papers, like 4754.

First, the chi-square goodness of fit test was designed for agricultural data (or biology considering Pearson’s eugenics) where the sample size is not a scale of scores of thousands. Please, note that bin in astronomy is called cell (class, interval, partition) in statistical papers and books showing applications of chi-square goodness fit tests.

I also like to point out that the chi-square goodness of fit test is different from the chi-square minimization even if they share the same equation. The former is for hypothesis testing and the latter is for optimization (best fit solution). Using the same data for optimization and testing introduces bias. That’s one of the reasons why with large number of data points, cross validation techniques are employed in statistics and machine learning^[1]. Since I consider binning as smoothing, the optimal number of bins and their size depends on data quality and source model property as is done in kernel density estimation or imminently various versions of chi-square tests or distance based nonparametric tests (K-S test, for example).

Although published many decades ago, you might want to check this paper out to get a proper rule of thumb for the number of bins:
“On the choice of the number of class intervals in the application of the chi square test” (JSTOR link) by Mann and Wald in The Annals of Mathematical Statistics, Vol. 13, No. 3 (Sep., 1942), pp. 306-317 where they showed that the number of classes is proportional to N^(2/5) (The underlying idea about the chi-square goodness of fit tests, detailed derivation, and exact equation about the number of classes is given in detail) and this is the reason why I chose a spectrum of 10^5 photons at the beginning. By ignoring other factors in the equation, 10^5 counts roughly yields 100 bins. About 4000 bins implies more than a billion photons, which seems a unthinkable number in X-ray spectral analysis. Furthermore, many reports said Mann and Wald’s criterion results in too many bins and loss of powers. So, n is subject to be smaller than 100 for 10^5 photons.

The other issue with statistical analysis on X-ray spectra is that although photons in each channel/bin can be treated as independent sample but the expected numbers of photons across bins are related via physical source model or so called link function borrowed from generalized linear model. However, well studied link functions in statistics do not match source models in high energy astrophysics. Typically, source models are not analytical. They are non-linear, numerical, tabulated, or black box type that are incompatible with current link functions in generalized linear model that is a well developed, diverse, and robust subject in statistics for inference problems. Therefore, binning data and chi-square minimization seems to be an only strategy for statistical inference about parameters in source models so far (for some “specific” statistical or physical models, this is not true, which is not a topic of this discussion). Mann and Wald’s method for class size assumes equiprobable bins whereas channel or bin probabilities in astronomy would not satisfy the condition. The probability vector of multinomial distribution depends on binning, detector sensitivity, and source model instead of the equiprobable constraint from statistics. Well, it is hard to device an purely statistically optimal binning/grouping method for X-ray spectral analysis.

Instead of individual group/bin dependent smoothing (S/N>3 grouping, for example), I, nevertheless, wish for developing binning/grouping schemes based on total sample size N particularly when N is large. I’m afraid that with the current chi-square test embedded in data analysis packages, the power of a chi-square statistic is so small and one will always have a good reduced chi-square value (astronomers’ simple model assessment tool: the measure of chi-square statistic divided by degrees of freedom and its expected value is one. If the reduced chi-square criterion is close to one, then the chosen source model and solution for parameters is considered to be best fit model and value). The fundamental idea of suitable number of bins is equivalent to optimal bandwidth problems in kernel density estimation, of which objective is accentuating the information via smoothing; therefore, methodology developed in the field of kernel density estimation may suggest how to bin/group the spectrum while preserving the most of information and increasing the efficiency. A modified strategy for binning and applying the chi-square test statistic for assessing model adequacy should be conceived instead of reporting thousands of degrees of freedom.

I think I must quit before getting too bored. Only I’d like to mention quite interesting papers that cited Mann and Wald (1942) and explored the chi square goodness of fit including Johnson’s A Bayesian chi-square test for Goodness-of-Fit (a link is made to the arxiv pdf file) which might provide more charm to astronomers who like to modify their chi-square methods in a Bayesian way. A chapter “On the Use and Misuse of Chi-Square” (link to google book excerpt) by KL Delucchi in A Handbook for Data Analysis in the Behavioral Sciences (1993) reads quite intriguing although the discussion is a reminder for behavior scientists.

Lastly, I’m very sure that astronomers explored properties of the chi-square statistic and chi-square type tests with their data sets. I admit that I didn’t make an expedition for such works since those are few needles in a mound of haystack. I’ll be very delighted to see an astronomers’ version of “use and misuse of chi-square,” a statistical account for whether the chi-square test with huge degrees of freedom is powerful enough, or any advice on that matter will be very much appreciated.

a rough sketch of cross validation: assign data into a training data set and a test set. get the bet fit from the training set and evaluate the goodness-of-fit with that best fit with the test set. alternate training and test sets and repeat. wiki:cross_validationa