The AstroStat Slog » Bad AstroStat

Quotes from Common Errors in Statistics

hlee — Fri, 13 Nov 2009 17:13:01 +0000

by P.I.Good and J.W.Hardin. Publisher’s website

My astronomer neighbor mentioned this book a while ago and quite later I found intriguing quotes.

GIGO: Garbage in; garbage out. Fancy statistical methods will not rescue garbage data. Course notes of Raymond J. Carroll (2001)

I often see a statement like data were grouped/binned to improve statistics. This seems hardly true unless the astronomer knows the true underlying population distribution from which those realizations (either binned or unbinned) are randomly drawn. Nonetheless, smoothing/binning (modifying sample) can help hypothesis testing to infer the population distribution. This validation step is often ignored, though. For the righteous procedures of statistics application, I hope astronomers adopt the concepts in the design of experiments to collect good quality data without wasting resources. What I mean by wasting resources is that, due to the instrumental and atmospheric limitations, indefinite exposure is not necessary to collect good quality image. Instead of human eye inspection, machine can do the job. I guess that minimax type optimal points exist for operating telescopes, feature extraction/detection, or model/data quality assessment. Clarifying the sources of uncertainty and stratifying them for testing, sampling, and modeling purposes as done in analysis of variance is quite unexplored in astronomy. Instead, more efforts go to salvaging garbage and so far, many gems are harvested by tremendous amount of efforts. But, I’m afraid that it could get as much painful as gold miners’ experience during the mid 19th century gold rush.

Interval Estimates (p.51)
A common error is to specify a confidence interval in the form (estimate – k*standard error, estimate+k*standard error). This form is applicable only when an interval estimate is desired for the mean of a normally distributed random variable. Even then k should be determined for tables of the Student’s t-distribution and not from tables of the normal distribution.

How to get appropriate degrees of freedom seems most relevant to avoid this error when estimates are the coefficients of complex curves or equation/model itself. The t-distribution with a large d.f. (>30) is hardly discernible from the z-distribution.

Desirable estimators are impartial,consistent, efficient, robust, and minimum loss. Interval estimates are to be preferred to point estimates; they are less open to challenge for they convey information about the estimate’s precision.

Every Statistical Procedure Relies on Certain Assumptions for correctness.

What I often fail to find from astronomy literature are these assumptions. Statistics is not elixir to every problem but works only on certain conditions.

Know your objectives in testing. Know your data’s origins. Know the assumptions you feel comfortable with. Never assign probabilities to the true state of nature, but only to the validity of your own predictions. Collecting more and better data may be your best alternative

Unfortunately, the last sentence is not an option for astronomers.

From Guidelines for a Meta-Analysis
Kepler was able to formulate his laws only because (1) Tycho Brahe has made over 30 years of precise (for the time) astronomical observations and (2) Kepler married Brahe’s daughter and thus gained access to his data.

Not exactly same but it reflects some reality of contemporary. Without gaining access to data, there’s not much one can do and collecting data is very painstaking and time consuming.

From Estimating Coefficient
…Finally, if the error terms come from a distribution that is far from Gaussian, a distribution that is truncated, flattened or asymmetric, the p-values and precision estimates produced by the software may be far from correct.

Please, double check numbers from your software.

To quote Green and Silverman (1994, p. 50), “There are two aims in curve estimation, which to some extent conflict with one another, to maximize goodness-of-fit and to minimize roughness.

Statistically significant findings should serve as a motivation for further corroborative and collateral research rather than as a basis for conclusions.

To be avoided are a recent spate of proprietary algorithms available solely in software form that guarantee to find a best-fitting solution. In the worlds of John von Neumann, “With four parameters I can fit an elephant and with five I can make him wiggle his trunk.” Goodness of fit is no guarantee of predictive success, …

If physics implies wiggles, then there’s nothing wrong with an extra parameter. But it is possible that best fit parameters including these wiggles might not be the ultimate answer to astronomers’ exploration. It can be just a bias due to introducing this additional parameter for wiggles in the model. Various statistical tests are available and caution is needed before reporting best fit parameter values (estimates) and their error bars.

[MADS] plug-in estimator

hlee — Tue, 21 Apr 2009 02:34:40 +0000

I asked a couple of astronomers if they heard the term plug-in estimator and none of them gave me a positive answer.

When computing sample mean (xbar) and sample variance (s^2) to obtain (xbar-s, xbar+s) and to claim this interval covers 68%, these sample mean, sample variance, and the interval are plug-in estimators. Whilst clarifying the form of sampling distribution, or on verifying the formulas or estimators of sample mean and sample variance truly match with true mean and true variance, I can drop plug-in part because I know asymptotically such interval (estimator) will cover 68%.

When there is lack of sample size or it is short in sufficing (theoretic) assumptions, instead of saying 1-σ, one would want to say s, or plug-in error estimator. Without knowing the true distribution (asymptotically, the empirical distribution), somehow 1-σ mislead that best fit and error bar assures 68% coverage, which is not necessary true. What is computed/estimated is s or a plug-in estimator that is defined via Δ chi-square=1. Generally, the Greek letter σ in statistics indicate parameter, not a function of data (estimator), for instance, sample standard deviation (s), root mean square error (rmse), or the solution of Δ chi-square=1.

Often times I see extended uses of statistics and their related terms in astronomical literature which lead unnecessary comments and creative interpretation to account for unpleasant numbers. Because of the plug-in nature, the interval may not cover the expected value from physics. It’s due to chi-square minimization (best fit can be biased) and data quality (there are chances that data contain outliers or go under modification through instruments and processing). Unless robust statistics is employed (outliers could shift best fits and robust statistics are less sensitive to outliers) and calibration uncertainty or some other correction tools are suitably implemented, strange intervals are not necessary to be followed by creative comments or to be discarded. Those intervals are by products of employing plug-in estimators whose statistical properties are unknown during the astronomers’ data analysis state. Instead of imaginative interpretations, one would proceed with investigating those plug-in estimators and try to device/rectify them in order to make sure they lead close to the truth.

For example, instead of simple average (xbar=f(x_1,…,x_n) :average is a function of data whereas the chi-square minimzation method is another function of data), whose breakdown point is asymptotically zero and can be off from the truth, median (another function of data) could serve better (breakdown point is 1/2). We know that the chi-square methods are based on L2 norm (e.g. variation of least square methods). Instead, one can develop methods based on L1 norm as in quantile regression or least absolute deviation (LAD, in short, link from wiki). There are so many statistics are available to walk around short comings of popular plug-in estimators when sampling distribution is not (perfect) gaussian or analytic solution does not exist.

Use and Misuse of Chi-square

hlee — Tue, 31 Mar 2009 19:43:40 +0000

Before using any adaptations of chi-square statistic, please spend a minute or two to ponder whether your strategy with chi-square belongs one of these categories.

1. Lack of independence among the single events or measures
2. Small theoretical frequencies
3. Neglect of frequencies of non-occurrence
4. Failure to equalize \sum O_i (the sum of the observed frequencies) and \sum M_i (the sum of the theoretical frequencies)
5. Indeterminate theoretical frequencies
6. Incorrect or questionable categorizing
7. Use of non-frequency data
8. Incorrect determination of the number of degrees of freedom
9. Incorrect computations (including a failure to weight by N when proportions instead of frequencies are used in the calculations)

From “Chapter 10: On the Use and Misuse of Chi-square” by K.L.Delucchi in A Handbook for Data Analysis in the Behavioral Sciences (1993). Delucchi acknowledged these nine principle sources of error to Lewis and Burke (1949), entitled “The Use and Misuse of the Chi-square” published in Psychological Bulletin.

As described in my post, 4754 d.f., 2 is not a concern if any grouping schemes like >25 per bin is employed. As far as type I error and power is considered, 5 (10) or more in each bin is suggested from the literature of other sciences and astronomers adopt 20 or 25 according to publications in astronomy. However, I do care when grouping the insensitive part of detector channels that could be associated with 1, 3, 5 and 7 so that the chi-square statistic becomes inadequate. 8 and 9 are also done by computer so no worries. 6 is not applicable for astronomers in general because categorical data analysis is not a main subject of spectral or light curve analysis (For those who are curious about categorical data analysis, see a book by Alan Agresi, titled Categorical Data Analysis -Amazon link). Now, 1,3,4,5, and 7 are left among nine categories. One way or the other, they are intertwined due to different detector sensitivity and source models. It is hard to straighten out these categories in terms of X-ray spectral and light curve fitting in order to replace terms in behavior science. Therefore, I’d rather focus on 4.

I wonder if XSPEC and Sherpa offers a tool to check the balance between the sum of observed counts and the sum of expected (model) counts. I wonder if people check this condition when they apply chi-square statistics (not chi-square minimization, and I stated the difference in my post). I don’t think it’s easy as stated in other sciences of surveys and categorical data because high energy astrophysics has effective area, redistribution matrix, and point spread function which are non-linear and add uncertainties to the counts of each bin and as a consequence, the sum of counts. On the other hand, unless the difference is zero, it is obvious that chi-square statistic is biased and all the subsequent inference results like p-values and confidence intervals do not serve the way that they are meant to be.

My empathy toward the prevailed chi-square statistic in astronomy is expressed in Delucchi.

Like the good-natured next door neighbor who always lends a hand without complaining, however, the chi-square statistic is easy to take for granted and easy to misuse.

4754 d.f.

hlee — Tue, 17 Mar 2009 19:37:44 +0000

I couldn’t believe my eyes when I saw 4754 degrees of freedom (d.f.) and chi-square test statistic 4859. I’ve often enough seen large degrees of freedom from journals in astronomy, several hundreds to a few thousands, but I never felt comfortable at these big numbers. Then with a great shock 4754 d.f. appeared. I must find out why I feel so bothered at these huge degrees of freedom.

When I was learning statistics, I never confronted such huge degrees of freedom. Well, given the facts that only a small amount of time is used for learning the chi-square goodness-of-fit test, that the chi-square distribution is a subset of gamma distribution, and that statisticians do not handle a hundred of thousands (there are more low count spectra but I’ll discuss why I chose this big number later) of photons from X-ray telescopes, almost surely no statistician would confront such huge degrees of freedom.

Degrees of freedom in spectral fitting are combined results of binning (or grouping into n classes) and the number of free parameters (p), i.e. n-p-1. Those parameters of interest, targets to be optimized or to be sought for solutions are from physical source models, which are determined by law of physics. Nothing to be discussed from the statistical point of view about these source models except the model selection and assessment side, which seems to be almost unexplored area. On the other hand, I’d like to know more about binning and subsequent degrees of freedom.

A few binning schemes in spectral analysis that I often see are each bin having more than 25 counts (the same notion of 30 in statistics for CLT or the last number in a t-table) or counts in each bin satisfying a certain signal to noise ratio S/N level. For the latter, it is equivalent that sqrt(expected counts) is larger than the given S/N level since photon counts are Poisson distributed. There are more sophisticated adaptive binning strategies but I haven’t found mathematical, statistical, nor computational algorithmic justifications for those. They look empirical procedures to me that are discovered after many trials and errors on particular types of spectra (I often become suspicious if I can reproduce the same goodness of fit results with the same ObsIDs as reported in those publications). The point is that either simple or complex, at the end, if someone has a data file with large number of photons, n is generally larger than observations with sparse photons. This is the reason I happen to see inconceivable d.f.s to a statistician from some papers, like 4754.

First, the chi-square goodness of fit test was designed for agricultural data (or biology considering Pearson’s eugenics) where the sample size is not a scale of scores of thousands. Please, note that bin in astronomy is called cell (class, interval, partition) in statistical papers and books showing applications of chi-square goodness fit tests.

I also like to point out that the chi-square goodness of fit test is different from the chi-square minimization even if they share the same equation. The former is for hypothesis testing and the latter is for optimization (best fit solution). Using the same data for optimization and testing introduces bias. That’s one of the reasons why with large number of data points, cross validation techniques are employed in statistics and machine learning^[1]. Since I consider binning as smoothing, the optimal number of bins and their size depends on data quality and source model property as is done in kernel density estimation or imminently various versions of chi-square tests or distance based nonparametric tests (K-S test, for example).

Although published many decades ago, you might want to check this paper out to get a proper rule of thumb for the number of bins:
“On the choice of the number of class intervals in the application of the chi square test” (JSTOR link) by Mann and Wald in The Annals of Mathematical Statistics, Vol. 13, No. 3 (Sep., 1942), pp. 306-317 where they showed that the number of classes is proportional to N^(2/5) (The underlying idea about the chi-square goodness of fit tests, detailed derivation, and exact equation about the number of classes is given in detail) and this is the reason why I chose a spectrum of 10^5 photons at the beginning. By ignoring other factors in the equation, 10^5 counts roughly yields 100 bins. About 4000 bins implies more than a billion photons, which seems a unthinkable number in X-ray spectral analysis. Furthermore, many reports said Mann and Wald’s criterion results in too many bins and loss of powers. So, n is subject to be smaller than 100 for 10^5 photons.

The other issue with statistical analysis on X-ray spectra is that although photons in each channel/bin can be treated as independent sample but the expected numbers of photons across bins are related via physical source model or so called link function borrowed from generalized linear model. However, well studied link functions in statistics do not match source models in high energy astrophysics. Typically, source models are not analytical. They are non-linear, numerical, tabulated, or black box type that are incompatible with current link functions in generalized linear model that is a well developed, diverse, and robust subject in statistics for inference problems. Therefore, binning data and chi-square minimization seems to be an only strategy for statistical inference about parameters in source models so far (for some “specific” statistical or physical models, this is not true, which is not a topic of this discussion). Mann and Wald’s method for class size assumes equiprobable bins whereas channel or bin probabilities in astronomy would not satisfy the condition. The probability vector of multinomial distribution depends on binning, detector sensitivity, and source model instead of the equiprobable constraint from statistics. Well, it is hard to device an purely statistically optimal binning/grouping method for X-ray spectral analysis.

Instead of individual group/bin dependent smoothing (S/N>3 grouping, for example), I, nevertheless, wish for developing binning/grouping schemes based on total sample size N particularly when N is large. I’m afraid that with the current chi-square test embedded in data analysis packages, the power of a chi-square statistic is so small and one will always have a good reduced chi-square value (astronomers’ simple model assessment tool: the measure of chi-square statistic divided by degrees of freedom and its expected value is one. If the reduced chi-square criterion is close to one, then the chosen source model and solution for parameters is considered to be best fit model and value). The fundamental idea of suitable number of bins is equivalent to optimal bandwidth problems in kernel density estimation, of which objective is accentuating the information via smoothing; therefore, methodology developed in the field of kernel density estimation may suggest how to bin/group the spectrum while preserving the most of information and increasing the efficiency. A modified strategy for binning and applying the chi-square test statistic for assessing model adequacy should be conceived instead of reporting thousands of degrees of freedom.

I think I must quit before getting too bored. Only I’d like to mention quite interesting papers that cited Mann and Wald (1942) and explored the chi square goodness of fit including Johnson’s A Bayesian chi-square test for Goodness-of-Fit (a link is made to the arxiv pdf file) which might provide more charm to astronomers who like to modify their chi-square methods in a Bayesian way. A chapter “On the Use and Misuse of Chi-Square” (link to google book excerpt) by KL Delucchi in A Handbook for Data Analysis in the Behavioral Sciences (1993) reads quite intriguing although the discussion is a reminder for behavior scientists.

Lastly, I’m very sure that astronomers explored properties of the chi-square statistic and chi-square type tests with their data sets. I admit that I didn’t make an expedition for such works since those are few needles in a mound of haystack. I’ll be very delighted to see an astronomers’ version of “use and misuse of chi-square,” a statistical account for whether the chi-square test with huge degrees of freedom is powerful enough, or any advice on that matter will be very much appreciated.

a rough sketch of cross validation: assign data into a training data set and a test set. get the bet fit from the training set and evaluate the goodness-of-fit with that best fit with the test set. alternate training and test sets and repeat. wiki:cross_validationa

Correlation is not causation

vlk — Fri, 06 Mar 2009 13:22:25 +0000

What XKCD says:

The mouseover text on the original says “Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there’.”

It is a bad habit, hard to break, the temptation is great.

Borel Cantelli Lemma for the Gaussian World

hlee — Wed, 03 Dec 2008 04:31:29 +0000

Almost two year long scrutinizing some publications by astronomers gave me enough impression that astronomers live in the Gaussian world. You are likely to object this statement by saying that astronomers know and use Poisson, binomial, Pareto (power laws), Weibull, exponential, Laplace (Cauchy), Gamma, and some other distributions.^[1] This is true. I witness that these distributions are referred in many publications; however, when it comes to obtaining “BEST FIT estimates for the parameters of interest” and “their ERROR (BARS)”, suddenly everything goes back to the Gaussian world.^[2]

Borel Cantelli Lemma (from Planet Math): because of mathematical symbols, a link was made but any probability books have the lemma with proofs and descriptions.

I believe that I live in the RANDOM world. It is not necessarily always Gaussian but with large probability it looks like Gaussian thanks to Large Sample Theory. Here’s the question; “Do astronomers believe the Borel Cantelli Lemma (BCL) for their Gaussian world? And their bottom line of adopting Gaussian almost all occasions/experiments/data analysis is to prove this lemma for the Gaussian world?” Otherwise, one would like to be more cautious and would reason more before the chi-square goodness of fit methods are adopted. At least, I think that one should not claim that their chi-square methods are statistically rigorous, nor statistically sophisticated — for me, astronomically rigorous and sophisticated seems adequate, but no one say so. Probably, saying “statistically rigorous” is an effort of avoiding self praising and a helpless attribution to statistics. Truly, their data processing strategies are very elaborated and difficult to understand. I don’t see why under the name of statistics, astronomers praise their beautiful and complex data set and its analysis results. Often times, I stop for a breath to find an answer for why a simple chi-square goodness of fit method is claimed to be statistically rigorous while I only see the complexity of data handling given prior to the feed into the chi-square function.

The reason of my request for this one step backward prior to the chi-square method is that astronomer’s Gaussian world is only a part of multi-distributional universes, each of which has non negative probability measure.^[3] Despite the relatively large probability, the Gaussian world is just one realization from the set of distribution families. It is not an almost sure observation. Therefore, there is no need of diving into those chi-square fitting methods intrinsically assuming Gaussian, particularly when one knows exact data distributions like Poisson photon counts.

This ordeal of the chi-square method being called statistically rigorous gives me an impression that astronomers are under a mission of proving the grand challenge by providing as many their fitting results as possible based on the Gaussian assumption. This grand challenge is proving Borel-Cantelli Lemma empirically for the Gaussian world or in extension,

Based on the consensus that astronomical experiments and observations (A_i) occur in the Gaussian world and their frequency increase rapidly (i=1,…,n where n goes to infinity), for every experiment and observation (iid), by showing $$\sum_{i=1}^\infty P(A_i) =\infty,$$ the grand challenge that P(A_n, i.o.)=1 or the Gaussian world is almost always expected from any experiments/observations, can be proven.

Collecting as many results based on the chi-square methods is a sufficient condition for this lemma. I didn’t mean to ridicule but I did a bit of exaggeration by saying “the grand challenge.” By all means, I’m serious and like to know why astronomers are almost obsessed with the chi-square methods and the Gaussian world. I want to think plainly that adopting a chi-square method blindly is just a tradition, not a grand challenge to prove P(Gaussian_n i.o.)=1. Luckily, analyzing data in the Gaussian world hasn’t confronted catastrophic scientific fallacy. “So, why bother to think about a robust method applicable in any type of distributional world?”

Fortunately, I sometimes see astronomers who are not interested in this grand challenge of proving the Borel Cantelli Lemma for the Gaussian world. They provoke the traditional chi-square methods with limited resources – lack of proper examples and supports. Please, don’t get me wrong. Although I praise them, I’m not asking every astronomer to be these outsiders. Statisticians need jobs!!! Nevertheless, a paragraph and a diagnostic plot, i.e. a short justifying discussion for the chi-square is very much appreciated to convey the idea that the Gaussian world is the right choice for your data analysis.

Lastly, I’d like to raise some questions. “How confident are you that residuals between observations and the model are normally distribution only with a dozen of data points and measurement errors?” “Is the least square fitting is only way to find the best fit for your data analysis?” “When you know the data distribution is skewed, are you willing to use Δ χ₂ for estimating σ since it is the only way Numerical Recipe offers to estimate the σ?” I know that people working on their project for many months and years. Making an appointment with folks at the statistical consulting center of your institution and spending an hour or so won’t delay your project. Those consultants may or may not confirm that the strategies of chi-square or least square fitting is the best and convenient way. You may think statistical consulting is wasting time because those consultants do not understand your problems. Yet, your patience will pay off. Either in the Gaussian or non-Gaussian world, you are putting a correct middle stone to build a complete and long lasting tower. You already laid precious corner stones.

It is a bit disappointing fact that not many mention the t distribution, even though less than 30 observations are available.
To stay off this Gaussian world, some astronomers rely on Bayesian statistics and explicitly say that it is the only escape, which is sometimes true and sometimes not – I personally weigh more that Bayesians are not always more robust than frequentist methods as opposed to astronomers’ discussion about robust methods.
This non negativity is an assumption, not philosophically nor mathematically proven. My experience tells me the existence of Poissian world so that P(Poisson world)>0 and therefore, P(Gaussian world)<1 in reality.

after “Thanks to Henrietta Leavitt”

hlee — Fri, 07 Nov 2008 03:22:26 +0000

Personally, it was a highly anticipated symposium at CfA because I was fascinated about the female computers’ (or astronomers’) contributions that occurred here about a century ago even though at that time women were not considered as scientists but mere assistants for tedious jobs.

I learned more history particularly about Ms. Henrietta Leavitt who first speculated the period-luminosity relation from Cepheid stars. Her work is a real painstaking task that cannot be compared to finding a needle in a haystack. It’s like finding some needles from a same manufacturer from countless haystacks, which may or may not have a needle from the specific manufacturer. The worst part is, needles are needles. Not many needles have tags like your clothing for an identification.

However, I was disappointed because of two reasons. First is a minor disappointment but very valuable. The author (George Johnson) of the book, Miss Leavitt’s star – I haven’t read, actually I didn’t know it exists until today – answered my question that he does not think Ms Leavitt’s was exposed to statistical research. Finding a relationship between period and luminosity is closely related a simple regression analysis and I thought she knew about statistics to associate her discovery to now so called, the Leavitt’s law. This disappointment actually lead me to question when the statistical analysis kicked in in astronomy, particularly finding relationships in any studies related to the standard candle, to find out the correct estimate of the Hubble constant.

The second reason of my disappointment is very poorly executed statistics. Obviously, it’s not Ms. Leavitt who imposed such strange trend and statistical malpractices (or carelessness) in regression analysis among astronomers. Whenever speakers bring out scatter plots with regression lines and data points with error bars, I keep murmuring silently, “Oh, my God, how come they blindly did that?” There were statistical issues to be addressed prior to stating that their results support a certain hypothesis instead of putting a straight line and claiming that – “see, how good the slope is” – the Hubble constant is # plus minus $. A high leverage point on the right in addition to less than a dozen points clumped in the left corner, without various diagnostics tools in regression analysis, one does not claim that the straight line is a good fit nor can say that the analysis backs up the hypothesis. Perhaps these statistical diagnoses only advocate their concluding estimates and their uncertainty, and so are omitted. However, my feeling upon looking plots tells me that a simple bootstrap could prove that their estimates are not accurate as they think. Until you try, you don’t know, though. I may email those speakers politely if I can have data points they used for their scatter plots. Unfortunately, I know no one is willing to give me their data points for such unjust cause since even good causes, I had experiences of indifference (I myself might do the same if I were in their positions, no complaints!!!).

Regardless of these disappointments from the statistical instinct, it was a scientifically very interesting symposium and like to thank who made great efforts to put things together. It helped me to resolve some of my crave to know about Ms. Leavitt and to satisfy one of my old wishes that her work to be recognized under her name. If there is one, I wish I could have attended the symposium to commemorate the centennial of student-t, this year. It’s always good to know the history better to move forward.

Asides, during G. Johnson’s talk, he showed pictures of apt. building, which I see everyday, that Ms. Leavitt made her residence until her death and of Mr. Auburn cemetery, very beautiful calming place, where she was buried. I wished she had lived longer to see a glimpse of her great contribution to astronomical science.

[Q] Objectivity and Frequentist Statistics

vlk — Mon, 29 Sep 2008 06:15:14 +0000

Is there an objective method to combine measurements of the same quantity obtained with different instruments?

Suppose you have a set of N₁ measurements obtained with one detector, and another set of N₂ measurements obtained with a second detector. And let’s say you wanted something as simple as an estimate of the mean of the quantity (say the intensity) being measured. Let us further stipulate that the measurement errors of each of the points is similar in magnitude and neither instrument displays any odd behavior. How does one combine the two datasets without appealing to subjective biases about the reliability or otherwise of the two instruments?

We’ve mentioned this problem before, but I don’t think there’s been a satisfactory answer.

The simplest thing to do would be to simply pool all the measurements into one dataset with N=N₁+N₂ measurements and compute the mean that way. But if the number of points in each dataset is very different, the simple combined sample mean is actually a statement of bias in favor of the dataset with more measurements.

In a Bayesian context, there seems to be at least a well-defined prescription: define a model, compute the posterior probability density for the model parameters using dataset 1 using some non-informative prior, use this posterior density as the prior density in the next step, where a new posterior density is computed from dataset 2.

What does one do in the frequentist universe?

[Update 9/30] After considerable discussion, it seems clear that there is no way to do this without making some assumption about the reliability of the detectors. In other words, disinterested objectivity is a mirage.

Classification and Clustering

hlee — Thu, 18 Sep 2008 23:48:43 +0000

Another deduced conclusion from reading preprints listed in arxiv/astro-ph is that astronomers tend to confuse classification and clustering and to mix up methodologies. They tend to think any algorithms from classification or clustering analysis serve their purpose since both analysis algorithms, no matter what, look like a black box. I mean a black box as in neural network, which is one of classification algorithms.

Simply put, classification is regression problem and clustering is mixture problem with unknown components. Defining a classifier, a regression model, is the objective of classification and determining the number of clusters is the objective of clustering. In classification, predefined classes exist such as galaxy types and star types and one wishes to know what prediction variables and their functional allow to separate Quasars from stars without individual spectroscopic observations by only relying on handful variables from photometric data. In clustering analysis, there is no predefined class but some plots visualize multiple populations and one wishes to determine the number of clusters mathematically not to be subjective in concluding remarks saying that the plot shows two clusters after some subjective data cleaning. A good example is that as photons from Gamma ray bursts accumulate, extracting features like F_{90} and F_{50} enables scatter plots of many GRBs, which eventually led people believe there are multiple populations in GRBs. Clustering algorithms back the hypothesis in a more objective manner opposed to the subjective manner of scatter plots with non statistical outlier elimination.

However, there are challenges to make a clear cut between classification and clustering both in statistics and astronomy. In statistics, missing data is the phrase people use to describe this challenge. Fortunately, there is a field called semi-supervised learning to tackle it. (Supervised learning is equivalent to classification and unsupervised learning is to clustering.) Semi-supervised learning algorithms are applicable to data, a portion of which has known class types and the rest are missing — astronomical catalogs with unidentified objects are a good candidate for applying semi-supervised learning algorithms.

From the astronomy side, the fact that classes are not well defined or subjective is the main cause of this confusion in classification and clustering and also the origin of this challenge. For example, will astronomer A and B produce same results in classifying galaxies according to Hubble’s tuning fork?^[1] We are not testing individual cognitive skills. Is there a consensus to make a cut between F9 stars and G0 stars? What make F9.5 star instead of G0? With the presence of error bars, how one is sure that the star is F9, not G0? I don’t see any decision theoretic explanation in survey papers when those stellar spectral classes are presented. Classification is generally for data with categorical responses but astronomer tend to make something used to be categorical to continuous and still remain to apply the same old classification algorithms designed for categorical responses.

From a clustering analysis perspective, this challenge is caused by outliers, or peculiar objects that do not belong to the majority. The size of this peculiar objects can make up a new class that is unprecedented before. Or the number is so small that a strong belief prevails to discard these data points, regarded as observational mistakes. How much we can trim the data with unavoidable and uncontrollable contamination (remember, we cannot control astronomical data as opposed to earthly kinds)? What is the primary cause defining the number of clusters? physics, statistics, astronomers’ experience in processing and cleaning data, …

Once the ambiguity in classification, clustering, and the complexity of data sets is resolved, another challenge is still waiting. Which black box? For the most of classification algorithms, Pattern Recognition and Machine Learning by C. Bishop would offer a broad spectrum of black boxes. Yet, the book does not include various clustering algorithms that statisticians have developed in addition to outlier detection. To become more rigorous in selecting a black box for clustering analysis and outlier detection, one is recommended to check,

Clustering Analysis by Everitt, Landau, and Leese
Data Clustering: Theory, Algorithms, and Applications by Gan, Ma, and Wu
collection of articles and presentation files on nonparametric multivariate analysis by Robert Serfling (Yes, the author of the classical book, Approximation Theorems of Mathematical Statistics), particularly about data depth and outlier detection and
robust statistics by Peter Huber

For me, astronomers tend to be in a haste owing to the pressure of publishing results immediately after data release and to overlook suitable methodologies for their survey data. It seems that there is no time for consulting machine learning specialists to verify the approaches they adopted. My personal prayer is that this haste should not be settled as a trend in astronomical survey and large data analysis.

Check out the project, GALAXY ZOO

A History of Markov Chain Monte Carlo

hlee — Wed, 17 Sep 2008 18:11:01 +0000

I’ve been joking about the astronomers’ fashion in writing Markov chain Monte Carlo (MCMC). Frequently, MCMC was represented by Monte Carlo Markov Chain in astronomical journals. I was curious about the history of this new creation. Overall, I thought it would be worth to learn more about the history of MCMC and this paper was up in arxiv:

[stat.CO:0808.2902] A History of Markov Chain Monte Carlo–Subjective Recollections from Incomplete Data– by C. Robert and G. Casella
Abstract: In this note we attempt to trace the history and development of Markov chain Monte Carlo (MCMC) from its early inception in the late 1940′s through its use today. We see how the earlier stages of the Monte Carlo (MC, not MCMC) research have led to the algorithms currently in use. More importantly, we see how the development of this methodology has not only changed our solutions to problems, but has changed the way we think about problems.

Here is the year list of monumental advances in the MCMC history,

1946: ENIAC
late 1940′s: inception along with Monte Carlo methods.
1953: Metropolis algorithm published in Journal of Chemical Physics (Metropolis et al.)
1970: Hastings algorithms in Biometrika (Hastrings)
1974: Gibbs sampler and Hammersley-Clifford theorem paper by Besag and its discussion by Hammersley in JRSSS B
1977: EM algorithm in JRSSS B (Dempster et al)
1983: Simulated Annealing algorithm (Kirkpatrick et al.)
1984: Gibbs sampling in IEEE Trans. Pattern Anal. Mach. Intell. (Geman and Geman, this paper is responsible for the name)
1987: data augmentation in JASA (Tanner and Wong)
1980s: image analysis and spatial statistics enjoyed MCMC algorithms, not popular with others due to the lack of computing power
1990: seminal paper by Gelfand and Smith in JSAS
1991: BUGS was presented at the Valencia meeting
1992: introductory paper by Casella and Georgy
1994: influential MCMC theory paper by Tierney in Ann. Stat.
1995: reversible jump algorithm in Biometrika (Green)
mid 1990′s: boom of MCMC due to particle filters, reversible jump and perfect sampling (second-generation of MCMC revolution)

and a nice quote from conclusion.

MCMC changed out emphasis from “closed form” solutions to algorithms, expanded our immpact to solving “real” applied problems, expanded our impact to improving numerical algorithms using statistical ideas, and led us into a world where “exact” now means “simulated”!

If you consider applying MCMC methods in your data analysis, references listed in Robert and Casella serve as a good starting point.