Quotes from Common Errors in Statistics

by P.I.Good and J.W.Hardin. Publisher’s website

My astronomer neighbor mentioned this book a while ago, and much later I found some intriguing quotes in it.

GIGO: Garbage in; garbage out. Fancy statistical methods will not rescue garbage data. Course notes of Raymond J. Carroll (2001)

I often see statements like “data were grouped/binned to improve statistics.” This hardly seems true unless the astronomer knows the true underlying population distribution from which those realizations (either binned or unbinned) are randomly drawn. Nonetheless, smoothing or binning (that is, modifying the sample) can help hypothesis testing to infer the population distribution. This validation step is often ignored, though. For the proper application of statistics, I hope astronomers adopt concepts from the design of experiments to collect good quality data without wasting resources. What I mean by wasting resources is that, because of instrumental and atmospheric limitations, an indefinitely long exposure is not necessary to collect a good quality image. Instead of inspection by human eye, a machine can do the job. I guess that minimax-type optimal points exist for operating telescopes, feature extraction/detection, or model/data quality assessment. Clarifying the sources of uncertainty and stratifying them for testing, sampling, and modeling purposes, as is done in analysis of variance, is largely unexplored in astronomy. Instead, more effort goes into salvaging garbage, and so far many gems have been harvested through a tremendous amount of effort. But I’m afraid it could become as painful as the gold miners’ experience during the mid-19th-century gold rush.

Interval Estimates (p.51)
A common error is to specify a confidence interval in the form (estimate – k*standard error, estimate + k*standard error). This form is applicable only when an interval estimate is desired for the mean of a normally distributed random variable. Even then k should be determined from tables of the Student’s t-distribution and not from tables of the normal distribution.

Obtaining the appropriate degrees of freedom seems most relevant to avoiding this error when the estimates are the coefficients of complex curves or of the equation/model itself. The t-distribution with large d.f. (>30) is hardly distinguishable from the z-distribution (the standard normal).
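To make the point about k concrete, here is a minimal simulation sketch (my own illustration, not taken from the book). It assumes normally distributed data with true mean 0 and checks how often the interval mean ± 1.96*standard error, with 1.96 taken from the normal table, actually covers the true mean. The function name, sample sizes, trial count, and seed are arbitrary choices for illustration.

```python
import random

def z_interval_coverage(n, trials=10000, k=1.96, seed=0):
    """Fraction of simulated samples of size n whose interval
    mean +/- k * standard error contains the true mean (0 here)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        # Sample variance with n - 1 in the denominator.
        var = sum((x - mean) ** 2 for x in sample) / (n - 1)
        se = (var / n) ** 0.5
        if abs(mean) <= k * se:
            hits += 1
    return hits / trials

coverage_small = z_interval_coverage(5)   # 4 degrees of freedom
coverage_large = z_interval_coverage(50)  # 49 degrees of freedom
print(f"n=5:  coverage with k=1.96 is {coverage_small:.3f}")
print(f"n=50: coverage with k=1.96 is {coverage_large:.3f}")
```

With only five points the normal-table k should give coverage well below the nominal 95%, while at n=50 the shortfall is small, which is the sense in which the t- and z-distributions become hard to tell apart at large d.f.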

Desirable estimators are impartial, consistent, efficient, robust, and minimum loss. Interval estimates are to be preferred to point estimates; they are less open to challenge for they convey information about the estimate’s precision.

Every Statistical Procedure Relies on Certain Assumptions for correctness.

What I often fail to find in the astronomy literature are these assumptions. Statistics is not an elixir for every problem; it works only under certain conditions.

Know your objectives in testing. Know your data’s origins. Know the assumptions you feel comfortable with. Never assign probabilities to the true state of nature, but only to the validity of your own predictions. Collecting more and better data may be your best alternative.

Unfortunately, the last sentence is not an option for astronomers.

From Guidelines for a Meta-Analysis
Kepler was able to formulate his laws only because (1) Tycho Brahe had made over 30 years of precise (for the time) astronomical observations and (2) Kepler married Brahe’s daughter and thus gained access to his data.

The situation today is not exactly the same, but the quote reflects something of contemporary reality. Without access to data, there’s not much one can do, and collecting data is painstaking and time-consuming.

From Estimating Coefficient
…Finally, if the error terms come from a distribution that is far from Gaussian, a distribution that is truncated, flattened or asymmetric, the p-values and precision estimates produced by the software may be far from correct.

Please double-check the numbers from your software.
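One possible way to double-check, sketched here under my own assumptions rather than as the book’s prescription, is to compare the software’s normal-theory interval with a bootstrap percentile interval. The example uses synthetic, strongly skewed (log-normal) data and the mean as the estimate; the function names, sample size, and seeds are hypothetical.

```python
import random

def normal_theory_interval(data, k=1.96):
    """Symmetric interval mean +/- k * standard error."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)
    half_width = k * (var / n) ** 0.5
    return mean - half_width, mean + half_width

def bootstrap_percentile_interval(data, n_boot=2000, level=0.95, seed=1):
    """Percentile interval from the means of resampled data sets."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(data, k=n)) / n for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return means[lo_idx], means[hi_idx]

# Synthetic skewed data (an assumption made only for illustration).
rng = random.Random(0)
data = [rng.lognormvariate(0.0, 1.0) for _ in range(30)]
sample_mean = sum(data) / len(data)

normal_interval = normal_theory_interval(data)
boot_interval = bootstrap_percentile_interval(data)
print("sample mean:      ", round(sample_mean, 3))
print("normal-theory CI: ", tuple(round(v, 3) for v in normal_interval))
print("bootstrap CI:     ", tuple(round(v, 3) for v in boot_interval))
```

The normal-theory interval is symmetric about the mean by construction, whereas the bootstrap interval is free to follow the skewness of the data. A large disagreement between the two is a hint that the distributional assumption behind the software’s numbers deserves a closer look.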

To quote Green and Silverman (1994, p. 50), “There are two aims in curve estimation, which to some extent conflict with one another, to maximize goodness-of-fit and to minimize roughness.”

Statistically significant findings should serve as a motivation for further corroborative and collateral research rather than as a basis for conclusions.

To be avoided are a recent spate of proprietary algorithms available solely in software form that guarantee to find a best-fitting solution. In the words of John von Neumann, “With four parameters I can fit an elephant and with five I can make him wiggle his trunk.” Goodness of fit is no guarantee of predictive success, …

If the physics implies wiggles, then there’s nothing wrong with an extra parameter. But it is possible that the best-fit parameters including these wiggles are not the ultimate answer to astronomers’ exploration. They may simply reflect a bias introduced by the additional parameter for wiggles in the model. Various statistical tests are available, and caution is needed before reporting best-fit parameter values (estimates) and their error bars.
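The gap between goodness of fit and predictive success can be illustrated with a small sketch (again my own, on synthetic data). It assumes a truly linear relation with Gaussian noise and compares a two-parameter straight line with a polynomial that has as many parameters as data points and therefore passes through every point. All names and settings are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares straight line (two parameters)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_interpolant(xs, ys):
    """Lagrange polynomial through every point (one parameter per point)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def truth(x):
    # Assumed "physics": a straight line with no wiggles.
    return 1.0 + 2.0 * x

rng = random.Random(0)
train_x = [float(i) for i in range(10)]
test_x = [i + 0.5 for i in range(9)]   # points between the training points
test_y = [truth(x) for x in test_x]    # noise-free truth for scoring
trials = 200
# results[name] = [mean training error, mean prediction error]
results = {"line": [0.0, 0.0], "interpolant": [0.0, 0.0]}
for _ in range(trials):
    train_y = [truth(x) + rng.gauss(0.0, 0.5) for x in train_x]
    for name, fit in (("line", fit_line), ("interpolant", fit_interpolant)):
        model = fit(train_x, train_y)
        results[name][0] += mse(model, train_x, train_y) / trials
        results[name][1] += mse(model, test_x, test_y) / trials

for name, (train_err, test_err) in results.items():
    print(f"{name:12s} fit error {train_err:.4f}   prediction error {test_err:.4f}")
```

Averaged over the simulated data sets, the interpolating polynomial fits the training points perfectly yet predicts the in-between points worse than the plain line, because its extra parameters are spent on following the noise.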
