I’ve noticed that there are rapidly growing interests and attentions in data mining and machine learning among astronomers but the level of execution is yet rudimentary or partial because there has been no comprehensive tutorial style literature or book for them. I recently introduced a machine learning book written by an engineer. Although it’s a very good book, it didn’t convey the foundation of machine learning built by statisticians. In the quest of searching another good book so as to satisfy the astronomers’ pursuit of (machine) learning methodology with the proper amount of statistical theories, the first great book came along is **The Elements of Statistical Learning**. It was chosen for this writing not only because of its fame and its famous authors (Hastie, Tibshirani, and Friedman) but because of my personal story. In addition, the 2nd edition, which contains most up-to-date and state-of-the-art information, was released recently.

First, the book website:

The Elements of Statistical Learning byHastie, Tibshirani, and Friedman

You’ll find examples, R codes, relevant publications, and plots used in the text books.

Second, I want to tell how I learned about this book before its first edition was published. Everyone has a small moment of meeting very famous people. Mine is shaking hands with President Clinton, in 2000. I still remember the moment vividly because I really wanted to tell him that ice cream was dripping on his nice suit but the top of the line guards blocked my attempt of speaking/pointing icecream dripping with a finger afterward the hand shaking. No matter what context is, shaking hands with one of the greatest presidents is a memorable thing. Yet it was not my cherishing moment because of icecreaming dripping and scary bodyguards. My most cherishing moment of meeting famous people is the half an hour conversation with late Prof. Leo Breinman (click for my two postings about him), author of probability textbook, creator of CART, and the most forefront pioneer in machine learning.

The conclusion of that conversation was a book soon to be published after explaining him my ideas of applying statistics to astronomical data and his advices to each problems. I was not capable to understand every statistics so that his answer about this new coming book at that time was the most relevant and apt one.

This conversation happened during the 3rd Statistical Challenges in Modern Astronomy (SCMA). Not long passed since I began my graduate study in statistics but had an opportunity to assist the conference organizer, my advisor Dr. Babu and to do some chores during the conference. By accident, I read the book by Murtagh about multivariate data analysis, so I wanted to speak to him. Except that, I have no desire to speak renown speakers and attendees. Frankly, I didn’t have any idea who’s who at the conference and a few years later, I realized that the conference dragged many famous people and the density of such people was higher than any conference I attended. Who would have imagine that I could have a personal conversation with Prof. Breiman, at that time. I have seen enough that many famous professors train people during conferences. Getting a chance for chatting some seconds are really hard and tall/strong people push someone small like me away always.

The story goes like this: a sunny perfect early summer afternoon, he was taking a break for a cigar and I finished my errands for the session. Not much to do until the end of session, I decided to take some fresh air and I spotted him enjoying his cigar. Only the worst was that I didn’t know he was the person of CART and the founder of statistical machine learning. Only from his talk from the previous session, I learned he was a statistician, who did data mining on galaxies. So, I asked him if I can join him and ask some questions related to some ideas that I have. One topic I wanted to talk about classification of SN light curves, by that time from astronomical text books, there are Type I & II, and Type I has subcategories, Ia, Ib, and Ic. Later, I heard that there is Type III. But the challenge is observations didn’t happen with equal intervals. There were more data mining topics and the conversation went a while. In the end, he recommended me a book which will be published soon.

Having such a story, a privilege of talking to late Prof. Breiman through an very unique meeting, SCMA, before knowing the fame of the book, this book became one of my favorites. The book, indeed, become popular, around that time, almost only book discussing statistical learning; therefore, it was an excellent textbook for introducing statistics to engineerers and machine learning to statisticians. In the mean time, statistical learning enjoyed popularity in many disciplines that have data sets and urging for learning with the aid of machine. Now books and journals on machine learning, data mining, and knowledge discovery (KDD) became prosperous. I was so delighted to see the 2nd edition in the market to bridge the gap over the years.

I thank him for sharing his cigar time, probably his short free but precious time for contemplation, with me. I thank his patience of spending time with such an ignorant girl with a foreign english accent. And I thank him for introducing a book which will became a bible in the statistical learning community within a couple of years (I felt proud of myself that I access the book before people know about it). Perhaps, astronomers cannot have many joys from this book that I experienced from how I encounter the book, who introduced the book, whether the book was used in a course, how often book is referred, etc. But I assure that it’ll narrow the gap in the notions how astronomers think about data mining (preprocessing, pipelining, and bulding catalogs) and how statisticians treat data mining. The newly released 2nd edition would help narrowing the gap further and assist astronomers to coin brilliant learning algorithms specific for astronomical data. [The END]

—————————– Here, I patch my scribbles about the book.

What distinguish this book from other machine learning books is that not only authors are big figures in statistics but also fundamentals of statistics and probability are discussed in all chapters. Most of machine learning books only introduce elementary statistics and probability in chapter 2, and no basics in statistics is discussed in later chapters. Generally, empirical procedures, computer algorithms, and their results without presenting basic theories in statistics are presented.

You might want to check the book’s website for data sets if you want to try some ideas described there

The Elements of Statistical Learning

In addition to its historical footprint in the field of statistical learning, I’m sure that some astronomers want to check out topics in the book. It’ll help to replace some data analysis methods in astronomy celebrating their centennials sooner or later with state of the art methods to cope with modern data.

This new edition reflects some evolutions in statistical learning whereas the first edition has been an excellent harbinger of the field. Pages quoted from the 2nd edition.

[p.28] Suppose in fact that our data arose from a statistical model $Y=f(X)+e$ where the random error e has E(e)=0 and is independent of X. Note that for this model, f(x)=E(Y|X=x) and in fact the conditional distribution Pr(Y|X) depends on X

onlythrough the conditional mean f(x).

The additive error model is a useful approximation to the truth. For most systems the input-output pairs (X,Y) will not have deterministic relationship Y=f(X). Generally there will be other unmeasured variables that also contribute to Y, including measurement error. The additive model assumes that we can capture all these departures from a deterministic relationship via the error e.

How statisticians envision “model” and “measurement errors” quite different from astronomers’ “model” and “measurement errors” although in terms of “additive error model” they are matching due to the properties of Gaussian/normal distribution. Still, the dilemma of hen or eggs exists prior to any statistical analysis.

[p.30] Although somewhat less glamorous than the learning paradigm, treating supervised learning as a problem in function approximation encourages the geometrical concepts of Euclidean spaces and mathematical concepts of probabilistic inference to be applied to the problem. This is the approach taken in this book.

Strongly recommend to read chapter 3, Linear Methods for Regression: In astronomy, there are so many important coefficients from regression models, from Hubble constant to absorption correction (temperature and magnitude conversion is another example. It seems that these relations can be only explained via OLS (ordinary least square) with the homogeneous error assumption. Yet, books on regressions and linear models are not generally thin. As much diversity exists in datasets, more amount of methodology, theory and assumption exists in order to reflect that diversity. One might like to study the statistical properties of these indicators based on mixture and hierarchical modeling. Some inference, say population proportion can be drawn to verify some hypotheses in cosmology in an indirect way. Understanding what regression analysis and assumptions and how statistician efforts made these methods more robust and interpretable, and reflecting reality would change forcing E(Y|X)=aX+b models onto data showing correlations (not causality).

]]>A student had to figure out the name of a stellar object as part of an assignment. He was given the following information about it:

- apparent [V] magnitude = 5.76
- B-V = 0.02
- E(B-V) = 0.00
- parallax = 0.0478 arcsec
- radial velocity = -18 km/s
- redshift = 0 km/s

He looked in all the stellar databases but was unable to locate it, so he asked the CfA for help.

Just to help you out, here are a couple of places where you can find comprehensive online catalogs:

See if you can find it!

Answer next ~~week~~ month.

**Update (2010-aug-02):**

The short answer is, I could find no such star in any commonly available catalog. But that is not the end of the story. There does exist a star in the Hipparcos catalog, HIP 103389, that has approximately the right distance (21 pc), radial velocity (-16.1 km/s), and *V* magnitude (5.70). It doesn’t match exactly, and the *B-V* is completely off, but that is the moral of the story.

The thing is, catalogs are not perfect. The same objects often have very different numerical entries in different catalogs. This could be due to a variety of reasons, such as different calibrations, different analysers, or even intrinsic variations in the source. And you can bet your bottom dollar that the quoted statistical uncertainties in the quantities do not account for the observed variance. Take the *B-V* value, for instance. It is 0.5 for HIP 103389, but the initial problem stated that it was 0.02, which makes it an A type star. But if it were an A type star at 21 pc, it should have had a magnitude of *V*~1.5, much brighter than the required 5.76!

I think this illustrates one of the fundamental tenets of science as it is practiced, versus how it is taught. The first thing that a practicing scientist does (especially one not of the theoretical persuasion) is to try and see where the data might be wrong or misleading. It should only be included in analysis after it passes various consistency checks and is deemed valid. The moral of the story is, don’t trust data blindly just because it is a “number”.

]]>When I was teaching statistics, despite undergraduate courses, there were both undergraduate and graduate students of various fields except astrophysics majors. I wondered why they were not encouraged to take some basic statistics whereas they were encouraged to take some computer science courses. As there are many astronomers good at programming and designing tools, I’m sure that recommending students to take statistics courses will renovate astronomical data analysis procedures (beyond Bevington’s book) and hind theories (statistics and mathematics per se, not physics laws).

Here’s an interesting lecture for developing a curriculum for the new era in computer science and why the basic probability theory and statistics is important to raise versatile computer scientists. It could be a bit out dated now because I saw it several months ago.

About a little more than the half way through the lecture, he emphasizes that probability course partaking the computer science curriculum. I wonder any astronomy professor has similar arguments and stresses for any needs of basic probability theories to be learned among young future astrophysicists in order to prevent many statistics misuses appearing in astronomical literature. Particularly confusions between fitting (estimating) and inference (both model assessment and uncertainty quantification) are frequently observed in literature where authors claim their superior statistics and statistical data analysis. I personally sometimes attribute this confusion to the lack of distinction between what is random and what is deterministic, or strong believe in their observed and processed data absent from errors and probabilistic nature.

Many introductory books introduce very interesting problems many of which have some historical origins to introduce probability theories (many anecdotes). One can check out the very basics, **probability axioms**, ** and ****measurable function** from wikipedia. With examples, probability is high school or lower level math that you already know but with jargon you’ll like to recite lexicons many times so that you are get used to foundations, basics, and their theories.

We often mention **measurable** to discuss random variables, uncertainties, and distributions without verbosity. “Assume **measurable space** … ” saves multiple paragraphs in an article and changes the structure of writing. This short adjective implies so many assumptions depending on statistical models and equations that you are using for best fits and error bars.

Consider a LF, that is truncated due to observational limits. The common practice I saw is drawing a histogram in a way that the adaptive binning makes the overall shape reflecting a partial bell shape curve. Thanks to its smoothed look, scientists impose a gaussian curve to partially observed data and find parameter estimates that determine the shape of this gaussian curve. There is no imputation step to fake unobserved points to comprise the full probability space. The parameter space of gaussian curves frequently does not coincide with the physically feasible space; however, such discrepancy is rarely discussed in astronomical literature and subsequent biased results look like a taboo.

Although astronomers emphasize the importance of uncertainties, factorization nor stratification of uncertainties has never been clear (model uncertainty, systematic uncertainty or bias, statistical uncertainties or variance). Hierarchical relationships or correlations among these different uncertainties are never addressed in a full measure. Basics of probability theory and the understanding of random variables would help to characterize uncertainties both in mathematical sense and astrophysical sense. This knowledge also assist appropriate quantification of these characterized uncertainties.

Statistical models are rather simple compared to models of astrophysics. However, statistics is the science of understanding uncertainties and randomness and therefore, some strategies of transcribing from complicated astrophysical models into statistical models, in order to reflect the probabilistic nature of observed (or parameters, for Bayesian modeling), are necessary. Both raw or processed data manifest the behavior of random variables. Their underlying processes determine not only physics models but also statistical models written in terms of random variables and the link functions connecting physics and uncertainties. To my best understanding, bridging and inventing statistical models for astrophysics researches seem tough due to the lack of awareness of basics of probability theory.

Once I had a chance to observe a Decadal survey meeting, which covered so diverse areas in astronomy. They discussed new projects, advancing current projects, career developments, and a little bit about educating professional astronomers apart from public reach (which often receives more importance than university curriculum. I also believe that wide spread public awareness of astronomy is very important). What I missed while I observing the meeting is that interdisciplinary knowledge transferring efforts to broaden the field of astronomy and astrophysics nor curriculum design ideas. Because of its long history, I thought astronomy is a science of everything. Marching a path for a long time made astronomy more or less the most isolated and exclusive science.

Perhaps asking astronomy majors taking multiple statistics courses is too burdensome; therefore being taught by faculty who are specialized in (statistical) data analysis organizes a data analysis course and incorporates several hours of basic probability is more realistic and what I anticipate. With a few hours of bringing fundamental notions in random variables and probability, the claims of “statistical rigorous methods and powerful results” will become more appropriate. Currently, statistics is science but in astronomy literature, it looks more or less like an adjective that modify methods and results like “powerful”, “superior”, “excellent”, “better”, “useful,” and so on. Basics of probability is easily incorporated into introduction of algorithms in designing experiments and optimization methods, which are currently used in a brute force fashion^{[1]}.

Occasionally, I see gems from arxiv written by astronomers. Their expertise in astronomy and their interest in statistics has produced intriguing accounts for statistically rigorous data analysis and inference procedures. Their papers includes explanation of fundamentals of statistics and probability more appropriate to astronomers than statistics textbooks for scientists and engineers of different fields. I wish more astronomers join this venture knowing basics and diversities of statistics to rectify many unconscious misuses of statistics while they argue that their choice of statistics is the most powerful one thanks to plausible results.

- What I mean by a brute force fashion is that trying all methods listed in the software manual, and then later, stating that the method A gave most plausible values that matches with data in a scatter plot

I teach that statistics (done the quantile way) can be simultaneously frequentist and Bayesian, confidence intervals and credible intervals, parametric and nonparametric, continuous and discrete data. My first step in data modeling is identification of parametric models; if they do not fit, we provide nonparametric models for fitting and simulating the data. The practice of statistics, and the modeling (mining) of data, can be elegant and provide intellectual and sensual pleasure. Fitting distributions to data is an important industry in which statisticians are not yet vendors. We believe that unifications of statistical methods can enable us to advertise, “What is your question? Statisticians have answers!”

I couldn’t help liking this paragraph because of its bitter-sweetness. I hope you appreciate it as much as I did.

]]>First of all, I’d like to ask how you would like to estimate the chance of having nukes in a country? What this 80% implies here? But, before getting to the question, I’d like to discuss computing the chance of e coli infection, first.

From the frequentists perspective, computing the chance of e coli infection is investigating sample of lettuce and counts species that are infected: n is the number of infected species and N is the total sample size. 1% means one among 100. Such percentage reports and their uncertainties are very familiar scene during any election periods for everyone. From Bayesian perspective, Pr(p|D) ~ L(D|p) pi(p), properly choosing likelihoods and priors, one can estimate the chance of e coli infection and uncertainty. Understanding of sample species and a prior knowledge helps to determine likelihoods and priors.

How about the chance that country A has nukes? Do we have replicates of country A so that a committee investigate each country and count ones with nukes to compute the chance? We cannot do that. Traditional frequentist approach, based on counting, does not work here to compute the chance. Either using **fiducial likelihood** approach or **Bayesian ** approach, i.e. carefully choosing a likelihood function adequately (priors are only for Bayesian) allows one to compuate such probability of interest. In other words, those computed chances highly depend on the choice of model and are very subjective.

So, here’s my concern. It seems like that astronomers want to know the chance of their spectral data being described by a model (A*B+C)*D (each letter stands for one of models such as listed in Sherpa Models). This is more like computing the chance of having nukes in country A, not counting frequencies of the event occurrence. On the other hand, p-value from goodness of fit tests, LRTs, or F-tests is a number from the traditional frequentists’ counting approach. In other words, p-value accounts for, under the null hypothesis (the (A*B+C)*D model is the right choice so that residuals are Gaussian), how many times one will observe the event (say, reduced chi^2 >1.2) if the experiments are done N times. The problem is that we only have one time experiment and that one spectrum to verify the (A*B+C)*D is true. Goodness of fit or LRT only tells the goodness or the badness of the model, not the statistically and objectively quantified chance.

In order to know the chance of the model (A*B+C)*D, like A has nuke with p%, one should not rely on p-values. If you have multiple models, one could compute pairwise relative chances i.e. odds ratios, or Bayes factors. However this does not provide the uncertainty of the chance (astronomers have the tendency of reporting uncertainties of any point estimates even if the procedure is statistically meaningless and that quantified uncertainty is not statistical uncertainty, as in using delta chi^2=1 to report 68% confidence intervals). There are various model selection criteria that cater various conditions embedded in data to make a right model choice among other candidate models. In addition, post-inference for astronomical models is yet a very difficult problem.

In order to report the righteous chance of (A*B+C)*D requires more elaborated statistical modeling, always brings some fierce discussions between frequentists and Bayesian because of priors and likelihoods. Although it can be very boring process, I want astronomers to leave the problem to statisticians instead of using inappropriate test statistics and making creative interpretation of statistics.

Please, keep this question in your mind when you report probability: what kind of chance are you computing? The chance of e coli infection? Or the chance that A has nukes? Make sure to understand that p-values from data analysis packages does not tell you that the chance the model (A*B+C)*D is (one minus p-value)%. You don’t want to report one minus p-value from a chi-square test statistic as the chance that A has nukes.

]]>When having many pairs of variables that demands numerous scatter plots, one possibility is using parallel coordinates and a matrix of correlation coefficients. If Gaussian distribution is assumed, which seems to be almost all cases, particularly when parametrizing measurement errors or fitting models of physics, then error bars of these coefficients also can be reported in a matrix form. If one considers more complex relationships with multiple tiers of data sets, then one might want to check ANCOVA (ANalysis of COVAriance) to find out how statisticians structure observations and their uncertainties into a model to extract useful information.

I’m not saying those simple examples from wikipedia, wikiversity, or publicly available tutorials on ANCOVA are directly applicable to statistical modeling for astronomical data. Most likely not. Astrophysics generally handles complicated nonlinear models of physics. However, identifying dependent variables, independent variables, latent variables, covariates, response variables, predictors, to name some jargon in statistical model, and defining their relationships in a rather comprehensive way as used in ANCOVA, instead of pairing variables for scatter plots, would help to quantify relationships appropriately and to remove artificial correlations. Those spurious correlations appear frequently because of data projection. For example, datum points on a circle on the XY plane of the 3D space centered at zero, when seen horizontally, look like that they form a bar, not a circle, producing a perfect correlation.

As a matter of fact, astronomers are aware of removing these unnecessary correlations via some corrections. For example, fitting a straight line or a 2nd order polynomial for extinction correction. However, I rarely satisfy with such linear shifts of data with uncertainty because of changes in the uncertainty structure. Consider what happens when subtracting background leading negative values, a unrealistic consequence. Unless probabilistically guaranteed, linear operation requires lots of care. We do not know whether residuals y-E(Y|X=x) are perfectly normal only if μ and σs in the gaussian density function can be operated linearly (about Gaussian distribution, please see the post why Gaussianity? and the reference therein). An alternative to the subtraction is linear approximation or nonparametric model fitting as we saw through applications of principle component analysis (PCA). PCA is used for whitening and approximating nonlinear functional data (curves and images). Taking the sources of uncertainty and their hierarchical structure properly is not an easy problem both astronomically and statistically. Nevertheless, identifying properties of the observed both from physics and statistics and putting into a comprehensive and structured model could help to find out the causality^{[2]} and the significance of correlation, better than throwing numerous scatter plots with lines from simple regression analysis.

In order to understand why statisticians studied ANCOVA or, in general, **ANOVA (ANalysis Of VAriance)** in addition to the material in wiki:ANCOVA, you might want to check this page^{[3]} and to utilize your search engine with keywords of interest on top of ANCOVA to narrow down results.

From the linear model perspective, if a response is considered to be a function of redshift (z), then z becomes a covariate. The significance of this covariate in addition to other factors in the model, can be tested later when one fully fit/analyze the statistical model. If one wants to design a model, say rotation speed (indicator of dark matter occupation) as a function of redshift, the degrees of spirality, and the number of companions – this is a very hypothetical proposal due to my lack of knowledge in observational cosmology. I only want to point that the model fitting problem can be seen from statistical modeling like ANCOVA by identifying covariates and relationships – because the covariate z is continuous, and the degrees are fixed effect (0 to 7, 8 tuples), and the number of companions are random effect (cluster size is random), the comprehensive model could be described by ANCOVA. To my knowledge, scatter plots and simple linear regression are marginalizing all additional contributing factors and information which can be the main contributors of correlations, although it may seem Y and X are highly correlated in the scatter plot. At some points, we must marginalize over unknowns. Nonetheless, we still have some nuisance parameters and latent variables that can be factored into the model, different from ignoring them, to obtain advanced insights and knowledge from observations in many measures/dimensions.

Something, I think, can be done with a small/ergonomic chart/table via hypothesis testing, multivariate regression, model selection, variable selection, dimension reduction, projection pursuit, or names of some state of the art statistical methods, is done in astronomy with numerous scatter plots, with colors, symbols, and lines to account all possible relationships within pairs whose correlation can be artificial. I also feel that trees, electricity, or efforts can be saved from producing nice looking scatter plots. Fitting/Analyzing more comprehensive models put into a statistical fashion helps to identify independent variables or covariates causing strong correlation, to find collinear variables, and to drop redundant or uncorrelated predictors. Bayes factors or p-values can be assessed for comparing models, testing significance their variables, and computing error bars appropriately, not the way that the null hypothesis probability is interpreted.

Lastly, ANCOVA is a complete **[MADS]**.

- This is not an assuring absolute statement but a personal impression after reading articles of various fields in addition to astronomy. My readings of other fields tell that many rely on correlation statistics but less scatter plots by adding straight lines going through data sets for the purpose of imposing relationships within variable pairs
- the way that chi-square fitting is done and the goodness-of-fit test is carried out is understood by the notion that X causes Y and by the practice that the objective function, the sum of (Y-E[Y|X])^2/σ^2 is minimized
- It’s a website of Vassar college, that had a first female faculty in astronomy, Maria Mitchell. It is said that the first building constructed is the Vassar College Observatory, which is now a national historic landmark. This historical factor is the only reason of pointing this website to drag some astronomers attentions beyond statistics.

Heteroscedesticity in regression problems in astronomy has been discussed with various kind of observations since astronomers are interested in correlations and causality between variables in the data set. Perhaps, the heteroscedasticity in volatilty of finance does not have any commonality with astronomical times series data, which could be the primary reason this celebrated model does not appear in ADS. However, there are various times series models branched from ARCH that consider heteroscedasticity in errors and I hope some can be useful to astronomers for analyzing times series data with inhomogeneous errors.

[Note] All links lead to wikipedia.org

]]>

If you do not mind the **time** predictor, it is hard to believe that this is a light curve, time dependent data. At a glance, this data set represents a simple block design for the **one-way ANOVA**. ANOVA stands for **Analysis of Variance,** which is not a familiar nomenclature for astronomers.

Consider a case that you have a very long strip of land that experienced FIVE different geological phenomena. What you want to prove is that crop productivity of each piece of land is different. So, you make FIVE beds and plant same kind seeds. You measure the location of each seed from the origin. Each bed has some dozens of seeds, which are very close to each other but their distances are different. On the other hand, the distance between planting beds are quite far unable to say that plants in the test bed A affects plants in B. In other words, A and B are independent suiting for my statistical inference procedure by the F-test. All you need is after a few months, measuring the total weight of crop yield from each plant (with measurement errors).

Now, let’s look at the plot above. If you replace distance to time and weight to flux, the pattern in data collection and its statistical inference procedure matches with the one-way ANOVA. It’s hard to say this data set is designed for time series analysis apart from the complication in statistical inference due to measurement errors. How to design the statistical study with measurement errors, huge gaps in time, and unequal time intervals is complex and unexplored. It depends highly on the choice of inference methods, assumptions on error i.e. likelihood function construction, prior selection, and distribution family properties.

Speaking of ANOVA, using the F-test means that we assume residuals are Gaussian from which one can comfortably modify the model with additive measurement errors. Here I assume there’s no correlation in measurement errors and plant beds. How to parameterize the measurement errors into model depends on such assumptions as well as how to assess sampling distribution and test statistics.

Although I know this Crab nebula data set is not for the one-way ANOVA, the pattern in the scatter plot drove me to test the data set. The output said to reject the null hypothesis of statistically equivalent flux in FIVE time blocks. The following is R output without measurement errors.

Df Sum Sq Mean Sq F value Pr(>F)

factor 4 4041.8 1010.4 143.53 < 2.2e-16 ***

Residuals 329 2316.2 7.0

If the gaps are minor, I would consider time series with missing data next. However, the missing pattern does not agree with my knowledge in missing data analysis. I wonder how astronomers handle such big gaps in time series data, what assumptions they would take to get a best fit and its error bar, how the measurement errors are incorporated into statistical model, what is the objective of statistical inference, how to relate physical meanings to statistical significant parameter estimates, how to assess the model choice is proper, and more questions. When the contest is over, if available, I’d like to check out any statistical treatments to answer these questions. I hope there are scientists who consider similar statistical issues in these data sets by the INTEGRAL team.

]]>Whenever LHC (Large Hardron Collider, several posts from the slog) is publicly advertised, the grand scale of accelerator (26km) is the center of attention for these unprecedented controlled experiments for particle physics researches. Controlled in conjunction of factorization in statistical experiment designs to eliminate unknowns and to factor in external components (covariates, for example). By the same token, not the grand scale of the accelerator, but the detector and controlled/isolated system, and its designs for collecting data seem most important to me. Without searching for reports, I want to believe that many countless efforts have been put into detectors and data processors, which seem to be overshadowed because of the grand scale of the accelerating tube.

For fun and honoring the speaker’s showing it to the public, you might like to see this youtube rap again.

As a statistician, curious about the detector and the physics leading the designs of such expensive and extreme studies, I was more interested in knowing further on

- how data are collected and
- how study was designed or what are the hypotheses

not the scale of the accelerator nor the feeling inside the 2 degree vacuum tube. There was no clue to find out partial answers to these questions through the public lecture. So, I hope some slog readers could help me understand better the following issues spawn from this public lecture. Let me talk my questions statistically and try to associate them with the image above.

The uncertainty principle by physicists is written roughly as follows:

Δ E Δ t > h

where h is Plank’s constant. Instead of energy and time, Δ x Δ p > h, location and momentum is used as well. This principle is more or less related to **precision** or **bias.** One cannot measure things with 100% precision. In other word, in measuring quantities from physics, there is no exact unbiased estimator (asymptotically unbiased is a different context). In order to observe subparticle in a short time scale, the energy must be high. Yet, unless the energy is extremely high, the uncertainty of when the event happen is huge so that no one can assign exact numerals when the eveny happen. This uncertainty principle is the primary reason for such large accelerator so that particle can gain tremendous energy and therefore, an observer can determine the location and the time of the event (collision, subsequent annihilation, and scatters of subparticles) with uncertainty from the principle of physics.

I’ve always had a conflicting notion about uncertainty in statistics and astronomy. The uncertainty from the Heigenberg’s uncertainty principle and the uncertainty from measure theory and the stochastic nature of data. Although the word is same but the implications are different. The former describes precision as discussed above and the later accuracy (Bevington’s book describes the difference between precision and accuracy, if I recall correctly). When an astronomer has data and computes a best fit and one σ from the chi-square, that σ is quantifying the uncertainty/scale of the Gaussian distribution, a model for residuals that the astronomer has chosen for fitting the data with the model of physics.

When it comes to **measurement errors** it’s more like discussing precision, not accuracy or the scale parameters of distribution functions (family of distributions). Either measurement errors, or computing uncertainty via chi-square minimization or Bayesian posterior distribution estimation, most of procedures to understand uncertainties in astronomical literature is based on parametrizating uncertainties. Luckily we know that Gaussian and Poisson distribution for parametrization works almost all cases in astronomy. Yet, my understanding is that there’s not much distinction between precision and accuracy in astronomical data analysis, not much awareness about the difference between the uncertainty principle from physics and the uncertainty by the stochastic nature of data. This seems causing biased or underestimated results. With jargon of statistics, instead of overlooking, the issues of model mis-identification and model uncertainty^{[1]} of other disciplines are worth to be looked into to narrow the gap.

As a statistician, I approach the problem of uncertainty hierarchically. Start from the simplest that sigma is known and used the given sigma as the ground truth. If statistics does not advocate such condition, then move to a direction of estimating it, and testing whether it is homogenous or heterogeneous error, etc to understand the sampling distribution better and device statistics accordingly. During the procedure, I’ll add a model for measurement errors. If Gaussian, adding statistical uncertainty terms and measurement error terms works well, an easy convolution of Gaussian distributions (see my why gaussianity?). I might have to ignore some factors in my hierarchical modeling procedure if their contribution is almost none but the hierarchical model becomes too complicated for such mediocre gain. Instead, it would be easy to follow the rule of thumb strategy developed by astronomers with great knowledge and experience. Anyway, if parametric strategy does not work, I’ll employ nonparametric approaches. Focusing on Bayesian methodology, it’s like modeling hierarchically from parametric likelihoods and informative, subjective priors to nonparametric likelihoods, objective, noninformative priors. Overall, these are efforts of modeling both physics and errors assuming that measurements are taken accurately; multiple measurements and collecting many photons quantifies how accurately the best fit is obtained. On the other hand, under the uncertainty principle, intrinsic measurement bias (unknown but bounded) is inevitable. Not statistics but physics could tell how precise measurements can be taken. Still it’s uncertainty but different kind. I sometimes confront astronomers mixing strategies of calibrating the uncertainties of different grounds and also I got confused and lost.

I’d like to say that multiple observations (the amount of degrees of freedom in chi-square minimizations, and bins in histograms) are realizations of coupling of bias and variance (precision and accuracy; measurement errors and statistical uncertainty in sigma/error bar) from which the importance of proper parametrization and regularized optimization is never enough to be emphasized to get that right 68% coverage of the uncertainty in a best fit, instead of simple least square or chi-square. Statisticians often discuss the mean square error (see my post [MADS] Law of Total Variance) than the error bar to account for the overall uncertainty in a best fit.

I’m afraid that my words sound gibberish – I hope that statisticians with good commands of literal and scientific languages discuss the uncertainty of physics and of statistics and how it affects choosing statistical methods and drawing statistical inference from (astro)physical data. I’m also afraid that people continue going for one sigma by feeding the data into the chi-square function and adding speculated systematic errors (say 15% of the computed sigma from the chi-square minimization) without second thoughts on the implications of uncertainty and on assumptions for its quantification methods.

I wonder how the shot of above image is taken when protons are colliding. There should be a tremendous number of subparticles generated from the collisions of many protons. Unless there is a single photo frame that takes traces of all those particles (collision happens in 3D camera chamber? Perhaps, they use medical imaging, tomography techniques but processing time wise I doubt its feasibility), I think those traces are the reconstruction of multiple cross sectional shots. My biggest concern was how each line and dot you see from the picture can be associated to a certain particle. Physics and standard model can tell that their trajectories are distinguishable, depending on their charges, types, and mass but there are, say, millions of events happening in the matter of extremely short time scale! How certain one can say this is the trace of a certain particle.

The speaker discussed massive data and uncertainty as another challenge. So many procedures in terms of (statistical) data analysis seem not explored yet although theory of physics is very sophisticated and complicated. If physics is an deductive/deterministic science, then statistics is inductive/stochastic. I personally believe that theories are able to conclude the same from both physical and statistical experiments. I guess now it’s time to prove such thesis with data and statistics and it starts with identifying particles’ traces and their meta-data.

To create an image of many particles as above when we have the identifiability issue and the uncertainty in time and space, I wonder how pictures are constructed from each collision. The lecturer used an analogy of a dodecaheron calendar with missing months to deliver the feeling of image reconstruction in particle physics. Whenever I see such images of many ray traces and hear promises that LHC will deliver, I’ve been wondering how they reconstruct those traces after the particle collisions and measuring times of events. Thanks to the uncertainty principle and its mediocre scale, there must be some tremendous constraints and missings. How much information is contained in that reconstructed image? How much information loss is inevitable due to those constraints. It would be very interesting to know each step from detectors to images and find statistical and information theoretical challenges.

Colliding one proton to the other seems ideal to discover the unification theory advocating the standard model by tracing individual relatively small number of particles. If so, the picture above could have been simpler than what it looks. Unfortunately, it’s not the case and huge number of protons are sent for collision. I’ve kept heard the gigantic size of data that particle physics experiments create. I wondered how such massive data are processed while the speaker showed the picture of one of world best computing facilities at CERN. Not just for automated pipelining but for processing, cleaning, summarizing, and evaluating from statistical aspects would require clever algorithms to make most of those multiple processors.

I still think that quests for searching particles via LHC are classical decision theoretic hypothesis testing problem: the null hypothesis is no new(unobserved particle) vs. the alternative hypothesis contains the model/information of new particle by the theory (SUSY, antimatter etc). Statistically speaking, in order to observe such matter or to reject null hypothesis comfortable, we need statistically powerful tests, where Neyman-Pearson test/construction is often mentioned. One needs to design an experiment that is powerful enough (power here has two connotations: one is physically powerful enough to make proton have high energy so that one can observe particles in the brief time and space frame, and the other is statistically powerful such that if such new particles exist, the test is powerful enough to reject the null hypothesis with decent power and false discovery rate). How to transcribe data and models into a powerful test seems still an open question to physicists. You can check discussions from the links in the PHYSTAT LHC 2007 post.

In the similar context of source detection in astronomy, how do physicists define and segregate source (particle of interest, higgs, for example) from background? It’s also related to identifiability of particles shown in the picture. How can a physicist see an rare event among tons of background events which form a wide sampling distribution or in other words, that have a huge uncertainty as an ensemble. Also the source event has its own uncertainty because of the uncertainty principle. How to form robust thresholding methods? How to develop Bayesian learning strategies for better detection? Perhaps the underlying (statistical) models are different for particle physics and for astronomy, but the basic idea of how to apply statistical inference seem not much different from the fact that 1. background can be more dominant, 2. background is used for the null hypothesis, and 3. the source distribution comprise the alternative distribution. It’ll be very interesting to collect statistics for source detection and formalize those methods so that consistent source detection results can be achieved by devising statistics suits the data types.

I’d like to quote two phrases from the public lecture.

finding an

atomin a haystack

A cliche for all groups of scientists. I’ve heard “finding a needle in a haystack” so many times because of the new challenge that we confronted from the information era. On the other hand, replacing “needle” to “atom” was new to me. Unfortunately, my impression is that physicists are not equipped with tools to do such data mining either a needle or an atom. I wonder what computer scientists can offer them for this more challenging quest to answer the fundamental question about the universe.

it’s an exciting time to be physicists

The speaker used physicists but I’ve heard the same sentence from astronomers and statisticians with their professions replaced with physicists. After hearing it too often from various people, I became doubtful since I cannot feel such excitements imminently. It feels like after hearing **change** too often before it happens, one cannot feel the real progress of changes. Words always travel faster than actions. Sometimes words can be just empty promise. That’s why I thought it’s a cliche and irony. Perhaps it’s due to that fact I’m at the intersection of the combination of these scientist sets, not at the center of any set. Ironically, defining boundaries is also fuzzy nowadays. Perhaps, I’m already excited and afraid of transiting down to a lower energy level. Anyway, being enthusiastic and living in an exciting time seems different matter.

I haven’t heard the news about Phystat 2009, whose previous meetings occurred every odd year in the 21st century. Personally, their meeting agenda and subsequent proceedings were very informative and offered clues to my questions. I hope the next meeting soon to be held.

- the notions of model uncertainty among astronomers and statisticians are different. Hopefully, I have time to talk about it

Full text search in ADS with “**null hypothesis probability**” yield 77 related articles (link removed. Search results are floating urls, probably?). Many of them contained the phrase “**null hypothesis probability**” as it is. The rest were in the context of “given the **null hypothesis, the probability** of …” I’m not sure this ADS search result includes **null hypothesis probability** written in tables and captions. It’s possible more than 77 could exist. The majority of articles with the “null hypothesis probability” are just reporting numbers from screen outputs from the chosen data analysis system. Discussions and interpretations of these numbers are more focused toward reduced χ^{2} close to ONE, astronomers’ most favored model selection criterion. Sometimes, I got confused with the goal of their fitting analysis because the driven force is that “*make the reduced chi-square closed to one and make residuals look good*“. Instead of being used for statistical inferences and measures, a statistic works as an objective function. Numerically (chi-square) or pictorially (residuals) is overshadowed the fundamentals that you observed relatively low number of photons under Poisson distribution and those photons are convolved with complicated instruments. It is possible to underestimated statistically, the reduced chi-sq is off from the unity but based on robust statistics, one still can say the model is a good fit.

Instead of talking about the business of the chi-square method, one thing I wanted to point out from this “null hypothesis probability” investigation is that there was a big presenting style and field distinction between papers of **the null hypothesis probability** (spectral model fitting) and of **given the null hypothesis, the probability of** (cosmology). Beyond this casual and personal finding about the style difference, the following quotes despaired me because I couldn’t find answers from statistics.

- MNRAS, v.340, pp.1261-1268 (2003):
*The temperature and distribution of gas in CL 0016+16 measured with XMM-Newton*(Worrall and Birkinshaw)

With reduced chi square of 1.09 (chi-sq=859 for 786 d.o.f) the null hypothesis probability is 4 percent but this is likely to result from the high statistical precision of the data coupled with small remaining systematic calibration uncertainties

I couldn’t understand why p-value=0.04 is associated with high statistical precision of the data coupled with small remaining systematic calibration uncertainties. Is it a polite way to say the chi-square method is wrong due to systematic uncertainty? Or does this mean the stat uncertainty is underestimated due the the correlation with sys uncertainty? Or other than p-value, does the null hypothesis probability has some philosophical meanings? Or … I may go on with strange questions due to the statistical ambiguity of the statement. I’d appreciate any explanation how the p-value (the null hypothesis probability) is associated with the subsequent interpretation.

Another miscellaneous question is that If the number (the null hypothesis probability) from software packages is unfavorable or uninterpretable, can we attribute such ambiguity to systematical error?

- MNRAS, v. 345(2),pp.423-428 (2003): Ir
*on K features in the hard X-ray XMM-Newton spectrum of NGC 4151*(Schurch, Warwick, Griffiths, and Sembay)

The result of these modifications was a significantly improved fit (chi-sq=4859 for 4754 d.o.f). The model fit to the data is shown in Fig. 3 and the best-fitting parameter values for this revised model are listed as Model 2 in Table 1. The null hypothesis probability of this latter model (0.14) indicates that this is a reasonable representation of the spectral data to within the limits of the instrument calibration.

What is the rule of thumb interpretation of p-values or this null hypothesis probability in astronomy? How one knows that it is

*reasonable*as authors mentioned? How one knows the limits of the instrument calibration and compares quantitatively? How about the degrees of freedom? Some thousands! So large. Even with a million photons, according to the guideline for the number of bins^{[1]}I doubt that using chi-square goodness of fit for data with such large degree of freedom makes the test too conservative. Also, there should be distinction between the chi square minimization tactic and the chi square goodness of fit test. Using same data for both procedures will introduce bias. - MNRAS, v. 354, pp.10-24 (2004):
*Comparing the temperatures of galaxy clusters from hdrodynamical N-body simulations to Chandra and XMM-Newton observations*(Mazzotta, Rasia, Moscardini, and Tormen)

In particular, solid and dashed histograms refer to the fits for which the null hypothesis has a probiliy >5 percent (statistically acceptable fit) or <5 percent (statistically unacceptable fit), respectively. We also notice that the reduced chi square is always very close to unity, except in a few cases where the lower temperature components is at T~2keV, …

The last statement obscures the interpretation even more to the statement related to what “statistically (un)acceptable fit” really means. The notion of how good a model fits to data and how to test such hypothesis from the statistics standpoint seems different from that of astronomy.

- MNRAS, v.346(4),pp.1231-1241:
*X-ray and ultraviolet observations of the dwarf nova VW Hyi in quiescence*(Pandel, Córdova, and Howell)

As can be seen in the null hypothesis probabilities, the cemekl model is in very good agreement with the data.

The computed null hypothesis probabilities from the table 1 are 8.4, 25.7, 42.2, 1.6, 0.7*, and 13.1 percents (* is the result of MKCFLOW model, the rest are CEMEKL model). Probably, the criterion to declare a good fit is a p-value below 0.01 so that CEMEKL model cannot be rejected but MKCFLOW model can be rejected. Only one MKCFLOW which by accident resulted in a small p-value to say that MKCFLOW is not in agreement but the other choice, CEMEKL model is a good model. Too simplified model selection/assessment procedure. I wonder why CEMEKL was tried with various settings but MKCFLOW was only once. I guess there’s is an astrophysical reason of executing such model comparison study but statistically speaking, it looks like comparing measurement of 5 different kinds of oranges and one apple measured by a same ruler (the null hypothesis probability from the chi-square fitting). From the experimental design viewpoint, this is not well established study.

- MNRAS, 349, 1267 (2004):
*Predictions on the high-frequency polarization properties of extragalactic radio sources and implications for polarization measurements of the cosmic microwave background*(Tucci et al.)

The correlation is less clear in the samples at higher frequencies (r~ 0.2 and a null-hypothesis probability of >10^{-2}). However, these results are probably affected by the variability of sources, because we are comparing data taken at different epochs. A strong correlation (r>0.5 and a null-hypothesis probability of <10^{-4}) between 5 and 43 GHz is found for the VLA calibrators, the polarization of which is measured simultaneously at all frequencies.

I wonder what test statistic has been used to compute those p-values. I wonder if they truly meant p-value>0.01. At this level, most tools offer more precise number so as to make a suitable statement. The p-value (or the “null hypothesis probability”) is for testing whether r=0 or not. Even r is small, 0.2, still one can reject the null hypothesis if the threshold is 0.05. Therefore, >10^{-2} only add ambiguity. I think point estimates are enough to report the existence of weak and rather strong correlations. Otherwise, reporting both p-values and powers seems more appropriate.

- A&A, 342, 502 (1999):
*X-ray spectroscopy of the active dM stars: AD Leo and EV Lac*

(S. Sciortino, A. Maggio, F. Favata and S. Orlando)This fit yields a value of chi square of 185.1 with

**145**υ corresponding to a null-hypothesis probability of 1.4% to give an adequate description of the AD Leo coronal spectrum. In other words the adopted model does not give an acceptable description of available data. The analysis of the uncertainties of the best-fit parameters yields the 90% confidence intervals summarized in Table 5, together with the best-fit parameters. The confidence intervals show that we can only set a lower-limit to the value of the high-temperature. In order to obtain an acceptable fit we have added a third thermal MEKAL component and have repeated the fit leaving the metallicity free to vary. The resulting best-fit model is shown in Fig. 7. The fit formally converges with a value of chi square of 163.0 for**145**υ corresponding to a probability level of ~ 9.0%, but with the hotter component having a “best-fit” value of temperature extremely high (and unrealistic) and essentially unconstrained, as it is shown by the chi square contours in Fig. 8. In summary, the available data constrain the value of metallicity to be lower than solar, and they require the presence of a hot component whose temperature can only be stated to be higher than log (T) = 8.13. Available data do not allow us to discriminate between the (assumed) thermal and a non-thermal nature of this hot component.

…The fit yields a value of [FORMULA] of 95.2 (for 78 degree of freedom) that corresponds to a null hypothesis probability of 2.9%, i.e. a marginally acceptable fit. The limited statistic of the available spectra does not allow us to attempt a fit with a more complex model.After adding MEKAL, why the degree of freedom remains same? Also, what do they mean by

**the limited statistic of the available spectra**? - MNRAS348, 529 (2004):
*Powerful, obscured active galactic nuclei among X-ray hard, optically dim serendipitous Chandra sources*(Gandhi, Crawford, Fabian, Johnstone)

…, but a low f-test probability for this suggests that we cannot constrain the width with the current data.

While the rest frame equivalent width of the line is close to 1keV, its significance is marginal (f-test gives a null hypothesis probability of 0.1).Without a contingency table, nor comparing models, I was not sure how they executed the F-test. I could not find two degrees of freedom for the F-test. From the XSPEC’s account for the F-test (http://heasarc.gsfc.nasa.gov/docs/xanadu/xspec/manual/XSftest.html), we see two degrees of freedom, without them, no probability can be computed. Their usage of the F-test seems unconventional. The conventional application of the F-test is for comparing effects of multiple treatments (different levels of drug dosage including placebo); otherwise, it’s just a chi square goodness of fit test or t-test.

- Another occasion I came across is interpreting the null hypothesis probability of 0.99 as an indicator of a good fit; well, it’s overfitting. Not only too small null hypothesis probability but also close to one null hypothesis probability should raise a flag for cautions and warnings because the later indicating you are overdoing (too many free parameters for example).

There are some residuals of ambiguity after deducing the definition of the null hypothesis probability by playing with XSPEC and finding cases how this null hypothesis probability is used in literature. Authors sometimes added creative comments in order to interpret the null hypothesis probability from their data analysis, which I cannot understand without statistical imagination. Most can be overlooked, perhaps. Or instead, they are rather to be addressed to astronomers with statistical knowledge to resolve my confusion by the **null hypothesis probability**. I expect comments on how to view these quotes with statistical rigor from astronomers. The listed are personal. There are some more I really didn’t understand the points but many were straightforward in using the null hypothesis probabilities as p-values in statistical inference under the simple null hypothesis. I just listed some to display my first impression on these quotes most of which I couldn’t draw statistical caricatures out of them. Eventually, I hope some astronomers straighten the meaning and the usage of the **null hypothesis probability** without overruling basics in statistics.

I only want to add a caution when using the reduced chi-square as a model selection criteria. An indicator of a good-fit from a reduced chi^2 close to unity is only true when grouped data are independent so that the formula of degrees of freedom, roughly, the number of groups minus the number of free parameters, is valid. Personally I doubt this rule applied in spectral fitting that one cannot expect independence between two neighboring bins. In other words, given a source model and given total counts, two neighboring observations (counts in two groups) are correlated. The grouping rules like >25 or S/N>3 do not guarantee the independent assumption for the chi-square goodness of fit test although it may sufficient for Gaussian approximation. Statisticians devised various penalty terms and regularization methods for model selection that suits data types. One way to look is computing proper degrees of freedom, called **effective degrees of freedom** instead of n-p, to reflect the correlation across groups because of the chosen source model and calibration information. With a large number of counts or large number of groups, unless properly penalized, it is likely that the chi-square fit is hard to reject the null hypothesis than a statistic with smaller degrees of freedom because of the curse of dimensionality.

- Mann and Wald (1942), “On the Choice of the Number of Class Intervals in the Application of the Chi-square Test” Annals of Math. Stat. vol. 13, pp.306-7.