The AstroStat Slog » EDA

[MADS] Parallel Coordinates

hlee — Wed, 29 Jul 2009 06:02:18 +0000

Speaking of XAtlas from my previous post, I tried another visualization tool, parallel coordinates, on the Capella observations and on two other stars with multiple observations (AR Lac and IM Peg). As discussed in [MADS] Chernoff face, a full description of the catalog can be found on the XAtlas website. I chose these stars because, among low-mass stars, IM Peg (HD 216489, 8 times) and AR Lac (6 times, although at different phases) are the most frequently observed after Capella (I showed 16 observations). I was curious which variation is dominant: within a star (statistical variation) or between stars (Capella, IM Peg, AR Lac). What would they look like in the parameter space of high-resolution grating spectroscopy from Chandra?

With 13 X-ray line and/or continuum ratios, a typical data display would be the 13-choose-2 (78) pairwise scatter plots, as follows. Note that the upper left panels use three colors for the classification purpose (red: AR Lac, blue: IM Peg, green: Capella), while the lower right ones are drawn without color for the clustering analysis purpose. These scatter plots are essential to exploratory data analysis, but with this many panels they do not convey information efficiently. In astronomical journals, thanks to astronomers’ a priori knowledge, a few pairs of important variables are selected and displayed, which reduces the visualization complexity dramatically. Unfortunately, I cannot select only the physically important variables.

I am not a knowledgeable astronomer, but I believe in reducing dimensionality according to the research objective. The goal is set by asking questions like “what do you want from this multivariate data set?” Possible answers include classification (a classification rule or regression model that separates the three stars, Capella, AR Lac, and IM Peg), clustering (are the three stars naturally clustered into three groups, or is there a different number of clusters even if they are not clearly visible in the scatter plots above?), hypothesis testing (are they the same type of star or different?), point estimation with confidence intervals (means and their error bars), and variable selection (or dimension reduction). So far no statistical question is well defined (which can be a good thing for new discoveries). Prior to any confirmatory data analysis, we had better find a way to display this multidimensional data efficiently. I thought parallel coordinates would serve the purpose well, but surprisingly they have never been discussed in the astronomical literature; at least, they did not appear in ADS.

Each of the 13 variables was either normalized (left) or standardized (right). The parallel coordinate plot looks both simpler and more informative. Capella observations occupy a more clearly separable region than those of the other stars. It is easy to see that one Capella observation is an obvious outlier from the rest, which is hardly visible in the scatter plots. It is clear that discriminant analysis or classical support vector machine type classification methods cannot separate AR Lac and IM Peg. Likewise, clustering based on distance measures of dissimilarity cannot reveal a natural grouping of these two stars, whereas the Capella observations form their own cluster. In my opinion, parallel coordinates provide more information about multidimensional data (dim>3) in a simpler way than scatter plots of multivariate data. They naturally show highly correlated variables within observations of the same star or across all target stars. This insight from visualization is key to devising methods of variable selection or dimension reduction for the data set.
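The two rescalings mentioned above can be sketched as follows. This is only an illustration under assumptions: I take "normalized" to mean min-max scaling to [0, 1] and "standardized" to mean z-scores, which the post does not spell out. The function names and the toy numbers are made up and are not XAtlas values.

```python
from statistics import mean, stdev

def normalize(column):
    """Min-max scale one variable to [0, 1] (assumed meaning of 'normalized')."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: nothing to spread out
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def standardize(column):
    """Convert one variable to z-scores (assumed meaning of 'standardized')."""
    m, s = mean(column), stdev(column)
    if s == 0:
        return [0.0 for _ in column]
    return [(x - m) / s for x in column]

def rescale_table(rows, scaler):
    """Apply `scaler` to every column of a table given as a list of rows."""
    columns = [list(col) for col in zip(*rows)]
    scaled = [scaler(col) for col in columns]
    return [list(row) for row in zip(*scaled)]

# Toy table: 3 observations (rows) by 3 hypothetical ratios (columns).
toy = [
    [1.0, 10.0, 0.2],
    [2.0, 30.0, 0.4],
    [3.0, 20.0, 0.9],
]
normalized = rescale_table(toy, normalize)
standardized = rescale_table(toy, standardize)

# In a parallel coordinate plot each row becomes one polyline: the x axis is
# the variable index and the y axis is the rescaled value.
for row in normalized:
    print(list(enumerate(row)))
```

Rescaling matters because the axes of a parallel coordinate plot share one vertical scale; without it, a variable with a large numerical range would flatten all the others.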

My personal opinion is that, for lack of an efficient and informative tool for visualizing complex high-resolution spectra across many detailed metrics, smoothed bivariate (trivariate at most) summaries such as hardness ratios and quantiles are used to display X-ray spectral data instead. I’m not saying that parallel coordinates are the ultimate answer to visualizing multivariate data, but I’d like to emphasize that this method is an informative, intuitive, and simple way to understand the structure of a relatively high-dimensional data cloud.

Parallel coordinates have a long history. The earliest discussion I found dates from the 1880s. The method was popularized by Alfred Inselberg and gained recognition among statisticians through Edward Wegman (1990, Hyperdimensional Data Analysis Using Parallel Coordinates). Colorful images of the Sun, stars, galaxies, and their coronae, interstellar gas, and jets are the eye catchers. I hope that data visualization tools gain an equal share of the spotlight, since they summarize data and deliver a lot of information. If images are well-decorated cakes, then these tools from EDA are sophisticated and well-baked cookies.

——————- [Added]
According to

[arxiv:0906.3979] The Golden Age of Statistical Graphics
Michael Friendly (2008)
Statistical Science, Vol. 23, No. 4, pp. 502-535

it is 1885. Not knowing French (if I did, I would want to read Gauss’ paper before anything else), I don’t know what the reference is about.

[MADS] Chernoff face

hlee — Thu, 02 Apr 2009 16:00:41 +0000

I cannot remember when I first met the Chernoff face, but it hooked me instantly. I always hoped to come across multivariate data from astronomy to which this charming EDA method could be applied. Then that eagerness somehow faded without my noticing. Sadly, this was mainly due to my absent-mindedness.

After meeting Prof. Herman Chernoff unexpectedly (I didn’t know he is Professor Emeritus at Harvard), the urge revived, but I still didn’t have data. Alas, another case of absent-mindedness: I don’t understand why I didn’t realize until today that I already had data for trying Chernoff faces, namely XAtlas. The data and their full description can be found on the XAtlas website. For Chernoff faces, the references suggested in Wiki:Chernoff face are good. I believe some folks are already familiar with Chernoff faces from a New York Times article last year, listed in Wiki (or is that a subset characterized by baseball lovers?).

Capella is an X-ray bright star observed multiple times for Chandra calibration. In the figures below I listed 16 ObsIDs, one at each face, out of 18+ Capella observations (the last time I checked the Chandra Data Archive, 18 Capella observations were available). These 16 are high-resolution observations from which various metrics, such as interesting line ratios and line-to-continuum ratios, can be extracted. I was told that optically it is hard to find any evidence that Capella experienced catastrophic changes during the Chandra mission (about 10 years in orbit), but the story in X-rays may be quite different. Even in such a dismally short time period (10 years for a star is a flash or less), Capella could have revealed short-timescale high-energy activity via Chandra. I just wanted to illustrate that Chernoff faces could help visualize such changes or any peculiarities through interpretation-friendly facial expressions (studies have confirmed babies’ ability to recognize facial expressions). So, what do you think? Do the faces look similar or different to you? Can you offer me astronomical reasons why a certain face (ObsID) is different from the rest?

p.s. To draw these Chernoff faces, check the descriptions of the R functions faces() (yields the left figure) and faces2() (yields the right figure). There are other variations, and other data analysis systems offer differently styled tools for drawing Chernoff faces to explore multivariate data. Requests for plots in PDF are welcome. These JPEG files look too coarse on my screen.

p.p.s. The variables used for these faces are line ratios and line-to-continuum ratios. The order of these input variables could change the countenance, but the impressions from the faces will not change (a face with distinctive shapes will still look different from the other faces after the order of the metrics/variables is scrambled or when a different Chernoff face illustration tool is used). The mapping, say from an astronomical metric to the length of the lips, was not studied in this post.

p.p.p.s. Some data points are statistical outliers, and I am not sure how to explain the strange numbers (unrealistic values for line ratios). I hope astronomers can help me interpret those peculiar numbers in the line/continuum ratios. My role is to show that statistics can motivate astronomers toward new discoveries and to offer different graphics tools for enhancing visualization. I hope these faces motivate some astronomers to look into Capella in XAtlas (and beyond) in detail, through different spectacles, and to find the reasons for the different facial expressions in the Capella X-ray observations. In particular, ObsID 1199 is the most questionable to me.

Books – a boring title

hlee — Fri, 25 Jan 2008 16:53:21 +0000

I have been observing some misconceptions about statistics, and an evolution of statistical nomenclature in astronomy, which I believe are attributable to the lack of references in the astronomical community. There are some textbooks designed for junior/senior science and engineering students, which are likely unknown to astronomers. As far as their examples go, these books are not suitable for astronomers, to my knowledge. Although I never expect astronomers to study standard graduate (mathematical) statistics textbooks, I do wish astronomers would go beyond Numerical Recipes (W. H. Press, S. A. Teukolsky, W. T. Vetterling, & B. P. Flannery) and Data Reduction and Error Analysis for the Physical Sciences (P. R. Bevington & D. K. Robinson). Here are some good ones written by astronomers, engineers, and statisticians:

The motivation for writing this posting originated from Vinay’s recommendation: Practical Statistics for Astronomers (J.V.Wall and C.R.Jenkins), which provides many statistical insights and caveats that astronomers tend to ignore. Without looking at the error distribution and the properties of the data, astronomers jump into chi-square and correlation. Anyone who reads the book will be careful about adopting the statistics of common practice in astronomy, which were developed many decades ago and are founded on strong assumptions not compatible with modern data sets. The book addresses many concerns about astronomers’ practice that have been growing in my mind, and it introduces various statistical methods applicable in astronomy.

The viewpoint of astronomers who have read this book in full but have had no classroom statistics education would be different from mine. The book mentions unbiasedness, consistency, closeness, and robustness of statistics, which normally are neither discussed nor proved in astronomy papers. Therefore, those readers may miss the insights, caveats, and between-the-lines content of the book, which I care about. To reduce this gap, and for a quick and easy understanding of classical statistics, I recommend Cartoon Guide to Statistics (Larry Gonick & Woollcott Smith) as a first step. This cartoon book reinforces the fundamentals of statistics in a fun and friendly manner, and provides everything that rudimentary textbooks offer.

If someone wants to go beyond classical statistics (so-called frequentist statistics) and learn about the popular Bayesian statistics, astronomy professor Phil Gregory’s Bayesian Logical Data Analysis for the Physical Sciences is recommended. If one would like to know a little more about the modern statistics of frequentists and Bayesians, All of Statistics (Larry Wasserman) is recommended. I realize that textbooks for non-statistics students are too thick to go through in a short time (the book I used for teaching senior engineering students at Penn State was Probability and Statistics for Engineering and the Sciences by Jay L. Devore, 4th and 5th editions, and it was about 600 pages; the current edition is 736 pages). One well-received textbook for graduate students in electrical engineering is Probability, Random Variables and Stochastic Processes (A. Papoulis & S.U. Pillai). I remember that the book offers a rather less abstract definition of measure, along with practical examples (personally, I found its material on Hermite polynomials useful).

For casual reading about statistics and its 20th-century history, The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century (D. Salsburg) is quite nice.

Statistics is not just for best-fit analysis and error bars. It is a wonderful telescope that extracts correct information when it is pointed carefully at the right target according to the manual. When statistics is understood correctly, it removes atmospheric and other blurring factors. It is neither a black box nor magic, as many people think.

The era of treating everything as Gaussian ended decades ago. Because of the central limit theorem and the delta method (a good example is the log transformation), many statistics asymptotically follow the normal (Gaussian) distribution, but there are various other families of distributions. Because of possible bias in the chi-square method, the error bar cannot guarantee the nominal coverage, such as 95%. There are also nonparametric statistics, known for their robustness; they may be less efficient than statistics that assume a distribution family, yet they do not require model assumptions. Also, Bayesian statistics works wonderfully if correct prior information, suitable likelihood models, and computing power for hierarchical models and numerical integration are available.
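To make the delta method remark concrete, here is a small simulation sketch for the log transformation. The setup is my own assumption for illustration: exponential data with mean mu (so sigma = mu), and hypothetical function names. The delta method predicts that log of the sample mean has standard deviation of about |g'(mu)| sigma / sqrt(n) = sigma / (mu sqrt(n)).

```python
import math
import random
from statistics import stdev

def simulate_log_mean_sd(n, reps, mu, seed=0):
    """Empirical sd of log(sample mean) over `reps` exponential samples of size n."""
    rng = random.Random(seed)
    logs = []
    for _ in range(reps):
        # expovariate takes the rate, i.e. 1 / mean
        sample_mean = sum(rng.expovariate(1.0 / mu) for _ in range(n)) / n
        logs.append(math.log(sample_mean))
    return stdev(logs)

def delta_method_sd(n, mu, sigma):
    """Delta method approximation for g = log: |g'(mu)| * sigma / sqrt(n)."""
    return sigma / (mu * math.sqrt(n))

n, reps, mu = 100, 5000, 3.0
empirical = simulate_log_mean_sd(n, reps, mu)
predicted = delta_method_sd(n, mu, sigma=mu)   # exponential: sigma equals mu

print(f"empirical sd of log(mean): {empirical:.4f}")
print(f"delta method prediction:   {predicted:.4f}")
```

The two numbers should agree closely for moderate n, even though the underlying data are far from Gaussian; the approximation degrades for small n or for heavy-tailed data.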

Before jumping into chi-square for fitting and testing at the same time, exploratory data analysis is required to avoid introducing bias, to understand the data better, and to find a suitable statistic and its assumptions. Exploratory data analysis starts from simple scatter plots and box plots. A little statistical care for the data and a genuine interest in what statistical methods really assume are all I am asking for. I do hope that these books can help make these wishes come true.
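One widely known instance of chi-square bias can be simulated in a few lines. This is a sketch under my own assumptions, not necessarily the case discussed in the posts linked from this page: Poisson counts with a constant true mean, fitted by minimizing the chi-square that uses the observed counts as variances (often called Neyman's chi-square). For a constant model that minimizer is the harmonic mean of the counts. All names and numbers below are illustrative.

```python
import math
import random

def poisson_draw(rng, mu):
    """Draw one Poisson(mu) count with Knuth's multiplication method."""
    limit = math.exp(-mu)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def neyman_chi2_constant(counts):
    """Constant m minimizing sum((n_i - m)^2 / n_i), i.e. the harmonic mean.

    Zero counts are replaced by 1 to avoid division by zero (an ad hoc choice).
    """
    return len(counts) / sum(1.0 / max(n, 1) for n in counts)

def sample_mean(counts):
    """Maximum likelihood estimate of a constant Poisson mean."""
    return sum(counts) / len(counts)

TRUE_MEAN, BINS, REPS = 20.0, 50, 2000
rng = random.Random(1)
neyman, mle = [], []
for _ in range(REPS):
    counts = [poisson_draw(rng, TRUE_MEAN) for _ in range(BINS)]
    neyman.append(neyman_chi2_constant(counts))
    mle.append(sample_mean(counts))

avg_neyman = sum(neyman) / REPS
avg_mle = sum(mle) / REPS
print(f"true mean {TRUE_MEAN}, chi-square fit {avg_neyman:.2f}, sample mean {avg_mle:.2f}")
```

Averaged over many simulated data sets, the chi-square estimate sits noticeably below the true mean while the sample mean does not, because bins that fluctuate low are given small variances and therefore large weights.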

—————————————————————————-
[1.] Most of the links to books go to amazon.com, but I have no personal affiliation with the company.

[2.] In addition to the previous posting on chi-square, what is so special about chi square in astronomy, I’d like to mention possible bias in chi-square fitting and testing. It is well known that using the same data set both for fitting (which yields the parameter estimates that astronomers call best-fit values and error bars) and for testing based on those parameter estimates introduces bias, so that the best fit is biased away from the true parameter value and the error bar does not achieve the intended coverage. See the problem in Aneta’s post, an example of chi2 bias in fitting x-ray spectra.

[3.] More book recommendations are welcome.