The AstroStat Slog » clustering http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders. Fri, 09 Sep 2011 17:05:33 +0000

[ArXiv] classifying spectra http://hea-www.harvard.edu/AstroStat/slog/2009/arxiv-classifying-spectra/ Fri, 23 Oct 2009 00:08:07 +0000 hlee

[arXiv:stat.ME:0910.2585]
Variable Selection and Updating In Model-Based Discriminant Analysis for High Dimensional Data with Food Authenticity Applications
by Murphy, Dean, and Raftery

Classifying or clustering (or running semi-supervised learning on) spectra is a very challenging problem, from collecting statistical-analysis-ready data to reducing dimensionality without sacrificing the complex information in each spectrum. It is challenging not only to estimate spiky (non-differentiable) curves via statistically well-defined estimating procedures, but also to transform the data so that they match the regularity conditions assumed in statistics.

Another reason that classifying and clustering astrophysical spectroscopic data is more difficult is that the observed lines, and their intensities and FWHMs on top of the continuum, are related to atomic databases and to latent variables/hyperparameters (distance, rotation, absorption, column density, temperature, metallicity, type, system properties, etc.). It frequently becomes a very challenging mixture problem to separate lines from one another and from the continuum (boundary and identifiability issues). This complexity appears only in astronomical spectroscopic data, because we obtain only indirect or uncontrolled data governed by physics, as opposed to the meat-species spectra in the paper. Spectroscopic data outside astronomy are rather smooth, observed over a controlled wavelength range, with no worries about correcting for recession/radial velocity/redshift/extinction/lensing, etc.

Although the part most relevant to astronomers, spectroscopic data processing, is not discussed in this paper, the most important part, applying statistical learning to complex curves (spectral data), is well described. Astronomers with appropriate data may want to try the variable selection strategy and check out the classification methods from statistics. If it works out, it might save space for storing spectral data and time for collecting high resolution spectra. Keep in mind that it is not necessary to use the same variable selection strategy. Astronomers can create better working versions for classification and clustering purposes, like hardness ratios, which are often used to reduce the dimensionality of spectral data since low total count spectra are not informative over the full energy (wavelength) range. Curse of dimensionality!

SINGS http://hea-www.harvard.edu/AstroStat/slog/2009/sings/ http://hea-www.harvard.edu/AstroStat/slog/2009/sings/#comments Wed, 07 Oct 2009 01:30:41 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/?p=3628

From SINGS (Spitzer Infrared Nearby Galaxies Survey): Isn’t it a beautiful Hubble tuning fork?

In my first year as a graduate student in statistics, because of a rumor that Prof. C. R. Rao would not teach much longer and because of his fame as the most famous living statistician, I enrolled in his "multivariate analysis" class without much thought. Everything was smooth and easy for him, and he had an incredible memory for equations and proofs. However, I grasped only intuitive concepts, such as why a method works, not the details of the mathematics, theorems, and their proofs. Instantly, I began to think about how these methods could be applied to astronomical data. After a few lessons, I desperately wanted to try out multivariate analysis methods to classify galactic morphology.

The dream died shortly afterward because there was no data set that could be properly fed into statistical classification methods. I spent quite some time searching astronomical databases, including ADS. This was before SDSS or VizieR became as popular as they are now. Then I thought about classifying supernovae, because understanding the pattern of their light curves tells us a lot about the history of our universe (Type Ia SNe are standard candles) and because I knew of some publicly available SN light curves. I immediately realized that individual light curves are biased from a sampling perspective, and I did not know how to correct them for multivariate analysis. I also thought about applying multivariate methods to stellar spectral types and to stars in different mechanical systems (single, binary, association, etc.). I thought about applying the newly learned methods to every kind of astronomical object I had learned about, from sunspots to AGNs.

Regardless of the target objects to be scrutinized under this fascinating subject, "multivariate analysis," two factors kept discouraging me. One was that I did not have enough training to develop, within a couple of weeks, new statistical models reflecting the unique challenges embedded in the data: missing values, irregularities, non-iid behavior, outliers, and other features that are hard to transcribe into a statistical setting. The other, more critical factor was that there was no accessible astronomical database repository suited for statistical learning. Without deep knowledge of astronomy and trained skills for handling astronomical data, catalogs are generally useless. Those catalogs and archived data sets are quite different from the data sets in machine learning repositories, which are intuitive to use.

Astronomers may think that analyzing toy/mock data sets is not scientific because it does not lead to the new discoveries they always make. From a data analyst's viewpoint, however, scientific advances also mean finding tools that summarize data in an optimal manner. As I argued in Astroinformatics, methods for retrieving information can be attempted and validated on well-understood, astrophysically vetted data sets. The Pythagorean theorem was not proved only once; there are 39 different ways to prove it.

Seeing this nice poster image (the full-resolution 56MB image is available from the link) brought back memories of my enthusiasm for applying statistical learning methods for better knowledge discovery. As you can see, there are many different types of galaxies, and often there is no clear boundary between them; consider classifying blurry galaxies by eye: a spiral can be classified as an irregular, for example. Although I wish for automatic classification of these astrophysical objects, because of the difficulty of composing a training set for classification, or of collecting data from distinct manifold groups for clustering, machine learning procedures are as complicated to develop as this tuning fork is complex. The complex topology of astronomical objects seems to be the primary reason statistical learning applications lag behind those in other fields.

Nonetheless, multivariate analysis can be useful for viewing relations from different perspectives, apart from known physics models. It may help in developing more finely tuned physics models by taking into account latent variables found through statistical learning. Such attempts, I believe, can assist astronomers in designing telescopes and in inventing efficient ways to collect and analyze data, by revealing which features matter most for understanding the morphological shapes of galaxies, patterns in light curves, spectral types, and so on. As such experience accumulates, different physical insights can kick in, just as scientists once scrambled and assembled galaxies into a tuning fork that led to various evolution models.

To make a long story short, you have two choices: one, just enjoy these beautiful pictures and apprehend the complexity of our universe; or two, let this picture of Hubble's tuning fork inspire you toward advances in astroinformatics. Whichever path you choose, it is worth your time.

[MADS] Parallel Coordinates http://hea-www.harvard.edu/AstroStat/slog/2009/mads-parallel-coordinate/ Wed, 29 Jul 2009 06:02:18 +0000 hlee Speaking of XAtlas from my previous post, I tried another visualization tool, parallel coordinates, on these Capella observations and on two stars with multiple observations (AR Lac and IM Peg). As discussed in [MADS] Chernoff face, a full description of the catalog is found on the XAtlas website. The reason for choosing these stars is that, among low mass stars, next to Capella (16 observations, shown previously), IM Peg (HD 216489, 8 observations) and AR Lac (6 observations, although at different phases) are the most frequently observed. I was curious which variation dominates: within-star (statistical variation) or between-star (Capella, IM Peg, AR Lac). How would they look in the parameter space of high resolution grating spectroscopy from Chandra?

Given 13 X-ray line and/or continuum ratios, a typical display would be the 13-choose-2 combinations of scatter plots, as follows. Note that the upper-left panels are drawn in three colors for classification purposes (red: AR Lac, blue: IM Peg, green: Capella), while the lower-right ones are uncolored for clustering purposes. Such scatter plots are essential to exploratory data analysis, but with this many panels they do not convey information efficiently. In astronomical journals, thanks to astronomers' a priori knowledge, a few pairs of important variables are selected and displayed, reducing the visualization complexity dramatically. Unfortunately, I cannot select only the physically important variables.

[figure "pairs": matrix of pairwise scatter plots of the 13 ratios]

I am not a deeply knowledgeable astronomer, but I believe in reducing dimensionality according to the research objective. The goal is set by asking "what do you want from this multivariate data set?": classification (a classification rule or regression model that separates the three stars, Capella, AR Lac, and IM Peg), clustering (do the three stars naturally cluster into three groups, or into some other number of clusters not visible in the scatter plots above?), hypothesis testing (are they the same type of star or not?), point estimation and confidence intervals (means and their error bars), or variable selection (dimension reduction). So far no statistical question is well defined (which can be a good thing for new discoveries). Prior to any confirmatory analysis, we had better find a way to display this multidimensional data set efficiently. I thought parallel coordinates would serve the purpose well, but surprisingly the method seems never to have been discussed in the astronomical literature; at least it did not appear in ADS.

[figures "pc_n" and "pc_s": parallel coordinate plots of the normalized and standardized data]

Each of the 13 variables was either normalized (left) or standardized (right). The parallel coordinate plots look both simpler and more informative. The Capella observations occupy a relatively separable region compared to the other stars. It is easy to see that one Capella observation is an obvious outlier relative to the rest, which is hardly visible in the scatter plots. It is also clear that discriminant analysis or classical support vector machine classification cannot separate AR Lac and IM Peg, and that clustering based on dissimilarity measures cannot reveal a natural grouping of these two stars, whereas the Capella observations form their own cluster. In my opinion, parallel coordinates convey more information about multidimensional data (dim > 3), in a simpler way, than scatter plots of multivariate data do. They naturally show highly correlated variables, both within observations of the same star and across all target stars. This insight from visualization is a key to devising variable selection or dimension reduction methods for the data set.
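The two scaling choices and the plot itself are easy to sketch. Below is a minimal Python version with numpy; the 13 XAtlas ratios are replaced by a synthetic array with unequal column scales, the star labels are only illustrative, and the drawing routine expects a matplotlib axis to be passed in.

```python
import numpy as np

def minmax_scale(X):
    """Normalize each column to [0, 1] (the 'normalized' panel)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def zscore_scale(X):
    """Standardize each column to mean 0, sd 1 (the 'standardized' panel)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def plot_parallel_coordinates(X, labels, ax):
    """Draw one polyline per observation across the scaled variable axes."""
    colors = {"Capella": "g", "AR Lac": "r", "IM Peg": "b"}
    for row, lab in zip(X, labels):
        ax.plot(range(X.shape[1]), row, color=colors[lab], alpha=0.6)
    ax.set_xticks(range(X.shape[1]))

# synthetic stand-in for the 13 spectral line/continuum ratios
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 13)) * rng.uniform(0.1, 5.0, size=13)  # unequal scales
Xn, Xs = minmax_scale(X), zscore_scale(X)
```

Without one of these scalings, the variable with the largest range dominates the vertical axis and the polylines for the other twelve ratios collapse into flat lines.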

My personal opinion is that, lacking an efficient and informative tool for visualizing complex high resolution spectra across many detailed metrics, smoothed bivariate (at most trivariate) summaries such as hardness ratios and quantiles are used instead to display X-ray spectral data. I am not saying that parallel coordinates are the ultimate answer to visualizing multivariate data, but I would like to emphasize that the method is an informative, intuitive, and simple way to understand the structure of a relatively high dimensional data cloud.

Parallel coordinates have a long history; the earliest discussion I found dates to the 1880s. The method was popularized by Alfred Inselberg and gained recognition among statisticians through Edward Wegman (1990, Hyperdimensional Data Analysis Using Parallel Coordinates). Colorful images of the Sun, stars, galaxies, and their coronae, interstellar gas, and jets are the eye-catchers. I hope data visualization tools gain an equal spotlight, since they summarize data and deliver lots of information. If images are well-decorated cakes, then these EDA tools are sophisticated, well-baked cookies.

——————- [Added]
According to

[arxiv:0906.3979] The Golden Age of Statistical Graphics
Michael Friendly (2008)
Statistical Science, Vol. 23, No. 4, pp. 502-535

it is 1885. Not knowing French (if I did, I would want to read the original 1885 paper before anything else), I do not know what the reference is about.

Classification and Clustering http://hea-www.harvard.edu/AstroStat/slog/2008/classification-and-clusterin/ Thu, 18 Sep 2008 23:48:43 +0000 hlee Another conclusion I have drawn from reading preprints listed on arxiv/astro-ph is that astronomers tend to confuse classification with clustering and to mix up the methodologies. They tend to think any algorithm from classification or clustering analysis serves their purpose, since both kinds of algorithms look like black boxes; by black box I mean something like a neural network, which is a classification algorithm.

Simply put, classification is a regression problem and clustering is a mixture problem with an unknown number of components. Defining a classifier (a regression model) is the objective of classification, while determining the number of clusters is the objective of clustering. In classification, predefined classes exist, such as galaxy types and star types, and one wishes to know which predictor variables, and which function of them, separate quasars from stars by relying only on a handful of photometric variables, without individual spectroscopic observations. In clustering, there is no predefined class, but some plots suggest multiple populations, and one wishes to determine the number of clusters mathematically, instead of subjectively concluding that a plot shows two clusters after some subjective data cleaning. A good example: as photons from gamma-ray bursts accumulated, extracting features like T90 and T50 enabled scatter plots of many GRBs, which eventually led people to believe there are multiple populations of GRBs. Clustering algorithms back this hypothesis in a more objective manner, as opposed to subjective scatter plots with non-statistical outlier elimination.

However, there are challenges in making a clean cut between classification and clustering, both in statistics and in astronomy. In statistics, missing data is the phrase used to describe one such challenge; fortunately, there is a field called semi-supervised learning to tackle it. (Supervised learning corresponds to classification and unsupervised learning to clustering.) Semi-supervised learning algorithms apply to data in which a portion of the objects have known class types and the rest are missing; astronomical catalogs with unidentified objects are good candidates for applying semi-supervised learning algorithms.
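The catalog situation can be made concrete with a deliberately crude sketch (a real analysis would use a proper semi-supervised method such as label propagation or a semi-supervised mixture model): objects whose class is unknown, marked -1 below, simply inherit the class of the nearest labeled object in feature space. The data and the function are entirely illustrative.

```python
import numpy as np

def fill_unknown_classes(X, y):
    """Assign each unlabeled point (y == -1) the class of its nearest
    labeled neighbor; a crude 1-NN stand-in for semi-supervised learning."""
    y = y.copy()
    labeled = np.where(y >= 0)[0]
    unlabeled = np.where(y == -1)[0]
    # pairwise distances from each unlabeled point to each labeled point
    d = np.linalg.norm(X[unlabeled, None, :] - X[None, labeled, :], axis=2)
    y[unlabeled] = y[labeled[d.argmin(axis=1)]]
    return y

# toy "catalog": two photometric colors; classes 0/1 known for two objects
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0],
              [5.1, 4.9], [0.1, 0.2], [4.9, 5.1]])
y = np.array([0, -1, 1, -1, -1, -1])
y_full = fill_unknown_classes(X, y)
```

The point of the sketch is only the data layout: a partially filled class column is exactly what semi-supervised algorithms consume, whereas a classifier needs it full and a clustering method ignores it.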

On the astronomy side, the fact that classes are not well defined, or are subjective, is the main cause of the confusion between classification and clustering, and also the origin of this challenge. For example, will astronomers A and B produce the same results when classifying galaxies according to Hubble's tuning fork?[1] (We are not testing individual cognitive skills.) Is there a consensus cut between F9 stars and G0 stars? What makes a star F9.5 instead of G0? Given error bars, how can one be sure a star is F9 and not G0? I see no decision-theoretic justification in survey papers where stellar spectral classes are presented. Classification is generally for data with categorical responses, but astronomers tend to turn what used to be categorical into something continuous, while still applying the same old classification algorithms designed for categorical responses.

From a clustering perspective, the challenge is caused by outliers, peculiar objects that do not belong to the majority. These peculiar objects may be numerous enough to make up a new, unprecedented class; or they may be so few that a strong belief prevails to discard them as observational mistakes. How much can we trim data with unavoidable and uncontrollable contamination (remember, we cannot control astronomical data the way we control earthly kinds)? What primarily determines the number of clusters: physics, statistics, or astronomers' experience in processing and cleaning data?

Once the ambiguity between classification and clustering and the complexity of the data sets are resolved, another challenge is still waiting: which black box? For most classification algorithms, Pattern Recognition and Machine Learning by C. Bishop offers a broad spectrum of black boxes. Yet the book does not include the various clustering algorithms that statisticians have developed, nor outlier detection. To be more rigorous in selecting a black box for clustering analysis and outlier detection, one is recommended to check the relevant statistics literature.

It seems to me that astronomers tend to be in haste, owing to the pressure to publish results immediately after a data release, and so overlook methodologies suitable for their survey data. There appears to be no time to consult machine learning specialists to verify the approaches they adopt. My personal prayer is that this haste does not settle in as a trend in astronomical surveys and large-scale data analysis.

  1. Check out the project, GALAXY ZOO
my first AAS. IV. clustering http://hea-www.harvard.edu/AstroStat/slog/2008/my-first-aas-iv-clustering/ Fri, 20 Jun 2008 03:42:06 +0000 hlee I was asked by two attendees, acquaintances from before the AAS, whether I could suggest clustering methods relevant to their projects. In the end, we spent quite some time clarifying the term clustering.

  • The statistician’s and astronomer’s understanding of clustering is different:
    • classification vs. clustering, or supervised learning vs. unsupervised learning: the former term in each pair indicates that the scientist already knows the types of the objects at hand. A photometric data set with an additional column saying star, galaxy, quasar, or unknown is a target for classification (supervised learning); simply put, classification is finding a rule based on photometric colors that separates these different types of objects. If there is no such column, but scatter plots (or plots after dimension reduction) manifest grouping patterns, it is clustering (unsupervised learning), whose goal is to find hyperplanes that separate the clusters optimally; in other words, answering the questions "are there real clusters? if so, how many?" is the objective. Rudimentarily, the presence of an extra column of types is what differentiates classification from clustering.
    • physical clustering vs. statistical clustering:
      Cosmologists and the like are interested in clusters/clumps of matter/particles/objects; for astrophysicists, clusters are associated with the spatial evolution of the universe. Inquiries about clustering from astronomers are therefore likely about finding these spatial clumps statistically, a subject of stochastic geometry or spatial statistics. Statisticians and data analysts, on the other hand, investigate clusters in a reparameterized multi-dimensional space; the distances computed there do not follow the fundamental laws of physics (gravitation, EM, weak, and strong) but reflect relationships in that space; for example, in a color-magnitude diagram, stars of a kind are grouped. The consensus between the two communities is that the number of clusters is unknown, so the plethora of classification methods cannot be applied, and that the objective is a methodology for quantifying clusters.
  • astronomers' clustering problems are either statistical classification (close to semi-supervised learning) or spatial statistics.
    Manifesting noisy clusters in the universe, or quantifying the current matter distribution, leads to the very fundamentals of the birth of the universe, where spatial statistics can be a great partner. In the era of photometric redshifts, various classification techniques enhance the accuracy of prediction.
  • astronomers' tests of the reality of clusters seem limited: cosmology problems have been tackled as inverse problems. Based on theoretical cosmological models, simulations are performed and the results are transformed into surrogate parameters. These surrogates are generally represented by smooth curves or straight lines in a plot, where observations make their debut as points with bidirectional error bars (so-called measurement errors). The judgment about the cosmological model under test is made by simple regression (correlation) or by eye on these observed data points; if the observations and the curve from a model match well in a 2D plot, the model is confirmed in the conclusion section. Personally, I think this procedure for testing cosmological models against clusters in the universe could be developed in a more statistically rigorous fashion than matching straight lines.
  • Challenges to statisticians in astronomy, measurement errors: in (statistical) learning, I believe, there has been no standard procedure for incorporating astronomers' measurement errors into modeling. I think measurement errors are generally ignored because systematic errors are not recognized in statistics. In astronomy, on the other hand, the measurement errors accompanying data are a crucial piece of information, particularly for verifying the significance of the observations. Often these measurement errors appear in the denominator of the χ2 function, which is treated as χ2-distributed to obtain best fits and confidence intervals.
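The χ2-with-measurement-errors recipe in the last bullet can be written out for the simplest case, a straight line with known per-point σ. This is a generic weighted least squares sketch on made-up data, not the procedure of any particular survey paper.

```python
import numpy as np

def chi2_line_fit(x, y, sigma):
    """Minimize chi2 = sum((y - a - b*x)^2 / sigma^2) for a straight line.
    Returns (a, b), their covariance matrix, and the minimum chi2."""
    w = 1.0 / sigma**2
    A = np.column_stack([np.ones_like(x), x])     # design matrix [1, x]
    cov = np.linalg.inv(A.T @ (w[:, None] * A))   # (A^T W A)^{-1}
    params = cov @ (A.T @ (w * y))                # weighted normal equations
    chi2 = np.sum(w * (y - A @ params) ** 2)
    return params, cov, chi2

x = np.linspace(0.0, 5.0, 6)
sigma = np.full_like(x, 0.5)   # "measurement errors" in the denominator
y = 1.0 + 2.0 * x              # exact line, so the minimum chi2 is 0
(a, b), cov, chi2 = chi2_line_fit(x, y, sigma)
```

The covariance matrix is where the error bars on the fitted parameters come from, which is exactly the information lost when measurement errors are dropped from the modeling.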

The personal lesson from two short discussions at the AAS: we need more collaboration between statisticians and astronomers to include measurement errors in classification or semi-supervised learning, particularly nowadays when we are enjoying a plethora of data sets, and to move forward, with better aid from statisticians, in testing/verifying the existence of clusters beyond fitting a straight line.

[ArXiv] 3rd week, May 2008 http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-3rd-week-may-2008/ http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-3rd-week-may-2008/#comments Mon, 26 May 2008 18:59:38 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/?p=316 Not many this week, but there’s a great read.

  • [stat.ME:0805.2756] Fionn Murtagh
    The Remarkable Simplicity of Very High Dimensional Data: Application of Model-Based Clustering

  • [astro-ph:0805.2945] Martin, de Jong, & Rix
    A comprehensive Maximum Likelihood analysis of the structural properties of faint Milky Way satellites

  • [astro-ph:0805.2946] Kelly, Fan, & Vestergaard
    A Flexible Method of Estimating Luminosity Functions [my subjective comment is added at the bottom]

  • [stat.ME:0805.3220] Bayarri, Berger, Datta
    Objective Bayes testing of Poisson versus inflated Poisson models (will it be of use when one is dealing with many zero background counts, underpopulated above zero background counts, and underpopulated source counts?)

[Comment] You must read it; it can serve as a very good Bayesian tutorial for astronomers. I think there is a typo, nothing major, a plus/minus sign in the likelihood. Tom Loredo has kindly explained the Schechter function through his extensive slog comments, and this paper made me appreciate the gamma distribution more. The Schechter function and the gamma density share the same functional form, although the objectives of their use have little in common. (Forgive my Bayesian ignorance of the extensive uses of the gamma distribution beyond the fact that it is conjugate to the Poisson and exponential distributions.)
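That conjugacy is a one-line update: with a Gamma(α, β) prior on a Poisson rate λ (shape-rate parameterization, my choice here) and iid observed counts n_1, …, n_k, the posterior is Gamma(α + Σn_i, β + k). A tiny sketch:

```python
import numpy as np

def gamma_poisson_posterior(alpha, beta, counts):
    """Conjugate update: Gamma(alpha, beta) prior on a Poisson rate,
    shape-rate parameterization, given iid Poisson counts."""
    counts = np.asarray(counts)
    return alpha + counts.sum(), beta + counts.size

# e.g. a vague Gamma(1, 1) prior and three observed counts
alpha_post, beta_post = gamma_poisson_posterior(1.0, 1.0, [2, 3, 4])
posterior_mean = alpha_post / beta_post  # shrinks the sample mean toward the prior
```

The posterior mean (α + Σn_i)/(β + k) sits between the prior mean α/β and the sample mean, which is the usual Bayesian shrinkage picture.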

FYI, there was another recent arxiv paper on zero-inflation [stat.ME:0805.2258] by Bhattacharya, Clarke, & Datta
A Bayesian test for excess zeros in a zero-inflated power series distribution

language barrier http://hea-www.harvard.edu/AstroStat/slog/2008/language-barrier/ Wed, 13 Feb 2008 20:41:32 +0000 hlee Last week, I was at the Tufts colloquium and happened to have a conversation with a computer scientist about density-based clustering. I understood density as probability density and was recalling a paper by Fraley and Raftery (Model-Based Clustering, Discriminant Analysis, and Density Estimation, JASA, 2002, 97, p.458) and similar papers I had seen in engineering journals such as the IEEE Transactions. For a few moments I felt uncomfortable, until she explained that density meant "how dense the observations are." Density-based clustering here meant distance-based clustering, like k-means or minimum spanning trees, most likely nonparametric approaches.
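The distinction she drew can be made concrete: a "how dense the observations are" clusterer links points closer than some radius eps and calls each connected component a cluster (astronomers know this as friends-of-friends; DBSCAN is a refined relative), with no probability density anywhere in sight. A minimal sketch, with illustrative data:

```python
import numpy as np

def friends_of_friends(X, eps):
    """Link points within distance eps; connected components are clusters."""
    labels = np.full(len(X), -1, dtype=int)
    cluster = 0
    for seed in range(len(X)):
        if labels[seed] != -1:
            continue
        stack = [seed]
        labels[seed] = cluster
        while stack:                # flood-fill one connected component
            j = stack.pop()
            near = np.linalg.norm(X - X[j], axis=1) <= eps
            for k in np.where(near & (labels == -1))[0]:
                labels[k] = cluster
                stack.append(k)
        cluster += 1
    return labels

X = np.array([[0.0, 0.0], [0.3, 0.0], [0.0, 0.3],  # one clump
              [7.0, 7.0], [7.3, 7.0]])             # another clump
labels = friends_of_friends(X, eps=1.0)
```

A model-based (Fraley and Raftery style) clusterer would instead fit a mixture of probability densities to the same points; both communities call the result "density-based," which was exactly the barrier.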

Although the words are the same, their first impressions and usage differ from community to community (even among statisticians). One word I am very reluctant to use with both astronomers and statisticians is model; I am quite puzzled by the reactions from both sides. To clarify meanings, implications, or intentions, some clever adjectives must accompany these common words; however, once one gets used to the jargon, the adjectives feel redundant to one's fellow scientists/colleagues, whereas the outsider gets lost and has to seek explanations of the usage through related examples and background.

Beyond simple words like model and density, there are more jargon terms that require interdisciplinary semantic experts. Yet patience in explaining, and open-mindedness, would easily help overcome language barriers in any interdisciplinary work.

[ Would you mind sharing your experience of any language barrier? ]

[ArXiv] Three Classes of GRBs, July 21, 2007 http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-three-classes-of-grbs/ http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-three-classes-of-grbs/#comments Wed, 25 Jul 2007 07:22:52 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-three-classes-of-grbs-july-21-2007/ From arxiv/astro-ph:0705.4020v2
Statistical Evidence for Three classes of Gamma-ray Bursts by T. Chattopadhyay et al.

In general, gamma-ray bursts (GRBs) are classified into two groups: long (>2 sec) and short (<2 sec) duration bursts. Nonetheless, there have been studies, including arxiv/astro-ph:0705.4020v2, that statistically support an optimal three clusters. The pioneering work on GRB clustering was based on hierarchical clustering methods by Mukherjee et al. (Three Types of Gamma-Ray Bursts).

The new feature of this article is that Chattopadhyay et al. applied k-means and the Dirichlet-process model-based clustering method to confirm three classes of GRBs. In addition, they investigated classes among 21 GRBs with known redshifts (click for those GRBs).
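k-means itself fits in a dozen lines. Here is a sketch run on synthetic log10 durations (two fake clumps near 0.3 s and 30 s); the actual Chattopadhyay et al. analysis uses real BATSE variables and also the Dirichlet-process model, neither of which is reproduced here.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign to the nearest center, recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        new = centers.copy()
        for j in range(k):
            members = X[assign == j]
            if len(members):          # guard against an empty cluster
                new[j] = members.mean(axis=0)
        if np.allclose(new, centers):
            break
        centers = new
    return centers, assign

# synthetic log10(T90): a "short" clump near -0.5 and a "long" clump near 1.5
logT90 = np.array([[-0.6], [-0.5], [-0.4], [1.4], [1.5], [1.6]])
centers, assign = kmeans(logT90, k=2)
```

Note that k must be supplied in advance; arguing for three classes rather than two is precisely the model-selection question the paper addresses with its Dirichlet-process and validation machinery.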
