The AstroStat Slog » black box
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy + Statistics + Computer Science + Engineering + Instrumentation, far beyond the growing borders

Classification and Clustering
http://hea-www.harvard.edu/AstroStat/slog/2008/classification-and-clusterin/ (hlee, 18 Sep 2008)

Another conclusion deduced from reading preprints listed on arxiv/astro-ph is that astronomers tend to confuse classification with clustering and to mix up the two methodologies. They tend to think that any algorithm from classification or clustering analysis serves their purpose, since both kinds of algorithms, no matter what, look like a black box. By a black box I mean something like a neural network, which is one of the classification algorithms.

Simply put, classification is a regression problem and clustering is a mixture problem with an unknown number of components. Defining a classifier, i.e. a regression model, is the objective of classification, while determining the number of clusters is the objective of clustering. In classification, predefined classes exist, such as galaxy types and stellar types, and one wishes to know which predictor variables, and what function of them, allow quasars to be separated from stars without individual spectroscopic observations, relying only on a handful of variables from photometric data. In clustering analysis, there is no predefined class, but some plots suggest multiple populations, and one wishes to determine the number of clusters mathematically, rather than subjectively concluding that the plot shows two clusters after some subjective data cleaning. A good example: as observations of gamma-ray bursts accumulated, extracting features like F_{90} and F_{50} enabled scatter plots of many GRBs, which eventually led people to believe that there are multiple populations of GRBs. Clustering algorithms back this hypothesis in a more objective manner, as opposed to the subjective reading of scatter plots with non-statistical outlier elimination.
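To make the contrast concrete, here is a minimal Python sketch assuming scikit-learn and purely synthetic two-feature data; the populations, sample sizes, and the range of candidate cluster counts are illustrative assumptions, not anything taken from a real survey. The classification step fits a model to known labels, while the clustering step must also estimate how many components the mixture has.

    # Classification vs. clustering on synthetic two-feature data.
    # Everything below (populations, sizes, candidate k) is made up
    # for illustration only.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    pop_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
    pop_b = rng.normal(loc=[2.5, 2.0], scale=0.7, size=(200, 2))
    X = np.vstack([pop_a, pop_b])

    # Classification: labels are known in advance, so we fit a
    # regression-type model (logistic regression) as the classifier.
    y = np.array([0] * 200 + [1] * 200)
    clf = LogisticRegression().fit(X, y)
    print("classifier accuracy:", clf.score(X, y))

    # Clustering: no labels; the number of components is itself unknown
    # and is chosen here by minimizing the BIC over candidate mixtures.
    bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
           for k in range(1, 6)}
    print("BIC-selected number of clusters:", min(bic, key=bic.get))

The BIC is only one of several model-selection criteria; the point is that in clustering the number of groups is an output of the analysis, not an input.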

However, there are challenges in making a clean cut between classification and clustering, both in statistics and in astronomy. In statistics, missing data is the phrase people use to describe this challenge. Fortunately, there is a field called semi-supervised learning to tackle it. (Supervised learning is equivalent to classification, and unsupervised learning to clustering.) Semi-supervised learning algorithms are applicable to data in which a portion has known class types and the rest are missing; astronomical catalogs with unidentified objects are a good candidate for applying semi-supervised learning algorithms.
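As an illustration of that catalog setting, the following minimal sketch again assumes scikit-learn; the synthetic features, the choice of LabelSpreading with a kNN kernel, and the fraction of labeled objects are all assumptions made for the example rather than a recommended recipe.

    # Semi-supervised learning: a few objects carry known classes, the
    # rest are unlabeled (-1, the scikit-learn convention). All data
    # here are synthetic and purely illustrative.
    import numpy as np
    from sklearn.semi_supervised import LabelSpreading

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 1.0, size=(150, 3)),
                   rng.normal(3.0, 1.0, size=(150, 3))])
    y = np.full(300, -1)            # -1 marks "class unknown"
    y[:15] = 0                      # a handful of identified objects ...
    y[150:165] = 1                  # ... from each of two known classes

    model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
    inferred = model.transduction_  # labels propagated to all objects
    print("objects assigned to class 1:", int((inferred == 1).sum()))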

From the astronomy side, the fact that classes are not well defined, or are subjective, is the main cause of this confusion between classification and clustering, and also the origin of this challenge. For example, will astronomers A and B produce the same results when classifying galaxies according to Hubble’s tuning fork?[1] We are not testing individual cognitive skills. Is there a consensus on where to make the cut between F9 stars and G0 stars? What makes a star F9.5 instead of G0? In the presence of error bars, how can one be sure that a star is F9 and not G0? I don’t see any decision-theoretic explanation in survey papers when those stellar spectral classes are presented. Classification is generally for data with categorical responses, but astronomers tend to turn what used to be categorical into something continuous while still applying the same old classification algorithms designed for categorical responses.

From a clustering analysis perspective, this challenge is caused by outliers, or peculiar objects that do not belong to the majority. These peculiar objects may be numerous enough to make up a new, previously unknown class. Or their number may be so small that a strong belief prevails to discard these data points as observational mistakes. How much can we trim data with unavoidable and uncontrollable contamination (remember, we cannot control astronomical data, as opposed to earthly kinds)? What is the primary factor defining the number of clusters: physics, statistics, astronomers’ experience in processing and cleaning data, ...?
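The trimming question can be made tangible with another small sketch assuming scikit-learn; the isolation forest and the 2% contamination fraction below are arbitrary illustrative choices, and the point is precisely that this fraction is a knob someone has to justify before the resulting cluster count means anything.

    # Flagging "peculiar" objects before clustering. The contamination
    # fraction decides how much of the sample gets trimmed; different
    # choices yield different cleaned data sets.
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(2)
    bulk = rng.normal(0.0, 1.0, size=(500, 2))   # the majority population
    odd = rng.uniform(-6.0, 6.0, size=(10, 2))   # a handful of oddballs
    X = np.vstack([bulk, odd])

    flags = IsolationForest(contamination=0.02, random_state=0).fit_predict(X)
    print("objects flagged as outliers:", int((flags == -1).sum()))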

Once the ambiguity between classification and clustering and the complexity of the data sets are resolved, another challenge is still waiting: which black box? For most classification algorithms, Pattern Recognition and Machine Learning by C. Bishop offers a broad spectrum of black boxes. Yet the book does not cover the various clustering algorithms that statisticians have developed, nor outlier detection. To become more rigorous in selecting a black box for clustering analysis and outlier detection, one is recommended to look further into the statistics literature on those topics.

To me, astronomers tend to be in haste, owing to the pressure of publishing results immediately after a data release, and tend to overlook methodologies suitable for their survey data. It seems there is no time to consult machine learning specialists to verify the approaches they adopt. My personal prayer is that this haste does not settle into a trend in astronomical surveys and large data analysis.

  1. Check out the project, GALAXY ZOO
[ArXiv] 1st week, June 2008
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-1st-week-june-2008/ (hlee, 9 Jun 2008)

Despite the lack of statistics-related discussion, a paper comparing XSPEC and ISIS, two open-source spectral analysis applications, might attract high energy astrophysicists' interest this week.

  • [astro-ph:0806.0650] Kimball and Ivezić
    A Unified Catalog of Radio Objects Detected by NVSS, FIRST, WENSS, GB6, and SDSS (The catalog is available HERE. I am always fascinated by the possibilities in catalog data sets that machine learning and statistics can explore, and I do hope that the measurement error columns get recognition from non-astronomers.)

  • [astro-ph:0806.0820] Landau and Simeone
    A statistical analysis of the data of Δα/α from quasar absorption systems (It discusses Student's t-tests, from which confidence intervals for unknown variances and sample sizes based on Type I and II errors are obtained.)

  • [stat.ML:0806.0729] R. Girard
    High dimensional gaussian classification (Model-based classification via a Gaussian mixture approach, though often referred to as clustering in astronomy, is very popular for multi-dimensional astronomical data.)

  • [astro-ph:0806.0520] Vio and Andreani
    A Statistical Analysis of the “Internal Linear Combination” Method in Problems of Signal Separation as in CMB Observations (Independent component analysis, ICA, is discussed; see the ICA sketch after this list.)

  • [astro-ph:0806.0560] Noble and Nowak
    Beyond XSPEC: Towards Highly Configurable Analysis (The flow of spectral analysis with XSPEC and Sherpa has never come to me smoothly; it has been a personal struggle. The paper seems to treat XSPEC as a black box, with which I completely agree. Its main objective is to compare XSPEC and ISIS.)

  • [astro-ph:0806.0113] Casandjian and Grenier
    A revised catalogue of EGRET gamma-ray sources (The maximum likelihood detection method, which I have never come across in the statistical literature, is utilized.)
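As mentioned next to the Vio and Andreani entry, here is a minimal sketch of signal separation with ICA, using scikit-learn's FastICA on synthetic mixed signals; it is only a toy demonstration of the technique and makes no attempt to reproduce the paper's internal linear combination analysis of CMB data.

    # Recovering two synthetic sources from their linear mixtures with
    # FastICA. The sources, mixing matrix, and noise level are invented
    # for illustration.
    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(3)
    t = np.linspace(0.0, 1.0, 2000)
    s1 = np.sin(40.0 * t)                      # a smooth synthetic source
    s2 = np.sign(np.sin(7.0 * t))              # a non-Gaussian square wave
    S = np.c_[s1, s2] + 0.05 * rng.normal(size=(2000, 2))

    A = np.array([[1.0, 0.5], [0.4, 1.0]])     # "unknown" mixing matrix
    X = S @ A.T                                # the observed mixtures

    recovered = FastICA(n_components=2, random_state=0).fit_transform(X)
    print("recovered sources shape:", recovered.shape)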