The AstroStat Slog » outliers
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

[MADS] Mahalanobis distance
Mon, 09 Mar 2009, by hlee

It bears the name of its inventor, Prasanta Chandra Mahalanobis. As opposed to the Euclidean distance, a household name, the name of this distance is rarely used, but many pseudonyms exist, with variations adapted into broad scientific disciplines and applications. Therefore, under different names, I believe that the Mahalanobis distance is frequently applied in exploring and analyzing astronomical data.

First, a simple definition of the Mahalanobis distance:
$$D^2(X_i)=(X_i-\bar{X})^T\hat{\Sigma}^{-1}(X_i-\bar{X})$$
It can be seen as a Euclidean distance after standardizing multivariate data. In one way or another, scatter plots and regression analyses (including all sorts of fancy correlation studies) reflect the notion of this distance.
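As a concrete illustration of the definition above, here is a minimal numpy sketch that computes D² for each observation relative to the sample mean and sample covariance. The function name and the simulated data are my own; none of this comes from the papers discussed below.

```python
import numpy as np

def mahalanobis_d2(X):
    """Squared Mahalanobis distance of each row of X to the sample mean.

    X : (n, p) array of n observations in p dimensions.
    Returns an (n,) array of D^2 values.
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)                  # sample mean, \bar{X}
    cov = np.cov(X, rowvar=False)        # sample covariance, \hat{\Sigma}
    prec = np.linalg.inv(cov)            # inverse covariance
    diff = X - mu
    # D^2_i = diff_i^T prec diff_i, computed for all i at once
    return np.einsum('ij,jk,ik->i', diff, prec, diff)

# With an identity covariance the Mahalanobis distance reduces to the
# ordinary Euclidean distance from the mean, i.e. the standardized case.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
d2 = mahalanobis_d2(X)
```

Note the standardization at work: correlated or differently-scaled coordinates are whitened by the inverse covariance before distances are measured.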

To my knowledge, the Mahalanobis distance is employed for exploring multivariate data when one is interested in finding, diagnosing, or justifying the removal of outliers, analogous to astronomers’ 3 or 5 σ rule in univariate cases. Classical textbooks on multivariate data analysis or classification/clustering contain detailed information. One may also check the Wikipedia entry.

Surprisingly, despite its popularity, the terminology itself is under-represented in ADS. Only a single paper, in ApJS and none in the other major astronomical journals, was found with the name Mahalanobis in its abstract.

ApJS, v.176, pp.276-292 (2008): The Palomar Testbed Interferometer Calibrator Catalog (van Belle et al)

Unfortunately, their description and usage of the Mahalanobis distance are quite ambiguous. See the quote:

The Mahalanobis distance (MD) is a multivariate generalization of one-dimensional Euclidean distance

They only showed the MD for measuring distance among uncorrelated variables/metrics and did not mention that the generalization is obtained through the covariance matrix.

Since standardization (generalization or normalization) is a pretty common practice, the lack of appearances in abstracts does not mean it is not used in astronomy. So I did a full-text search among A&A, ApJ, AJ, and MNRAS, which led me to four more publications containing the Mahalanobis distance. Far fewer than I expected.

  1. A&A, 330, 215 (1998): Study of an unbiased sample of B stars observed with Hipparcos: the discovery of a large amount of new slowly pulsating B stars (Waelkens et al.)
  2. MNRAS, 334, 20 (2002): UBV(RI)C photometry of Hipparcos red stars (Koen et al.)
  3. AJ, 99, 1108 (1990): Kinematics and composition of H II regions in spiral galaxies. II – M51, M101 and NGC 2403 (Zaritsky, Elston, and Hill)
  4. A&A, 343, 496 (1999): An analysis of the incidence of the VEGA phenomenon among main-sequence and POST main-sequence stars (Plets and Vynckier)

The last two papers give the definition in the form I know (p. 1114 of Zaritsky, Elston, and Hill; p. 504 of Plets and Vynckier).

Including the usage given in these papers, the Mahalanobis distance is popularly used in exploratory data analysis (EDA): 1. measuring distance, 2. outlier removal, 3. checking the normality of multivariate data. Because it requires estimating the (inverse) covariance matrix, it shares tools with principal component analysis (PCA), linear discriminant analysis (LDA), and other methods involving Wishart distributions.
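Use 2 (outlier removal) can be sketched briefly: for Gaussian data, D² approximately follows a chi-square distribution with p degrees of freedom (mean p, variance 2p), so a rough multivariate analogue of the 3 σ cut flags points beyond p + 3√(2p). The numpy sketch below uses simulated data with planted outliers; the cutoff rule is a heuristic of mine, not taken from the papers above.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 2
X = rng.normal(size=(300, p))
X[:5] += 8.0                      # plant five obvious outliers

mu = X.mean(axis=0)
prec = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d2 = np.einsum('ij,jk,ik->i', diff, prec, diff)

# For Gaussian data D^2 is roughly chi-square with p degrees of freedom;
# flag points beyond its mean plus three standard deviations, a crude
# multivariate analogue of the univariate 3-sigma rule.
cutoff = p + 3 * np.sqrt(2 * p)
outliers = np.flatnonzero(d2 > cutoff)
```

In practice one would iterate (outliers inflate the covariance estimate they are judged against), which is exactly the kind of subtlety robust-statistics methods address.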

By the way, the Wishart distribution is also under-represented in ADS. Only one paper appeared via an abstract text search:
[2006MNRAS.372.1104P] Likelihood techniques for the combined analysis of CMB temperature and polarization power spectra (Percival, W. J.; Brown, M. L.)

Lastly, I want to point out that estimating the covariance matrix and its inverse can be very challenging in various problems, which has led people to develop numerous algorithms, strategies, and applications. These mathematical and computational challenges and prescriptions are not presented in this post. Please be aware that estimating the (inverse) covariance matrix is not as simple as it may appear once one is faced with real data.
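To hint at one of those prescriptions: when the sample is small relative to the dimension, the sample covariance is singular and cannot be inverted at all. A common remedy is to shrink it toward a scaled identity matrix (diagonal loading). The numpy sketch below uses a hand-picked shrinkage weight; data-driven choices such as Ledoit–Wolf exist but are not shown, and the function name is my own.

```python
import numpy as np

def shrunk_covariance(X, alpha=0.1):
    """Shrink the sample covariance toward a scaled identity matrix.

    alpha in [0, 1] is a hand-picked shrinkage weight; data-driven
    choices (e.g. Ledoit-Wolf) exist but are not sketched here.
    """
    S = np.cov(X, rowvar=False)
    p = S.shape[0]
    target = np.trace(S) / p * np.eye(p)   # scaled identity target
    return (1 - alpha) * S + alpha * target

# With fewer observations than dimensions the raw sample covariance is
# singular, but the shrunk estimate is positive definite and invertible.
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 20))               # n = 5 < p = 20
S_shrunk = shrunk_covariance(X, alpha=0.2)
prec = np.linalg.inv(S_shrunk)             # now well-defined
```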

Classification and Clustering
Thu, 18 Sep 2008, by hlee

Another conclusion deduced from reading preprints listed in arxiv/astro-ph is that astronomers tend to confuse classification and clustering and to mix up their methodologies. They tend to think that any algorithm from classification or clustering analysis serves their purpose, since both kinds of algorithms, no matter what, look like a black box. By a black box I mean something like a neural network, which is one of the classification algorithms.

Simply put, classification is a regression problem and clustering is a mixture problem with an unknown number of components. Defining a classifier, a regression model, is the objective of classification; determining the number of clusters is the objective of clustering. In classification, predefined classes exist, such as galaxy types and star types, and one wishes to know which predictor variables, and what function of them, allow one to separate quasars from stars by relying only on a handful of variables from photometric data, without individual spectroscopic observations. In clustering analysis, there is no predefined class, but some plots visualize multiple populations, and one wishes to determine the number of clusters mathematically, so as not to be subjective in concluding that a plot shows two clusters after some subjective data cleaning. A good example: as photons from gamma-ray bursts accumulated, extracting features like F_{90} and F_{50} enabled scatter plots of many GRBs, which eventually led people to believe there are multiple populations of GRBs. Clustering algorithms back this hypothesis in a more objective manner, as opposed to the subjective manner of scatter plots with non-statistical outlier elimination.
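To make the clustering objective concrete, here is a bare-bones numpy sketch: a toy k-means run on two simulated, well-separated populations, comparing within-cluster scatter across candidate numbers of clusters (the classic elbow heuristic). The data are entirely simulated, not GRB measurements, and the implementation is a minimal sketch rather than a production algorithm.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Bare-bones Lloyd's k-means: returns labels and the
    within-cluster sum of squares (WCSS)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared distances of every point to every center: (n, k)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    wcss = ((X - centers[labels]) ** 2).sum()
    return labels, wcss

# Two simulated populations in a 2-D feature space.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.0, 0.3, size=(200, 2)),
               rng.normal(+1.5, 0.3, size=(200, 2))])
wcss = {k: kmeans(X, k)[1] for k in (1, 2, 3, 4)}
# The drop in within-cluster scatter flattens past k = 2: the "elbow"
# heuristic for choosing the number of clusters.
```

More principled choices of k (e.g. information criteria on a fitted mixture model) exist, but even this crude comparison is more objective than eyeballing a scatter plot.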

However, there are challenges in making a clean cut between classification and clustering, both in statistics and astronomy. In statistics, missing data is the phrase people use to describe this challenge: part of the class labels are missing. Fortunately, there is a field called semi-supervised learning that tackles it. (Supervised learning is equivalent to classification, and unsupervised learning to clustering.) Semi-supervised learning algorithms are applicable to data in which a portion has known class types and the rest are missing; astronomical catalogs with unidentified objects are a good candidate for applying semi-supervised learning algorithms.
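The semi-supervised idea can be sketched very simply: let the few labeled objects lend their labels to nearby unlabeled ones. Below is a toy nearest-labeled-neighbor self-training pass in numpy; the function, the data, and the label encoding (-1 for unidentified) are illustrative assumptions of mine, not an algorithm from any particular paper.

```python
import numpy as np

def self_train_1nn(X, y, n_rounds=10):
    """Toy self-training: y contains -1 for unlabeled points; each round,
    every unlabeled point adopts the label of its nearest labeled
    neighbor. A sketch of the semi-supervised idea, not production code."""
    y = y.copy()
    for _ in range(n_rounds):
        unlabeled = np.flatnonzero(y == -1)
        labeled = np.flatnonzero(y != -1)
        if unlabeled.size == 0:
            break
        # squared distances from unlabeled to labeled points: (u, l)
        d = ((X[unlabeled, None, :] - X[None, labeled, :]) ** 2).sum(-1)
        y[unlabeled] = y[labeled[d.argmin(axis=1)]]
    return y

# A simulated "catalog" with two object types, mostly unidentified:
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(3.0, 0.5, size=(100, 2))])
y = np.full(200, -1)
y[:3], y[100:103] = 0, 1          # only six objects have known classes
y_filled = self_train_1nn(X, y)
```

Real semi-supervised methods (label propagation on graphs, mixture models with partial labels) are more careful, but the principle of spreading a handful of identifications through a catalog is the same.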

From the astronomy side, the fact that classes are not well defined, or are subjective, is the main cause of this confusion between classification and clustering, and also the origin of the challenge. For example, would astronomers A and B produce the same results when classifying galaxies according to Hubble’s tuning fork?[1] We are not testing individual cognitive skills. Is there a consensus on where to make the cut between F9 stars and G0 stars? What makes a star F9.5 instead of G0? With the presence of error bars, how is one sure that a star is F9, not G0? I do not see any decision-theoretic explanation in survey papers when those stellar spectral classes are presented. Classification is generally for data with categorical responses, but astronomers tend to turn what used to be categorical into something continuous, and still apply the same old classification algorithms designed for categorical responses.

From a clustering analysis perspective, this challenge is caused by outliers, or peculiar objects that do not belong to the majority. The set of peculiar objects can be large enough to make up a new, unprecedented class. Or its number may be so small that a strong belief prevails to discard these data points as observational mistakes. How much can we trim data with unavoidable and uncontrollable contamination (remember, we cannot control astronomical data as opposed to earthly kinds)? What is the primary factor defining the number of clusters: physics, statistics, astronomers’ experience in processing and cleaning data, …?

Once the ambiguity between classification and clustering, and the complexity of the data set, are resolved, another challenge is still waiting: which black box? For most classification algorithms, Pattern Recognition and Machine Learning by C. Bishop offers a broad spectrum of black boxes. Yet the book does not include the various clustering algorithms that statisticians have developed, nor outlier detection. To be more rigorous in selecting a black box for clustering analysis and outlier detection, one is recommended to check,

To me, astronomers tend to be in haste, owing to the pressure of publishing results immediately after a data release, and to overlook methodologies suitable for their survey data. It seems there is no time to consult machine learning specialists to verify the approaches they adopted. My personal prayer is that this haste does not settle into a trend in astronomical surveys and large data analysis.

  1. Check out the project, GALAXY ZOO