The AstroStat Slog » Galaxies
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders
Fri, 09 Sep 2011 17:05:33 +0000 en-US hourly 1

[ArXiv] Voronoi Tessellations Wed, 28 Oct 2009 14:29:24 +0000 hlee

As a part of exploring the spatial distribution of particles/objects, without approximating it by a parametric Poisson or Gaussian process, and without imposing hypotheses such as homogeneity, isotropy, or uniformity, various nonparametric methods drew my attention for data exploration and preliminary analysis. Among those nonparametric methods, the one I fell in love with is tessellation (state space approaches are excluded here). Computation-wise, I believe tessellation is faster than kernel density estimation for estimating level sets of multivariate data. Furthermore, constructing polygons from a tessellation is conceptually simple and intuitive. However, coding and improving such algorithms is beyond statistical research (check books titled or keyworded partially by computational geometry). The good news is that, for computing and getting results, there is freely available software, and packages and modules in various forms.
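
To make the comparison with kernel density estimation concrete, here is a minimal sketch (my own illustration, not any cited paper's method) of tessellation-based density estimation in 2-D with scipy: the density estimate at a point is the inverse area of its Voronoi cell, skipping the unbounded boundary cells.

```python
import numpy as np
from scipy.spatial import Voronoi, ConvexHull

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(200, 2))  # mock 2-D point pattern
vor = Voronoi(points)

def cell_density(vor, i):
    """Density estimate at point i: inverse area of its Voronoi cell.

    Returns None for unbounded cells (those touching the boundary)."""
    region = vor.regions[vor.point_region[i]]
    if len(region) == 0 or -1 in region:
        return None
    # In 2-D, ConvexHull.volume is the polygon's area
    return 1.0 / ConvexHull(vor.vertices[region]).volume

densities = [cell_density(vor, i) for i in range(len(points))]
interior = [d for d in densities if d is not None]
```

Unlike a kernel density estimate, there is no bandwidth to choose; the cells adapt to the local point density automatically.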

As a part of introducing nonparametric statistics, I wanted to write about applications of computational geometry from the perspective of nonparametric density estimation in two and three dimensions. Also, the following article came along just as I began to collect statistical applications in astronomy (my [ArXiv] series). This [arXiv] paper, in fact, prompted me to investigate Voronoi tessellations in astronomy in general.

Voronoi Tessellations and the Cosmic Web: Spatial Patterns and Clustering across the Universe
by Rien van de Weygaert

Since then, quite some time has passed. In the meantime, I found more publications in astronomy that specifically use tessellation as a main tool for nonparametric density estimation and data analysis. Nonetheless, in general, topics in spatial statistics tend to be unrecognized or almost ignored in analyzing astronomical spatial data (I mean data points with coordinate information). Many seem to utilize statistics only partially or not at all. Some might want to know how often Voronoi tessellation is applied in astronomy. Here are the results of my ADS search, restricted to papers with tessellation among the title keywords:

Then the topic was forgotten for a while, until this recent [arXiv] paper reminded me of my old intention of introducing tessellation for density estimation and for understanding large scale structures or clusters (astronomers’ jargon, not the term in machine or statistical learning).

[arxiv:stat.ME:0910.1473] Moment Analysis of the Delaunay Tessellation Field Estimator
by M.N.M van Lieshout

Looking into the plots in the papers by van de Weygaert or van Lieshout, one can immediately understand, without mathematical jargon and abstraction, what Voronoi and Delaunay tessellations are (Delaunay tessellation is also called Delaunay triangulation (wiki); perhaps you want to check out wiki:Delaunay Tessellation Field Estimator as well). Voronoi tessellations have been adopted in many scientific/engineering fields to describe spatial distributions, and astronomy is no exception. Voronoi tessellation has been used for field interpolation.

van de Weygaert described Voronoi tessellations as follows:

  1. the asymptotic frame for the ultimate matter distribution,
  2. the skeleton of the cosmic matter distribution,
  3. a versatile and flexible mathematical model for weblike spatial patterns, and
  4. a natural asymptotic result of an evolution in which low-density expanding void regions dictate the spatial organization of the Megaparsec universe, while matter assembles in high-density filamentary and wall-like interstices between the voids.

van Lieshout derived explicit expressions for the mean and variance of the Delaunay Tessellation Field Estimator (DTFE) and showed that for stationary Poisson processes, the DTFE is asymptotically unbiased with a variance proportional to the squared intensity.
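
Leaving aside the interpolation step, the core of the DTFE can be sketched in a few lines (a simplification on my own toy setup, assuming scipy): the density estimate at sample point i is (d+1) divided by the total volume (area, in 2-D) of the Delaunay simplices sharing vertex i.

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(1)
pts = rng.uniform(0.0, 1.0, size=(300, 2))  # mock 2-D point pattern
tri = Delaunay(pts)
d = pts.shape[1]  # dimension, here 2

# Area of each Delaunay triangle (cross-product / shoelace formula)
v = pts[tri.simplices]  # shape (n_triangles, 3, 2)
areas = 0.5 * np.abs(
    (v[:, 1, 0] - v[:, 0, 0]) * (v[:, 2, 1] - v[:, 0, 1])
    - (v[:, 2, 0] - v[:, 0, 0]) * (v[:, 1, 1] - v[:, 0, 1]))

# DTFE: density at vertex i is (d+1) / total area of triangles touching i
contiguous_area = np.zeros(len(pts))
for simplex, a in zip(tri.simplices, areas):
    contiguous_area[simplex] += a
density = (d + 1) / contiguous_area
```

The full estimator then interpolates these vertex densities linearly inside each simplex, which the sketch above omits.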

We have observed voids and filaments of cosmic matter whose patterns no theory has yet fully explained. In general, those patterns are manifested via observed galaxies, both directly and indirectly. Individual observed objects, I believe, can be matched to the points that generate Voronoi polygons. Each point represents its polygon, and investigating the distributional properties of the polygons helps in understanding the formation rules and theories behind those patterns. For that matter, various topics in stochastic geometry, not just Voronoi tessellation, can probably be adopted.

There is a plethora of information available on Voronoi tessellation, such as the website of the International Symposium on Voronoi Diagrams in Science and Engineering. Two recent meeting websites are ISVD09 and ISVD08. Also, the following review paper is interesting.

Centroidal Voronoi Tessellations: Applications and Algorithms (1999) Du, Faber, and Gunzburger in SIAM Review, vol. 41(4), pp. 637-676

By the way, you may have noticed my preference for Voronoi tessellation over Delaunay, owing to the characteristic of the centroidal Voronoi tessellation that each observation is the center of its own Voronoi cell, as opposed to the property of the Delaunay triangulation that multiple simplices are associated with one observation/point. However, from the perspective of understanding the distribution of observations as a whole, both approaches offer summaries and insights in a nonparametric fashion, which is what I value most.
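
To see what "centroidal" buys you: in a centroidal Voronoi tessellation each generator coincides with the centroid of its own cell, and Lloyd's algorithm reaches such a configuration by iteration. Below is a Monte Carlo sketch of Lloyd's algorithm on the unit square (my own simplification; Du, Faber, and Gunzburger discuss far more refined algorithms).

```python
import numpy as np

def lloyd_cvt(generators, n_samples=20000, n_iter=40, seed=0):
    """Approximate a centroidal Voronoi tessellation of the unit square.

    Each iteration assigns a dense Monte Carlo sample of the domain to
    its nearest generator (i.e., its Voronoi cell) and then moves each
    generator to the centroid of its cell (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    g = np.array(generators, dtype=float)
    for _ in range(n_iter):
        s = rng.uniform(0.0, 1.0, size=(n_samples, 2))
        # squared distances from every sample to every generator
        d2 = ((s[:, None, :] - g[None, :, :]) ** 2).sum(axis=2)
        owner = d2.argmin(axis=1)
        for k in range(len(g)):
            cell = s[owner == k]
            if len(cell):
                g[k] = cell.mean(axis=0)
    return g

# Start from generators crowded in a corner; the CVT spreads them out.
init = np.random.default_rng(1).uniform(0.0, 0.2, size=(8, 2))
cvt = lloyd_cvt(init)
```

Starting from generators crowded in one corner, the iteration spreads them into a roughly regular arrangement, each sitting at the center of its own cell.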

SINGS Wed, 07 Oct 2009 01:30:41 +0000 hlee

From SINGS (Spitzer Infrared Nearby Galaxies Survey): Isn’t it a beautiful Hubble tuning fork?

As a first-year graduate student of statistics, because of the rumor that Prof. C. R. Rao would not be teaching much longer, and because of his fame as perhaps the most famous statistician alive, I enrolled in his “multivariate analysis” class without thinking much. Everything was smooth and easy for him, and he had an incredible memory for equations and proofs. However, I only grasped intuitive concepts, like why a method works, not the details of the mathematics, theorems, and their proofs. Instantly, I began to think about how these methods could be applied to astronomical data. After a few lessons, I desperately wanted to try out multivariate analysis methods to classify galactic morphology.

The dream died shortly after, because there was no data set that could be properly fed into statistical methods for classification. I spent quite some time searching astronomical databases, including ADS. This was before SDSS or VizieR became as popular as they are now. Then I thought about applying these methods to classify supernovae, because understanding the pattern of their light curves tells us a lot about the history of our universe (Type Ia SNe are standard candles) and because I knew of some publicly available SN light curves. Immediately, I realized that individual light curves are biased from the sampling perspective, and I did not know how to correct them for multivariate analysis. I also thought about applying multivariate analysis methods to stellar spectral types and to stars in different dynamical configurations (single, binary, association, etc.). I thought about how to apply the newly learned methods to every astronomical object I knew of, from sunspots to AGNs.

Regardless of the target objects to be scrutinized under this fascinating subject of “multivariate analysis,” two factors kept discouraging me. One was that I didn’t have enough training to develop, in a couple of weeks, new statistical models reflecting the unique statistical challenges embedded in the data: missing values, irregularities, non-iid structure, outliers, and other features that are hardly transcribed into a statistical setting. The other, more critical, factor was that there was no accessible astronomical data repository suited to statistical learning. Without deep knowledge of astronomy and trained skills for handling astronomical data, catalogs are generally useless. Those catalogs and data sets in archives are different from the data sets in machine learning repositories (which are intuitive to use).

Astronomers might think analyzing toy/mock data sets is not scientific because it does not lead to the new discoveries they constantly pursue. From a data analyst’s viewpoint, however, scientific advances mean finding tools that summarize data in an optimal manner. As I demanded in Astroinformatics, methods for retrieving information can be attempted and validated on well understood astrophysical data sets. The Pythagorean theorem was not proved just once; there are 39 different ways to prove it.

Seeing this nice poster image (the full resolution image of 56MB is available from the link) brought back memories of my enthusiasm for applying statistical learning methods for better knowledge discovery. As you can see, there are so many different types of galaxies, and often there is no clear boundary between them: consider classifying blurry galaxies by eye, where a spiral can be classified as an irregular, for example. Although I wish for automatic classification of these astrophysical objects, because of the difficulties of composing a training set for classification, or of collecting data of distinctive manifold groups for clustering, machine learning procedures are as complicated to develop as the complexity this tuning fork shows. The complex topology of astronomical objects seems to be the primary reason statistical learning applications lag behind those in other fields.

Nonetheless, multivariate analysis can be useful for viewing relations from different perspectives, apart from known physics models. It may help to develop more finely tuned physics models by taking into account latent variables found through statistical learning. Such attempts, I believe, can assist astronomers in designing telescopes and inventing efficient ways to collect and analyze data, by revealing which features matter most for understanding the morphological shapes of galaxies, patterns in light curves, spectral types, etc. As such experience accumulates, different physical insights can kick in, just as scientists once scrambled and assembled galaxies into a tuning fork, which led to the development of various evolution models.

To make a long story short, you have two choices: one, just enjoy these beautiful pictures and apprehend the complexity of our universe; or two, let this picture of Hubble’s tuning fork inspire you toward advances in astroinformatics. Whichever path you choose, it will be worth your time.

A book by David Freedman Tue, 10 Feb 2009 20:37:41 +0000 hlee A continuation of my posting titled circumspect frequentist.

Title: Statistical Models: Theory and Practice (click for the publisher’s website)
My one-line review, rather a comment, from several months ago was

Bias in asymptotic standard errors is not a familiar topic for astronomers

and I don’t understand why I wrote it, but I think I came up with this comment owing to my pursuit of modeling the measurement errors that occur in astronomical research.

My overall impression of the book was that astronomers might not fancy it because the cited examples and models are quite irrelevant to astronomy. On the contrary, I liked it because it reflects what statistics ought to be in the real world of data analysis. This does not mean the book covers every bit of statistics. When you teach statistics, you don’t expect a student’s learning curve to be continuous; you only hope that they jump the discontinuity points successfully, and you make every effort to lower the steps at those discontinuities. The book seemed to offer comfort to ease such efforts, or to hint at the promise of an almost continuous learning curve. The perspective and scope of the book were very impressive to me at the time.

It is sad to see brilliant-minded people pass away before their insights reach those who need them. I admire the professors at Berkeley, not only because of their research activities and contributions but also because of their pedagogical contributions to statistics and its applications to many fields, including astronomy (J. Neyman and E. Scott, for example, are as familiar to statisticians as to astronomers; their papers on the spatial distribution of galaxies are, to my knowledge, well read among astronomers).

Bipartisanship Wed, 10 Dec 2008 17:41:58 +0000 hlee We have seen the word “bipartisan” often during the election and during the ongoing recession. Sometimes I think that bipartisanship is driven not by politicians but by the media, commentators, and interpreters.

me: Why Bayesian methods?
astronomers: Because Bayesian is robust. Because frequentist method is not robust.

I intentionally kept the conversation short. Obviously, I didn’t ask all astronomers the same question, so this conversation does not reflect the opinion of all astronomers. Nevertheless, it summarizes what I felt at the CfA.

I was educated in the frequentist school, which I didn’t realize before I came to the CfA. There were a few Bayesian professors; although I took two of their courses, those had nothing to do with this bipartisanship (the contents were just foundations of statistics). In any case, I found that getting ideas and learning brilliant algorithms from Bayesians was just as joyful as learning mature statistical theories from frequentists.

How did astronomers come to possess the idea that Bayesian statistics is robust and frequentist statistics is not? Do they think that the celebrated Gaussian distribution and the almighty chi-square methods compose the whole frequentist world? (Please note that F-tests, LRT, K-S tests, and PCA take up only a small fraction of astronomers’ statistics next to chi-square methods, according to astronomical publications, let alone Bayesian methods; no statistic can compete with chi-square methods in astronomy.) Is this why they think frequentist methods are not robust? The longer a history is, the more flaws one finds, so no one expects chi-square methods to be a universal panacea; applying chi-square everywhere looks like adding ever more epicycles. History teaches that finding shortcomings should make us move forward, evolve, invent, and change paradigms, instead of simply declaring that chi-square (frequentist) methods are not robust. I don’t think we ever spent class time learning chi-square methods. There are many robust statistics that frequentists have developed; textbooks with “robust statistics” in their titles are most likely written by frequentists. Did astronomers check textbooks and journals before saying frequentist methods are not robust? I’m curious how this bipartisanship, in which one party is favored and the other is despised yet blindly utilized in data analysis, has developed. (Probably I should feel relieved that there is no statistical dictatorship in the astronomical society, and exuberant about the efforts of a small number of scientists to balance the two parties.)

Although I think more in the frequentist way, I don’t object to Bayesian methods. It’s no different from learning one’s mother tongue and culture. Oftentimes I am excited by how Bayesians get over troubles that frequentists couldn’t. If I may exaggerate, finding what frequentists have achieved but Bayesians haven’t yet, or the other way around, is similar to how changing the paradigm from the geocentric universe to the heliocentric one explained the motions of the planets with simplicity, instead of adding ever more epicycles and complicating the description of the motions. I equally cherish results from both statistical cultures. Satisfying simplicity and the fundamental laws, including probability theory, is what matters most in pursuing proper applications of statistics, not bipartisanship.

My next post will be about “Robust Statistics,” to rectify the notion of robustness that I acquired at the CfA. I’d like to hear your thoughts, astronomer and statistician alike, on robustness as associated with your statistical culture of choice. I can only write about robustness based on what I have read and been taught, which may also be biased. Perhaps other statisticians advocate the astronomers’ notion that Bayesian is robust and frequentist is not. Limited communication with statisticians makes it difficult for me to gauge the general consensus, and equally, I don’t know every astronomer’s thoughts on robustness. Nonetheless, I feel the notion of robustness differs between statisticians and astronomers, and this could generate some discussion.

Overall, I may sound like Joe Lieberman. But remember that tossing him back and forth from one party to the other was done explicitly by the media. People can be opinionated, but I’m sure he pursued his best interests regardless of party.

[ArXiv] 5th week, Apr. 2008 Mon, 05 May 2008 07:08:42 +0000 hlee Since I learned Hubble’s tuning fork[1] for the first time, I have wanted to classify galaxies (semi-supervised learning seems more suitable) based on their features (colors and spectra), instead of labor-intensive classification by human eyes. Ironically, at that time I didn’t know there was a field of computer science called machine learning, nor a branch of statistics doing such studies. Upon switching to statistics, with the hope of understanding the statistical packages implemented in IRAF and IDL and of better learning the contents of Numerical Recipes and Bevington’s book, ignorance was not the enemy; the accessibility of data was.

I’m glad to see that this week presented a paper I had dreamed of many years ago, in addition to other interesting papers. Nowadays I realize more and more that astronomical machine learning is not as simple as what we see in the machine learning and statistical computation literature, which typically adopts data sets from repositories whose characteristics have been well known for many years (for example, the famous iris data; there are toy data sets and mock catalogs, no shortage of data sets with public characteristics). As the long list of authors indicates, machine learning on massive astronomical data sets was never meant to be a little girl’s dream. With a bit of my sentiment, I offer this week’s list:

  • [astro-ph:0804.4068] S. Pires et al.
    FASTLens (FAst STatistics for weak Lensing) : Fast method for Weak Lensing Statistics and map making
  • [astro-ph:0804.4142] M.Kowalski et al.
    Improved Cosmological Constraints from New, Old and Combined Supernova Datasets
  • [astro-ph:0804.4219] M. Bazarghan and R. Gupta
    Automated Classification of Sloan Digital Sky Survey (SDSS) Stellar Spectra using Artificial Neural Networks
  • [gr-qc:0804.4144] E. L. Robinson, J. D. Romano, A. Vecchio
    Search for a stochastic gravitational-wave signal in the second round of the Mock LISA Data challenges
  • [astro-ph:0804.4483] C. Lintott et al.
    Galaxy Zoo : Morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey
  • [astro-ph:0804.4692] M. J. Martinez Gonzalez et al.
    PCA detection and denoising of Zeeman signatures in stellar polarised spectra
  • [astro-ph:0805.0101] J. Ireland et al.
    Multiresolution analysis of active region magnetic structure and its correlation with the Mt. Wilson classification and flaring activity

A relevant slog post on machine learning for galaxy morphology can be found at svm and galaxy morphological classification.

<Added: 3rd week May 2008> [astro-ph:0805.2612] S. P. Bamford et al.
Galaxy Zoo: the independence of morphology and colour

  1. Wikipedia link: Hubble sequence
The GREAT08 Challenge Fri, 29 Feb 2008 03:46:49 +0000 vlk Grand statistical challenges seem to be all the rage nowadays. Following on the heels of the Banff Challenge (which dealt with figuring out how to set the bounds for the signal intensity that would result from the Higgs boson) comes the GREAT08 Challenge (arxiv/0802.1214) to deal with one of the major issues in observational Cosmology, the effect of dark matter. As Douglas Applegate puts it:

We are organizing a competition specifically targeting the statistics and computer science communities. The challenge is to measure cosmic shear at a level sufficient for future surveys such as the Large Synoptic Survey Telescope. Right now, we’ve stripped out most of the complex observational issues, leaving a pure statistical inference problem. The competition kicks off this summer, but we want to give possible participants a chance to prepare.

The website will provide continual updates on the competition.

[ArXiv] SVM and galaxy morphological classification, Sept. 10, 2007 Wed, 12 Sep 2007 20:31:30 +0000 hlee From arxiv/astro-ph:0709.1359,
A robust morphological classification of high-redshift galaxies using support vector machines on seeing limited images. I Method description by M. Huertas-Company et al.

Machine learning and statistical learning are becoming more and more popular in astronomy. Artificial Neural Networks (ANN) and Support Vector Machines (SVM) are rarely absent when classification of massive survey data is the objective. The authors provide a gentle tutorial on SVM for galactic morphological classification. Their source code, GALSVM, is linked for interested readers.

One of the biggest challenges in applying SVM or other classification methods in astronomy is the quantification of measures, that is, how to define parameters and variables that are physically meaningful and machine-interpretable at the same time. The authors of arxiv/astro-ph:0709.1359 followed the idea of Abraham et al. (1994), who introduced concentration. However, my impression so far is that standardized indices (like economic indicators) are hardly found for classification purposes in astronomy. An astronomical machine learning consortium would accelerate our understanding of the many populations in the Universe.
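
As a toy illustration of the idea (not the GALSVM pipeline; the two features and their distributions are entirely made up), an SVM separating two morphological classes by hypothetical concentration and asymmetry indices might look like this, assuming scikit-learn:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 200
# Hypothetical morphological features (concentration, asymmetry),
# drawn so that "early-type" galaxies are more concentrated/symmetric.
early = np.column_stack([rng.normal(4.0, 0.5, n), rng.normal(0.05, 0.03, n)])
late = np.column_stack([rng.normal(2.5, 0.5, n), rng.normal(0.25, 0.08, n)])
X = np.vstack([early, late])
y = np.repeat([0, 1], n)  # 0 = early-type, 1 = late-type

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold CV accuracy
```

The hard part in practice is not the classifier but defining features like these so that the classes are actually separable, which is exactly the quantification problem above.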

[ArXiv] GRB host galaxies, Aug. 10, 2007 Tue, 14 Aug 2007 16:47:04 +0000 hlee From arxiv/astro-ph:0708.1510v1
Connecting GRBs and galaxies: the probability of chance coincidence by Cobb and Bailyn

Without an optical afterglow, a galaxy within the 2 arcsecond error region of a GRB X-ray afterglow is identified as a host galaxy; however, confusion can arise because (1) the edge of a galaxy is diffuse, (2) multiple sources may exist within the 2 arcsecond error region, (3) the distance between the galaxy and the X-ray afterglow is measured in projection, and (4) lensing increases brightness and shifts positions. In this paper, the authors “investigated the fields of 72 GRBs in order to examine the general issue of associations between GRBs and host galaxies.”

The authors raise some statistical issues in this matching of GRBs and host galaxies, but current knowledge and techniques seem to fall short of the problem. Still, to prevent false discoveries, the authors propose strategic studies of the following:

  • Gamma-ray luminosity indicators
  • Detection (or non-detection) of SNe (supernovae) for long-duration bursts
  • Classification of the associated galaxy: long-duration and short-duration bursts are associated with late-type and early-type galaxies, respectively
  • Optical afterglow spectral absorption features
  • Visual detection of true host galaxy as happened with GRB 060912a
  • X-ray afterglow spectral emission lines, and
  • Strong lensing of x-ray afterglows

As multi-wavelength studies become popular nowadays, this source matching issue across bands arises continually, and statistics can contribute to validating source matching methods. So far, those methods have been incomprehensible to statisticians.

Photometric Redshifts Wed, 25 Jul 2007 06:28:40 +0000 hlee Since I began to subscribe to arxiv/astro-ph abstracts, from an astrostatistical point of view one of the most frequent topics has been photometric redshifts. It has been a popular topic as catalogs of photometric observations of remote objects multiply in volume and sky survey projects in multiple bands lead to virtual observatories (VO; to be discussed in a later posting). Just searching for photometric redshifts in Google Scholar yields more than 2000 articles since 2000.

Redshift is one of the key astronomical measures, identifying the type of an object as well as providing its distance. Typically, measuring redshifts requires spectral data, which are quite expensive in many respects compared to photometric data. Let me explain a little what spectral data and photometric data are, to aid the understanding of non-astronomers.

Collecting photometric data starts with taking pictures through different filters. Through blue, yellow, or red optical filters, or infrared, ultraviolet, or X-ray filters, objects look different (or have different light intensities), and various astronomical objects can be identified by investigating pictures from many filter combinations. On the other hand, collecting spectral data starts with dispersing light through a specially designed prism. Because of this light dispersion, it takes longer to collect light from an object, and fewer objects are recorded on a picture plate than when collecting photometric data. A nice feature of these expensive spectral data is that they provide the physical condition of the object directly: first, the distance, from the relative shifts of spectral lines; second, abundance (the metallic composition of the object), temperature, and object type, also from spectral lines. Therefore, utilizing photometric data to infer measures normally available only from spectral data is a very attractive topic in astronomy.

However, there are many challenges. The massive volume of data and sampling biases*, like Malmquist bias (wiki) and Lutz-Kelker bias, hinder traditional regression techniques; numerous statistical and machine learning methods have been introduced to make the most of these photometric data and to infer distances economically and quickly.

* For a reference regarding these biases and astronomical distances, please check Distance Estimation in Cosmology by Hendry, M. A. and Simmons, J. F. L., Vistas in Astronomy, vol. 39, Issue 3, pp. 297-314.
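
The flavor of such a sampling bias is easy to demonstrate with a toy simulation (all numbers below are invented for illustration): a magnitude limit preferentially keeps intrinsically bright objects, so the observed mean absolute magnitude is biased bright relative to the parent population.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
# Absolute magnitudes from a Gaussian "luminosity function" (toy numbers)
M = rng.normal(-20.0, 1.0, n)
# Distances in Mpc, uniform in volume out to 1000 Mpc
u = rng.uniform(0.0, 1.0, n)
d = 1000.0 * np.maximum(u, 1e-12) ** (1.0 / 3.0)  # guard against u == 0
# Apparent magnitude via the distance modulus m = M + 5 log10(d / 10 pc)
m = M + 5.0 * np.log10(d * 1.0e5)

# A magnitude-limited survey keeps only apparently bright objects ...
observed = M[m < 19.0]
# ... which are intrinsically brighter (more negative M) than the population
bias = observed.mean() - M.mean()
```

Any regression of redshift (or distance) on photometry trained only on the `observed` sample inherits this selection effect, which is why the bias corrections matter.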

[ArXiv] Kernel Regression, June 20, 2007 Mon, 25 Jun 2007 17:27:54 +0000 hlee One of the papers from arxiv/astro-ph, astro-ph/0706.2704, discusses kernel regression and model selection for determining photometric redshifts. The paper presents their studies on choosing the kernel bandwidth via 10-fold cross-validation, choosing appropriate models from various combinations of input parameters by estimating root mean square error and AIC, and comparing their kernel regression against other regression and classification methods using root mean square errors from a literature survey. They conclude that kernel regression is flexible, particularly for data at high z.
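
A minimal sketch of the general approach (my own toy version, not the paper's code): Nadaraya-Watson kernel regression with the bandwidth chosen by 10-fold cross-validated RMSE on synthetic one-dimensional data.

```python
import numpy as np

def nw_predict(x_train, y_train, x_eval, h):
    """Nadaraya-Watson kernel regression with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x_eval[:, None] - x_train[None, :]) / h) ** 2)
    return (w * y_train).sum(axis=1) / (w.sum(axis=1) + 1e-300)  # underflow guard

def cv_bandwidth(x, y, candidates, k=10, seed=0):
    """Pick the bandwidth minimizing k-fold cross-validated RMSE."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)

    def cv_rmse(h):
        sq_errs = []
        for fold in folds:
            train = np.ones(len(x), dtype=bool)
            train[fold] = False
            pred = nw_predict(x[train], y[train], x[fold], h)
            sq_errs.append(((pred - y[fold]) ** 2).mean())
        return float(np.sqrt(np.mean(sq_errs)))

    return min(candidates, key=cv_rmse)

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 300)
y = np.sin(4.0 * x) + rng.normal(0.0, 0.1, 300)
h_best = cv_bandwidth(x, y, candidates=[0.01, 0.03, 0.1, 0.3, 1.0])
```

Cross-validation rules out the widest bandwidths, which oversmooth the underlying curve, as well as the narrowest, which chase the noise.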

Off topic, but worth noting:
1. They used AIC for model comparison. In spite of the many advocates for BIC, choosing AIC may do a better job for analyzing catalog data (399,929 galaxies), since the penalty term in BIC with a huge sample will lead to selecting the most parsimonious model.

2. Although a more detailed discussion hasn’t been posted, I’d like to point out that photometric redshift studies are more or less regression problems. Whether they use sophisticated and up-to-date classification schemes such as support vector machines (SVM) or artificial neural networks (ANN), or classical regression methods, the goal of a photometric redshift study is to find predictors for correct classification and to build a model from those predictors. I wish there were some studies on quantile regression, which has received much attention recently in economics.

3. Adaptive kernels were mentioned, and results from adaptive kernel regression are highly anticipated.

4. Comparing root mean square errors from various classification and regression models based on Sloan Digital Sky Survey (SDSS) EDR (Early Data Release) through DR5 (Data Release 5) might bias the conclusion about the best regression/classification method, due to the different sample sizes from EDR to DR5. Further formulation, especially of the asymptotic properties of these root mean square errors, would be very useful for making a legitimate comparison among different regression/classification strategies.
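
Point 1 can be made concrete with a back-of-the-envelope comparison: to prefer a model with extra parameters, the log-likelihood must improve by more than half the extra penalty, and at catalog sizes BIC's per-parameter penalty log(n) dwarfs AIC's 2 (the five extra parameters below are hypothetical).

```python
import numpy as np

def aic(loglik, k):
    """Akaike information criterion: penalty of 2 per parameter."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian information criterion: penalty of log(n) per parameter."""
    return -2.0 * loglik + k * np.log(n)

n = 399_929  # catalog size quoted in the paper
extra = 5    # hypothetical extra parameters in the richer model

# To tie on AIC: 2 * gain = 2 * extra  ->  gain = extra
aic_gain_needed = float(extra)
# To tie on BIC: 2 * gain = extra * log(n)  ->  gain = extra * log(n) / 2
bic_gain_needed = extra * np.log(n) / 2.0
```

With n ≈ 4 × 10^5, the richer model needs a log-likelihood gain of about 32 to win under BIC versus only 5 under AIC, which is exactly why BIC favors parsimony on huge catalogs.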
