The AstroStat Slog » Objects Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders Fri, 09 Sep 2011 17:05:33 +0000

The Perseid Project [Announcement] Mon, 02 Aug 2010 21:21:35 +0000 vlk There is an ambitious project afoot to build a 3D map of a meteor stream during the Perseids on Aug 11-12. I got this missive about it from the organizer, Chris Crawford:

This will be one of the better years for Perseids; the moon, which often interferes with the Perseids, will not be a problem this year. So I’m putting together something that’s never been done before: a spatial analysis of the Perseid meteor stream. We’ve had plenty of temporal analyses, but nobody has ever been able to get data over a wide area — because observations have always been localized to single observers. But what if we had hundreds or thousands of people all over North America and Europe observing Perseids and somebody collected and collated all their observations? This is crowd-sourcing applied to meteor astronomy. I’ve been working for some time on putting together just such a scheme. I’ve got a cute little Java applet that you can use on your laptop to record the times of fall of meteors you see, the spherical trig for analyzing the geometry (oh my aching head!) and a statistical scheme that I *think* will reveal the spatial patterns we’re most likely to see — IF such patterns exist. I’ve also got some web pages describing the whole shebang. They start here:

I think I’ve gotten all the technical, scientific, and mathematical problems solved, but there remains the big one: publicizing it. It won’t work unless I get hundreds of observers. That’s where you come in. I’m asking two things of you:

1. Any advice, criticism, or commentary on the project as presented in the web pages.
2. Publicizing it. If we can get that ol’ Web Magic going, we could get thousands of observers and end up with something truly remarkable. So, would you be willing to blog about this project on your blog?
3. I would be especially interested in your comments on the statistical technique I propose to use in analyzing the data. It is sketched out on the website here:

Given my primitive understanding of statistical analysis, I expect that your comments will be devastating, but if you’re willing to take the time to write them up, I’m certainly willing to grit my teeth and try hard to understand and implement them.

Thanks for any help you can find time to offer.

Chris Crawford

An Instructive Challenge Tue, 15 Jun 2010 18:38:56 +0000 vlk This question came to the CfA Public Affairs office, and I am sharing it with y’all because I think the solution is instructive.

A student had to figure out the name of a stellar object as part of an assignment. He was given the following information about it:

  • apparent [V] magnitude = 5.76
  • B-V = 0.02
  • E(B-V) = 0.00
  • parallax = 0.0478 arcsec
  • radial velocity = -18 km/s
  • redshift = 0 km/s

He looked in all the stellar databases but was unable to locate it, so he asked the CfA for help.

Just to help you out, here are a couple of places where you can find comprehensive online catalogs:

See if you can find it!

Answer next month.

Update (2010-aug-02):
The short answer is, I could find no such star in any commonly available catalog. But that is not the end of the story. There does exist a star in the Hipparcos catalog, HIP 103389, that has approximately the right distance (21 pc), radial velocity (-16.1 km/s), and V magnitude (5.70). It doesn’t match exactly, and the B-V is completely off, but that is the moral of the story.

The thing is, catalogs are not perfect. The same objects often have very different numerical entries in different catalogs. This could be due to a variety of reasons, such as different calibrations, different analysers, or even intrinsic variations in the source. And you can bet your bottom dollar that the quoted statistical uncertainties in the quantities do not account for the observed variance. Take the B-V value, for instance. It is 0.5 for HIP 103389, but the initial problem stated that it was 0.02, which makes it an A type star. But if it were an A type star at 21 pc, it should have had a magnitude of V~1.5, much brighter than the required 5.76!

I think this illustrates one of the fundamental tenets of science as it is practiced, versus how it is taught. The first thing that a practicing scientist does (especially one not of the theoretical persuasion) is to try and see where the data might be wrong or misleading. It should only be included in analysis after it passes various consistency checks and is deemed valid. The moral of the story is, don’t trust data blindly just because it is a “number”.

SINGS Wed, 07 Oct 2009 01:30:41 +0000 hlee

From SINGS (Spitzer Infrared Nearby Galaxies Survey): Isn’t it a beautiful Hubble tuning fork?

As a first-year graduate student in statistics, I enrolled in Prof. C. R. Rao’s “multivariate analysis” class without much thought, partly because of a rumor that he would not teach any more and partly because of his fame as the most famous statistician alive. Everything was smooth and easy for him, and he had an incredible memory for equations and proofs. However, I grasped only the intuitive concepts, such as why a method works, not the details of the mathematics, theorems, and proofs. I immediately began to think about how the methods could be applied to astronomical data. After a few lessons, I desperately wanted to try out multivariate analysis methods to classify galactic morphology.

The dream died shortly afterwards because there was no data set that could be properly fed into statistical methods for classification. I spent quite some time searching astronomical databases, including ADS. This was before SDSS and VizieR became as popular as they are now. Then I thought about classifying supernovae, because understanding the patterns in their light curves tells us a lot about the history of our universe (Type Ia SNe are standard candles) and because I knew of some publicly available SN light curves. I immediately realized that individual light curves are biased from the sampling perspective, and I did not know how to correct them for multivariate analysis. I also thought about applying multivariate analysis methods to stellar spectral types and to stars of different mechanical systems (single, binary, association, etc.). I thought about how to apply the newly learned methods to every astronomical object I had learned about, from sunspots to AGNs.

Regardless of the target objects to be scrutinized under this fascinating subject of “multivariate analysis,” two factors kept discouraging me. One was that I did not have enough training to develop, in a couple of weeks, new statistical models that reflect the unique statistical challenges embedded in the data: missing values, irregularities, non-iid observations, outliers, and other features that are hard to transcribe into a statistical setting. The other, more critical, was that there was no accessible astronomical data repository for statistical learning. Without deep knowledge of astronomy and trained skills in handling astronomical data, catalogs are generally useless. The catalogs and data sets in archives differ from the data sets in machine learning repositories, which are intuitive.

Astronomers may think that analyzing toy/mock data sets is not scientific because it does not lead to the new discoveries they are used to making. From a data analyst’s viewpoint, scientific advances mean finding tools that summarize data in an optimal manner. As I demanded in Astroinformatics, methods for retrieving information can be attempted and validated with well understood, astrophysically devastated data sets. The Pythagorean theorem was not proved only once; there are 39 different ways to prove it.

Seeing this nice poster image (the full-resolution image, 56 MB, is available from the link) brought back memories of my enthusiasm for applying statistical learning methods to better knowledge discovery. As you can see, there are many different types of galaxies, and often there is no clear boundary between them. Consider classifying blurry galaxies by eye: a spiral can be classified as an irregular, for example. Although I wish for automatic classification of these astrophysical objects, developing machine learning procedures is as complicated as the tuning fork itself, because it is difficult to compose a training set for classification or to collect data on distinct manifold groups for clustering. The complex topology of astronomical objects seems to be the primary reason that statistical learning applications are lacking compared to other fields.

Nonetheless, multivariate analysis can be useful for viewing relations from different perspectives, apart from known physics models. It may help to develop more finely tuned physics models by taking into account latent variables found through statistical learning. Such attempts, I believe, can help astronomers design telescopes and invent efficient ways to collect and analyze data, by revealing which features matter most for understanding the morphological shapes of galaxies, patterns in light curves, spectral types, etc. When such experience accumulates, different physical insights can kick in, just as scientists scrambled and assembled galaxies into a tuning fork, which led to the development of various evolution models.

To make a long story short, you have two choices: one, just enjoy these beautiful pictures and apprehend the complexity of our universe, or two, let this picture of Hubble’s tuning fork inspire you toward advances in astroinformatics. Whichever path you choose, it is worth your time.

different views Mon, 13 Jul 2009 00:33:25 +0000 hlee An email was forwarded with questions related to the data sets found in “Be an INTEGRAL astronomer”. Among the sets, the following scatter plot is based on the Crab data.


If you ignore the time predictor, it is hard to believe that this is a light curve, that is, time-dependent data. At a glance, this data set looks like a simple block design for one-way ANOVA. ANOVA stands for Analysis of Variance, a term that is not familiar to astronomers.

Consider a case in which you have a very long strip of land that has experienced FIVE different geological phenomena. What you want to prove is that the crop productivity of each piece of land is different. So you make FIVE beds and plant the same kind of seeds. You measure the location of each seed from the origin. Each bed has some dozens of seeds, which are very close to each other but at different distances. On the other hand, the planting beds are so far apart that plants in test bed A cannot be said to affect plants in B. In other words, A and B are independent, which suits my statistical inference procedure based on the F-test. All you need to do, after a few months, is measure the total weight of the crop yield from each plant (with measurement errors).

Now, let’s look at the plot above. If you replace distance with time and weight with flux, the pattern of data collection and its statistical inference procedure match one-way ANOVA. Apart from the complication in statistical inference due to measurement errors, it is hard to say that this data set was designed for time series analysis. How to design a statistical study with measurement errors, huge gaps in time, and unequal time intervals is complex and unexplored. It depends highly on the choice of inference methods, the assumptions on errors (i.e., likelihood function construction), prior selection, and distribution family properties.

Speaking of ANOVA, using the F-test means that we assume the residuals are Gaussian, so one can comfortably modify the model with additive measurement errors. Here I assume there is no correlation between measurement errors and plant beds. How to parameterize the measurement errors in the model depends on such assumptions, as does how to assess the sampling distribution and test statistics.

Although I know this Crab nebula data set is not for the one-way ANOVA, the pattern in the scatter plot drove me to test the data set. The output said to reject the null hypothesis of statistically equivalent flux in FIVE time blocks. The following is R output without measurement errors.

           Df Sum Sq Mean Sq F value    Pr(>F)
factor      4 4041.8  1010.4  143.53 < 2.2e-16 ***
Residuals 329 2316.2     7.0
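
As a rough cross-check of the table, the F value is simply the ratio of the two mean squares, each being a sum of squares divided by its degrees of freedom. The sketch below is not the R code behind the output above. It is a minimal pure-Python version of the one-way ANOVA arithmetic, with a hypothetical helper `one_way_anova` and made-up toy fluxes rather than the Crab data.

```python
def one_way_anova(groups):
    """Return (df_between, df_within, F) for a list of samples, one list per group."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group (residual) sum of squares: points around their own group mean
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, f_value

# Consistency check using only the numbers printed in the table
f_from_table = (4041.8 / 4) / (2316.2 / 329)
print(round(f_from_table, 2))

# Toy usage with made-up fluxes in three time blocks (NOT the Crab data)
toy = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
print(one_way_anova(toy))
```

The degrees of freedom in the table (4 and 329) correspond to five blocks and 334 flux measurements in total. Like the R run, this sketch ignores measurement errors.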

If the gaps were minor, I would next consider time series with missing data. However, the missing pattern does not agree with my knowledge of missing data analysis. I wonder how astronomers handle such big gaps in time series data, what assumptions they make to get a best fit and its error bars, how the measurement errors are incorporated into the statistical model, what the objective of the statistical inference is, how they relate physical meaning to statistically significant parameter estimates, how they assess whether the model choice is proper, and more. When the contest is over, I’d like to check out, if available, any statistical treatments that answer these questions. I hope there are scientists who consider similar statistical issues in these data sets from the INTEGRAL team.

accessing data, easier than before but… Tue, 20 Jan 2009 17:59:56 +0000 hlee Someone emailed me asking for the globular cluster data sets I used in a proceedings paper, which was about how to determine multi-modality (multiple populations) with well-known and new information criteria, without binning the luminosity functions. I spent quite some time understanding the data sets with suspicious numbers of globular cluster populations. On the other hand, obtaining globular cluster data sets was easy because of available data archives such as VizieR. Most data sets appear in charts/tables, and I acquire those data from VizieR. To understand the science behind those data sets, I check ADS. Well, actually it happens the other way around: check scientific background first to assess whether there is room for statistics, then search for available data sets.

However, if you are interested in massive multivariate data, or if you want a subsample from a gigantic survey project that cannot be documented in full the way individual small catalogs are, you might like to learn a little Structured Query Language (SQL). With nice examples and explanations, terabytes of data are available from SDSS. Instead of images in FITS format, one can get ASCII/table data sets (variables for millions of objects are magnitudes and their errors; positions and their errors; classes like stars, galaxies, AGNs; types or subclasses like elliptical galaxies, spiral galaxies, type I AGN, type Ia, Ib, Ic, and II SNe, various spectral types, etc.; estimated variables like photo-z, which is my keen interest; and more). Furthermore, thousands of papers related to SDSS are available to satisfy your scientific cravings. (Here are Slog postings under SDSS tag).

If you don’t want to limit yourself to ASCII tables, you may like to check the quick guide/tutorial of Gator, which aggregates archives of various missions: 2MASS (Two Micron All-Sky Survey), IRAS (Infrared Astronomical Satellite), Spitzer Space Telescope Legacy Science Programs, MSX (Midcourse Space Experiment), COSMOS (Cosmic Evolution Survey), DENIS (Deep Near Infrared Survey of the Southern Sky), and USNO-B (United States Naval Observatory B1 Catalog). You probably also want to check NED, the NASA/IPAC Extragalactic Database. As of today, the website says that 163 million objects, 170 million multiwavelength object cross-IDs, 188 thousand associations (candidate cross-IDs), 1.4 million redshifts, and 1.7 billion photometric measurements are accessible, which seems more than enough for data mining, exploring/summarizing data, and developing streaming/massive data analysis tools.

Astronomers might wonder why I’m not advertising the Chandra Data Archive (CDA) and its project-oriented catalogs/databases. All I can say is that it is not friendly to an independent statistician. It is very likely that I am the only statistician who has tried to use data from CDA directly and bothered to understand the contents. I can assure you that without astronomers’ help, the archive is just a hot potato. You don’t want to touch it. I’ve been there. Regardless of how painful it is, I’ve kept trying to touch it, since it’s hard to resist once you know what’s in there. Fortunately, there are other archives friendly to data scientists that cause much less suffering than CDA. There are plenty of things statisticians can do, not just with CDA but with other astronomical data archives: improve astronomers’ decades-old data analysis algorithms based on the Gaussian distribution, the iid assumption, or the L2 norm; and develop robust analysis strategies that reflect the true nature of the data under more relaxed assumptions than the traditionally pursued parametric distributions with specific models (a distribution-free method is more robust than one based on the Gaussian distribution, but the latter is more efficient). Archives like VizieR or SDSS provide data sets that are less painful to explore without familiarity with astronomical software/packages.

Computer scientists are well aware of the UCI machine learning archive, with which they can compare their new methods against previous ones and show empirically how superior their methods are. Statisticians are used to handling well-trimmed data; otherwise we suggest strategies for collecting data for statistical inference. Although tons of data collection and sampling protocols exist, most of them do not match the data formats, types, and natures, or the way data are collected by observing the sky through complexly structured instruments. Some archives might be exclusive to the funded researchers and their beneficiaries. Some archives might be super hot potatoes that no statistician wants to get involved with, even though they are free of charge. Overall, I’d like to warn you not to expect the well-tabulated simplicity of the textbook data sets found in exploratory data analysis and machine learning books.

Someone will raise another question: why do I not discuss VOs (virtual observatories, click for slog postings) and Google Sky (click for slog postings), which I have praised in the slog many times as good resources for exploring the sky and learning astronomy? Unfortunately, for the purpose of direct statistical applications, neither VOs nor Google Sky may be as attractive as their names suggest. You are very likely to spend hours exploring these facilities and end up at one of the archives or web interfaces that I mentioned above. It would be easier to talk to your nearest astronomer, who hopefully is aware of the importance of statistics and could offer you a statistically challenging data set, sparing you worries about how to process and clean raw data sets and how to build statistically suitable catalogs/databases. Every astronomer in a survey project builds his/her catalog and finds common factors/summary statistics of the catalog from the perspective of understanding/summarizing data, the primary goal of executing statistical analyses.

I believe some astronomers want to advertise their archives and show off how public-friendly they are. Such advertising comments are very welcome, because I intentionally left room for them instead of listing more archives that I have heard of but have no hands-on experience with. My only wish is that more statisticians use astronomical data from these archives, so that the application sections of their papers are filled with data from them. As with sunspots, I wish that more astronomical data sets could be used to validate methodologies, algorithms, and eventually theories. I sincerely wish that this happens soon, before I drift away from astrostatistics and, to make ends meet, can no longer preach about the benefits of astronomical data and their archives.

There is no single well-known data repository in astronomy like the UCI machine learning archive. Nevertheless, I can assure you that astronomical data and catalogs pose various statistical problems, many of which have never been properly formulated as statistical inference problems. There are so many statistical challenges residing in them. Not enough statisticians bother to look at these data because of the gigantic demand for statisticians from countless data-oriented scientific disciplines and the persistent shortage in supply.

Likelihood Ratio Technique Thu, 15 Jan 2009 22:01:28 +0000 hlee I wonder what Fisher, Neyman, and Pearson would say if they saw “Technique” after “Likelihood Ratio” instead of “Test.” When a presenter said “Likelihood Ratio Technique” for source identification, I couldn’t resist checking it out, so as not to offend the founding fathers of the likelihood principle in statistics, since “Technique” sounded derogatory to my ears when attached to “Likelihood.” Above all, I thank the speaker, who kindly gave me the reference for this likelihood ratio technique.

On the likelihood ratio for source identification by Sutherland and Saunders (1992) in MNRAS vol. 259, pp. 413-420.

Their computed likelihood ratio (L) corresponds to a Bayes factor in its form, P(source model)/P(background model). Since the outcome is binary, source or background, L shares the form of odds: L = p(source)/p(not source) = p(source)/(1-p(source)). Since a likelihood can be based on a probability density function, the authors defined “likelihood ratio” literally, by taking the ratio of two likelihood functions. As they do not take the statistical direction of the likelihood ratio test and the Neyman-Pearson lemma, naming their method the “likelihood ratio technique” seems proper, and it’s not derogatory any more. The focus of the paper is on estimating the probability density functions of backgrounds and sources more or less empirically, without concern for general statistical inference. Thus a large Bayes factor, a large L (likelihood ratio) of a source, or a large posterior probability of a source (p(genuine|m,c,x,y)=L/(1+L)) is just an indicator that the given source is more likely a real source.
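
To make the mapping between L and the posterior probability concrete, here is a minimal sketch of the relation p = L/(1+L) quoted above and its inverse. The function names are hypothetical. This is only the two-outcome (source versus background) arithmetic, not the full procedure of Sutherland and Saunders, which estimates the likelihoods themselves empirically.

```python
def posterior_from_ratio(L):
    """Probability that a candidate is genuine, p = L / (1 + L)."""
    return L / (1.0 + L)

def ratio_from_posterior(p):
    """Inverse mapping: L = p / (1 - p), for p strictly below 1."""
    return p / (1.0 - p)

# A ratio of 1 is a coin flip; larger ratios push the probability toward 1
for L in (0.1, 1.0, 10.0, 100.0):
    print(L, round(posterior_from_ratio(L), 3))
```

A ratio of 1 gives a probability of one half, which is why a large L on its own serves only as an indicator and not as a formal test.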

In summary, the likelihoods of source and background (numerator and denominator) are obtained empirically, based on physics, and turn out to match parametric distributions well discussed in statistics. What differs from statistics is that the likelihood ratio does not lead to hypothesis testing based on the Neyman-Pearson lemma. The computed likelihood ratio is used as an indicator of a source. Well, it is often hard to judge the real content of an astronomical study by its name, title, or abstract, owing to my statistically oriented stereotypes.

Did they, or didn’t they? Tue, 20 May 2008 04:10:23 +0000 vlk Earlier this year, Peter Edmonds showed me a press release that the Chandra folks were, at the time, considering putting out describing the possible identification of a Type Ia Supernova progenitor. What appeared to be an accreting white dwarf binary system could be discerned in 4-year old observations, coincident with the location of a supernova that went off in November 2007 (SN2007on). An amazing discovery, but there is a hitch.

And it is a statistical hitch, and involves two otherwise highly reliable and oft used methods giving contradictory answers at nearly the same significance level! Does this mean that the chances are actually 50-50? Really, we need a bona fide statistician to take a look and point out the errors of our ways.

The first time around, Voss & Nelemans (arXiv:0802.2082) looked at how many X-ray sources there were around the candidate progenitor of SN2007on (they also looked at 4 more galaxies that hosted Type Ia SNe and that had X-ray data taken prior to the event, but didn’t find any other candidates), and estimated the probability of chance coincidence with the optical position. When you expect 2.2 X-ray sources/arcmin² near the optical source, the probability of finding one within 1.3 arcsec is tiny, and in fact is around 0.3%. This result has since been reported in Nature.
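
The 0.3% figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes that unrelated X-ray sources are scattered uniformly at random (a Poisson process) at the quoted surface density. That is an assumption made here for illustration, and the published estimate may have been computed differently. The function name is hypothetical.

```python
import math

def chance_coincidence_prob(density_per_arcmin2, radius_arcsec):
    """P(at least one unrelated source within the radius) for Poisson-scattered sources."""
    radius_arcmin = radius_arcsec / 60.0
    # Expected number of unrelated sources inside the search circle
    expected = density_per_arcmin2 * math.pi * radius_arcmin ** 2
    return 1.0 - math.exp(-expected)

p = chance_coincidence_prob(2.2, 1.3)
print(f"{100 * p:.2f}%")
```

Because the expected count inside such a small circle is far below one, the probability is essentially equal to that expected count, about 0.003.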

However, Roelofs et al. (arXiv:0802.2097) went about getting better optical positions and doing better bore-sighting, and as a result, they measured the X-ray position accurately and also carried out Monte Carlo simulations to estimate the error on the measured location. And they concluded that the actual separation, given the measurement error in the location, is too large to be a chance coincidence, 1.18±0.27 arcsec. The probability of finding offsets in the observed range is ~1% [see Tom's clarifying comment below].

Well now, ain’t that a nice pickle?

To recap: there are so few X-ray sources in the vicinity of the supernova that anything close to its optical position cannot be a coincidence, BUT, the measured error in the position of the X-ray source is not copacetic with the optical position. So the question for statisticians now: which argument do you believe? Or is there a way to reconcile these two calculations?

Oh, and just to complicate matters, the X-ray source that was present 4 years ago had disappeared when looked for in December, as one would expect if it was indeed the progenitor. But on the other hand, a lot of things can happen in 4 years, even with astronomical sources, so that doesn’t really confirm a physical link.

Wed, 12 Mar 2008 19:32:49 +0000 hlee, a cool website I heard about in Harvard Astronomy Professor Doug Finkbeiner’s class (Principles of Astronomical Measurements), does a complex job of matching your images of unknown locations or coordinates to sources in catalogs. Given your images in various formats, it provides astrometric calibration metadata and lists of known objects falling inside the field of view.

Astrometry is a branch of astronomy, but the algorithms for locating stars and galaxies mainly come from computer scientists, whose fundamental ideas are from statistics and mathematics.

[ArXiv] A fast Bayesian object detection Wed, 05 Mar 2008 21:46:48 +0000 hlee This is quite a long paper that I separated from [ArXiv] 4th week, Feb. 2008:
      [astro-ph:0802.3916] P. Carvalho, G. Rocha, & M.P.Hobson
      A fast Bayesian approach to discrete object detection in astronomical datasets – PowellSnakes I
As the title suggests, it describes Bayesian source detection and gives me a chance to learn the foundations of source detection in astronomy.

First, I’d like to point out that my initial concerns from [astro-ph:0707.1611] Probabilistic Cross-Identification of Astronomical Sources are addressed in sections 2, 3, and 6, which cover parameter space, its dimensionality, and priors in Bayesian model selection.

Second, I’d like to list the contents of the paper concisely: (1) priors, of various types, though room was left for further investigation; (2) templates (such as the point spread function, I guess), crucial for defining sources, and a Gaussian random field for noise; (3) optimization strategies for fast computation (source detection implies finding maxima and integrating for the evidence); (4) comparison with other works; (5) an upper bound, tuning the threshold for acceptance/rejection to minimize the symmetric loss; (6) challenges of dealing with likelihoods in Fourier space, arising from incorporating colored noise (as opposed to white noise); (7) decision theory, from computing false negatives (undetected objects) and false positives (spurious objects). Many issues in computing the Bayesian evidence, priors, tuning-parameter-relevant posteriors, and the peaks of maximum likelihoods, and in approximating templates and backgrounds, are carefully presented. The conclusion summarizes their PowellSnakes algorithm pictorially.

Third, although my understanding of object detection and its link to Bayesian techniques is very superficial, my reading of this paper tells me that they propose some clever ways of searching the full 4-dimensional space via Powell minimization (it seems to be related to profile likelihoods for fast computation, but this was not explicitly mentioned), and the details could direct statisticians’ attention to improving computing efficiency and acceleration.

Fourth, I’d like to talk about what I learned from this paper about errors in astronomy. Statisticians are usually surprised that astronomical catalogs generally come with errors next to single measurements. These errors are not measurement errors (errors calculated from repeated observations) but are obtained from the Fisher information via the Cramér-Rao lower bound. The template likelihood function leads to this uncertainty measure on each observation.
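
As an illustration of how a single observation can carry an error bar, consider the simplest textbook case, which is an assumption made here and not the model of the paper: the data are a known template t scaled by an unknown amplitude A, plus independent Gaussian noise of standard deviation sigma in every pixel. The Fisher information for A is then sum(t_i^2)/sigma^2, and the Cramér-Rao lower bound on the standard deviation of any unbiased estimate of A is its inverse square root. The Gaussian profile below merely stands in for a point spread function, and all names and numbers are hypothetical.

```python
import math

def gaussian_template(n_pix, center, width):
    """Unit-peak Gaussian profile sampled on n_pix pixels (a stand-in for a PSF)."""
    return [math.exp(-0.5 * ((i - center) / width) ** 2) for i in range(n_pix)]

def amplitude_crlb(template, noise_sigma):
    """Cramér-Rao lower bound on the std. dev. of amplitude A in data = A * template + noise."""
    # Fisher information for A under white Gaussian noise of std. dev. noise_sigma
    fisher_info = sum(t * t for t in template) / noise_sigma ** 2
    return 1.0 / math.sqrt(fisher_info)

t = gaussian_template(n_pix=41, center=20, width=3.0)
print(amplitude_crlb(t, noise_sigma=1.0))
```

The bound depends only on the template shape and the noise level, not on repeated observations, which is why such an error can be attached to one measurement.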

Lastly, in astronomy there are many empirical rules, effects, and laws that bear uncommon names. Generally these are sophisticated rules of thumb or approximations of some phenomenon (for instance, Hubble’s law, though it is well known), but they have driven statisticians away from reading astronomy papers. On the other hand, despite the overwhelming names, when it gets to the point, the objective behind such names is very statistical: regression (fitting), estimating parameters and their uncertainties, goodness-of-fit, truncated data, fast optimization algorithms, machine learning, etc. This paper mentions the Sunyaev-Zel’dovich effect, a name that scared me, but I’d like to emphasize that this kind of nomenclature may hinder understanding of the details but should not block any collaboration.

~ Avalanche(a,b) Sun, 14 Oct 2007 01:14:21 +0000 vlk Avalanches are a common process, occurring anywhere that a system can store stress temporarily without “snapping”. They can happen on sand dunes and in solar flares as easily as in the snowbound Alps.

Melatos, Peralta, & Wyithe (arXiv:0710.1021) have a nice summary of avalanche processes in the context of pulsar glitches. Their primary purpose is to show that the glitches are indeed consistent with an avalanche, and along the way they give a highly readable description of what an avalanche is and what it entails. Briefly, avalanches result in event parameters that are distributed in scale invariant fashion (read: power laws) with exponential waiting time distributions (i.e., Poisson).

Hence the title of this post: the “Avalanche distribution” (indulge me! I’m using stats notation to bury complications!) can be thought to have two parameters, both describing the indices of power-law distributions that control the event sizes, a, and the event durations, b, and where the event separations are distributed as an exponential decay. Is there a canned statistical distribution that describes all this already? (In our work modeling stellar flares, we assumed that b=0 and found that a<-2, which has all sorts of nice consequences for coronal heating processes.)
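
For readers who want to play with such a process, here is a minimal simulation sketch. It is not a canned distribution and not the model of Melatos, Peralta, & Wyithe. It assumes, purely for illustration, that sizes and durations are drawn independently from power laws with indices a and b above a lower cutoff `x_min`, and that waiting times are exponential with a given rate. The function names, the cutoff, and the independence are all assumptions, and the inverse-transform sampler only works for indices below -1.

```python
import random

def sample_power_law(index, x_min, rng):
    """Draw x >= x_min with density proportional to x**index (requires index < -1)."""
    u = rng.random()  # uniform on [0, 1), so 1 - u is never zero
    return x_min * (1.0 - u) ** (1.0 / (index + 1.0))

def sample_avalanches(n, a, b, rate, rng=None, x_min=1.0):
    """Return n events as (start_time, size, duration) tuples."""
    rng = rng or random.Random()
    t, events = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate)  # exponential waiting time (Poisson process)
        size = sample_power_law(a, x_min, rng)
        duration = sample_power_law(b, x_min, rng)
        events.append((t, size, duration))
    return events

events = sample_avalanches(5, a=-2.5, b=-2.0, rate=0.1, rng=random.Random(0))
for start, size, duration in events:
    print(f"t={start:.1f} size={size:.2f} duration={duration:.2f}")
```

The b=0 case mentioned for stellar flares cannot be sampled this way, since a flat density has no normalizable tail. It would need an explicit upper cutoff instead.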
