The AstroStat Slog » statistical inference

accessing data, easier than before but…

hlee — Tue, 20 Jan 2009 17:59:56 +0000

Someone emailed me asking for the globular cluster data sets I used in a proceedings paper, which was about how to determine multi-modality (multiple populations) based on well-known and new information criteria without binning the luminosity functions. I spent quite some time understanding the data sets with suspicious numbers of globular cluster populations. On the other hand, obtaining globular cluster data sets was easy because of available data archives such as VizieR. Most of the data sets shown in charts/tables I acquire from VizieR. In order to understand the science behind those data sets, I check ADS. Well, actually it happens the other way around: I check the scientific background first to assess whether there is room for statistics, then search for available data sets.

However, if you are interested in massive multivariate data, or if you want a subsample from a gigantic survey project whose contents, unlike those of individual small catalogs, cannot all be documented, you might like to learn a little about Structured Query Language (SQL). Terabytes of data are available from SDSS, along with nice examples and explanations. Instead of images in FITS format, one can get ASCII/table data sets (the variables for millions of objects include magnitudes and their errors; positions and their errors; classes like stars, galaxies, AGNs; types or subclasses like elliptical galaxies, spiral galaxies, type I AGN, type Ia, Ib, Ic, and II SNe, various spectral types, etc.; estimated variables like photo-z, which is of keen interest to me; and more). Furthermore, thousands of papers related to SDSS are available to satisfy your scientific cravings. (Here are Slog postings under SDSS tag).
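To give a flavor of what selecting a subsample with SQL looks like, here is a minimal sketch in Python using the standard sqlite3 module and a tiny in-memory table. The table name photo, its columns, and every value in it are made up for illustration; this is not the SDSS schema, and the SDSS query interface has its own table names and SQL dialect.

```python
import sqlite3

# Hypothetical toy catalog: (object id, class, g magnitude, r magnitude).
# These rows are invented for illustration, not taken from any survey.
ROWS = [
    (1, "GALAXY", 18.2, 17.5),
    (2, "STAR", 15.1, 14.9),
    (3, "GALAXY", 20.4, 19.3),
    (4, "QSO", 19.0, 18.8),
    (5, "GALAXY", 17.9, 17.1),
]


def bright_galaxies(r_limit):
    """Return (obj_id, g - r color) for galaxies with r < r_limit, brightest first."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE photo (obj_id INTEGER, obj_class TEXT, g REAL, r REAL)"
        )
        con.executemany("INSERT INTO photo VALUES (?, ?, ?, ?)", ROWS)
        query = """
            SELECT obj_id, g - r AS color
            FROM photo
            WHERE obj_class = 'GALAXY' AND r < ?
            ORDER BY r
        """
        return con.execute(query, (r_limit,)).fetchall()
    finally:
        con.close()


subsample = bright_galaxies(18.0)
print(subsample)
```

The point is only that one SELECT with a WHERE clause replaces downloading a whole catalog and filtering it by hand; the same idea scales to a survey database with millions of rows.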

If you don’t want to limit yourself to ASCII tables, you may like to check the quick guide/tutorial of Gator, which aggregates archives of various missions: 2MASS (Two Micron All-Sky Survey), IRAS (Infrared Astronomical Satellite), Spitzer Space Telescope Legacy Science Programs, MSX (Midcourse Space Experiment), COSMOS (Cosmic Evolution Survey), DENIS (Deep Near Infrared Survey of the Southern Sky), and USNO-B (United States Naval Observatory B1 Catalog). You probably also want to check NED, the NASA/IPAC Extragalactic Database. As of today, the website says that 163 million objects, 170 million multiwavelength object cross-IDs, 188 thousand associations (candidate cross-IDs), 1.4 million redshifts, and 1.7 billion photometric measurements are accessible, which seems more than enough for data mining, exploring/summarizing data, and developing streaming/massive data analysis tools.

Astronomers might wonder why I’m not advertising the Chandra Data Archive (CDA) and its project-oriented catalogs/databases. All I can say is that it is not friendly to independent statisticians. It is very likely that I am the only statistician who has tried to use data from CDA directly and bothered to understand the contents. I can assure you that without astronomers’ help, the archive is just a hot potato. You don’t want to touch it. I’ve been there. Regardless of how painful it is, I’ve kept trying to touch it, since it’s hard to resist after knowing what’s in there. Fortunately, there are other archives, friendlier to data scientists, that involve much less suffering than CDA. There is a plethora of things statisticians can do, not just with CDA but with other astronomical data archives: improve astronomers’ decades-old data analysis algorithms based on the Gaussian distribution, the iid assumption, or the L₂ norm; and develop robust analysis strategies that reflect the true nature of the data under more relaxed assumptions than the traditionally pursued parametric distributions with specific models (a distribution-free method is more robust than one assuming a Gaussian distribution, but the latter is more efficient when the assumption holds). Archives like VizieR or SDSS provide data sets that are less painful to explore without familiarity with astronomical software/packages.

Computer scientists are well aware of the UCI machine learning archive, with which they can compare their new methods against previous ones and empirically show how superior their methods are. Statisticians are used to handling well-trimmed data; otherwise we suggest strategies for how to collect data for statistical inference. Although tons of data collecting and sampling protocols exist, most of them do not match the formats, types, and nature of astronomical data, or the way the data are collected by observing the sky with complexly structured instruments. Some archives might be largely exclusive to the funded researchers and their beneficiaries. Some archives might be super hot potatoes that no statistician wants to get involved with, even though they are free of charge. Overall, I’d like to warn you not to expect the well-tabulated simplicity of textbook data sets found in exploratory data analysis and machine learning books.

Someone will raise another question: why do I not discuss VOs (virtual observatories, click for slog postings) and Google Sky (click for slog postings), which I have praised in the slog many times as good resources to explore the sky and to learn astronomy? Unfortunately, for the purpose of direct statistical applications, neither VOs nor Google Sky may be as attractive as their names suggest. It is very likely that you will spend hours exploring these facilities and later end up with one of the archives or web interfaces that I mentioned above. It would be easier to talk to your nearest astronomer, who hopefully is aware of the importance of statistics and could offer you a statistically challenging data set, so that you need not worry about how to process and clean raw data sets or how to build statistically suitable catalogs/databases. Every astronomer on a survey project builds his/her catalog and finds common factors/summary statistics of the catalog from the perspective of understanding/summarizing data, the primary goal of executing statistical analyses.

I believe some astronomers want to advertise their archives and show off how public-friendly they are. Such advertising comments are very welcome, because I intentionally left room for them instead of listing more archives that I have heard of but have no hands-on experience with. My only wish is that more statisticians use astronomical data from these archives so that the application sections of their papers are filled with data from these archives. As with sunspots, I wish that more astronomical data sets were used to validate methodologies, algorithms, and eventually theories. I sincerely wish that this happens soon, before I become adrift from astrostatistics and, to make ends meet, can no longer preach about the benefits of astronomical data and their archives.

There is no single well-known data repository in astronomy like the UCI machine learning archive. Nevertheless, I can assure you that the nature of astronomical data and catalogs poses various statistical problems, and many of those problems have never been properly formulated as statistical inference problems. There are so many statistical challenges residing in them. Not enough statisticians bother to look at these data because of the gigantic demand for statisticians from countless data-oriented scientific disciplines and the persistent shortage in supply.

Signal Processing and Bootstrap

hlee — Wed, 30 Jan 2008 06:33:25 +0000

Astronomers have developed their ways of processing signals almost independently of, but sometimes in collaboration with, engineers, although the fundamental goal of signal processing is the same: extracting information. Doubtless, these two parallel roads of astronomers and engineers have been pointing in opposite directions: one toward the sky and the other toward the earth. Nevertheless, without much argument, we could say that statistics has served as the medium of signal processing for both scientists and engineers. This particular issue of IEEE Signal Processing Magazine may shed light for astronomers interested in signal processing and statistics outside the astronomical community.

IEEE Signal Processing Magazine Jul. 2007 Vol 24 Issue 4: Bootstrap methods in signal processing

This link will show the table of contents and provide links to articles; however, access to the papers requires an IEEE Xplore subscription via libraries or an individual IEEE membership. Here, I’d like to introduce some of the articles and tutorials.

Special topic on bootstrap:
The guest editors (A.M. Zoubir & D.R. Iskander)^[1] open the issue with their editorial, Bootstrap Methods in Signal Processing, which lays out the rationale: the Gaussian noise assumption is occasionally invalid, and dropping it leads to complex modeling. A practical approach has been Monte Carlo simulation, but the cost of repeating experiments is problematic. The suggested alternative is the bootstrap, which provides tools for designing detectors for various signals subject to noise or interference from unknown distributions. The bootstrap is described as a computer-intensive tool for answering inferential questions, and the issue serves as a set of tutorials introducing this computationally intensive statistical method to the signal processing community.

The first tutorial, Bootstrap Methods and Applications, is written by the two guest editors. It begins with a list of bootstrap methods and emphasizes the resilience of the bootstrap. It discusses the number of bootstrap samples needed to balance simulation (Monte Carlo) error against statistical error, and the sampling methods for dependent data, with real examples. The flowchart in Fig. 9 summarizes the guidelines for how to use the bootstrap methods.
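For readers who have not met the bootstrap before, the basic idea can be sketched in a few lines of Python. This is a minimal illustration of the percentile bootstrap for independent data, not the procedure from the tutorial; the function name, the default of 2000 bootstrap samples, and the data values are my own assumptions.

```python
import random
import statistics


def bootstrap_percentile_ci(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data).

    Assumes the observations are independent; dependent data need a
    different resampling scheme (for example, resampling blocks).
    """
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # Draw n observations with replacement and recompute the statistic.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(stat(resample))
    replicates.sort()
    # Cut off the same number of replicates in each tail.
    k = int(n_boot * alpha / 2)
    return replicates[k], replicates[n_boot - 1 - k]


# Made-up measurements, for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 2.8, 4.4, 3.1, 2.5, 6.2, 3.9, 2.2, 4.8, 3.3, 2.9, 5.1]
lo, hi = bootstrap_percentile_ci(data, statistics.median)
print("95% bootstrap interval for the median:", lo, hi)
```

The n_boot argument is where the trade-off between simulation error and statistical error shows up: more bootstrap samples reduce the Monte Carlo noise in the interval endpoints but cannot reduce the uncertainty coming from the data themselves.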

The title of the second tutorial is Jackknifing Multitaper Spectrum Estimates (D.J. Thomson), which introduces the jackknife and multitaper estimates of spectra, and applies the former to the latter with real data sets. The author explains why he prefers the jackknife to the bootstrap and discusses the underlying assumptions of resampling methods.
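As a reminder of what the jackknife does, here is a minimal Python sketch of the delete-one jackknife standard error for a generic statistic. It is the textbook version applied to a plain sample, not the multitaper version in Thomson's tutorial; the function name and the data values are assumptions made for illustration.

```python
import math
import statistics


def jackknife_se(data, stat):
    """Delete-one jackknife estimate of the standard error of stat(data)."""
    n = len(data)
    # Recompute the statistic n times, leaving out one observation each time.
    leave_one_out = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((v - center) ** 2 for v in leave_one_out)
    return math.sqrt(variance)


# Made-up measurements, for illustration only.
sample = [2.1, 3.4, 1.9, 5.6, 2.8, 4.4, 3.1, 2.5, 6.2, 3.9]
se_mean = jackknife_se(sample, statistics.mean)
print("jackknife standard error of the mean:", se_mean)
```

For the sample mean, this jackknife estimate coincides with the usual standard error s/sqrt(n), which is a handy sanity check before applying it to statistics that have no closed-form standard error.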

Instead of listing all articles from the special issue, a few astrostatistically notable articles are chosen:

Bootstrap-Inspired Techniques in Computational Intelligence (R. Polikar) explains the bootstrap for estimating errors; the algorithms of bagging, boosting, and AdaBoost; and other bootstrap-inspired techniques in ensemble systems, with a discussion of missing data.
Bootstrap for Empirical Multifractal Analysis (H. Wendt, P. Abry & S. Jaffard) explains block bootstrap methods for dependent data, bootstrap confidence limits, and bootstrap hypothesis testing, in addition to multifractal analysis. Because of my lack of familiarity with wavelet leaders, instead of paraphrasing I quote the article’s conclusion:

First, besides being mathematically well-grounded with respect to multifractal analysis, wavelet leaders exhibit significantly enhanced statistical performance compared to wavelet coefficients. … Second, bootstrap procedures provide practitioners with satisfactory confidence limits and hypothesis test p-values for multifractal parameters. Third, the computationally cheap percentile method achieves already excellent performance for both confidence limits and tests.
Wild Bootstrap Test (J. Franke & S. Halim) discusses residual-based nonparametric tests and the wild bootstrap for regression models, applicable to signal/image analysis. Their test checks the differences between two irregular signals/images.
Nonparametric Estimates of Biological Transducer Functions (D.H. Foster & K. Zychaluk): I like the part where they discuss the generalized linear model (GLM), which is useful for extending the techniques of model fitting/model estimation in astronomy beyond Gaussian and least squares. They also mention that the bootstrap is simpler for getting confidence intervals.
Bootstrap Particle Filtering (J.V. Candy): a very pleasant read on Bayesian signal processing and particle filters. It overviews MCMC and state space models, and explains resampling as a remedy for the shortcomings of importance sampling in signal processing.
Compressive Sensing (R.G. Baraniuk)

A lecture note presents a new method to capture and represent compressible signals at a rate significantly below the Nyquist rate. This method employs nonadaptive linear projections that preserve the structure of the signal;
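Several of the articles above resample dependent data, where drawing single observations at random would destroy the correlation structure. One common remedy mentioned above is the block bootstrap. The following is a minimal Python sketch of a moving block bootstrap, assuming a one-dimensional series and a block length chosen by the user; the function name and the toy series are hypothetical, and it is not the specific procedure of any article in the issue.

```python
import random


def moving_block_bootstrap(series, block_len, rng):
    """Return one resampled series built from randomly chosen overlapping blocks.

    Consecutive values inside each block keep their original order, so
    short-range dependence is preserved within blocks.
    """
    n = len(series)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(series)")
    n_starts = n - block_len + 1  # number of admissible block start positions
    resampled = []
    while len(resampled) < n:
        start = rng.randrange(n_starts)
        resampled.extend(series[start:start + block_len])
    return resampled[:n]  # trim the last block if it overshoots


# Toy series: the integers 0..19 make the copied blocks easy to see.
series = list(range(20))
rng = random.Random(0)
resampled = moving_block_bootstrap(series, 5, rng)
print(resampled)
```

The block length controls the trade-off: longer blocks keep more of the dependence but leave fewer distinct blocks to draw from. A statistic recomputed on many such resampled series gives bootstrap confidence limits in the same way as for independent data.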

I hope this brief summary helps you select a few interesting articles.

[1] The guest editors wrote a book, the bootstrap and its application in signal processing.