The AstroStat Slog » nonparametric http://hea-www.harvard.edu/AstroStat/slog Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

[ArXiv] Voronoi Tessellations http://hea-www.harvard.edu/AstroStat/slog/2009/arxiv-voronoi-tessellations/ Wed, 28 Oct 2009 hlee

As part of exploring the spatial distribution of particles/objects, without approximating it by a Poisson or Gaussian process (parametric) and without imposing hypotheses such as homogeneity, isotropy, or uniformity, various nonparametric methods have drawn my attention for data exploration and preliminary analysis. Among these nonparametric methods, the one I fell in love with is tessellation (state-space approaches are excluded here). In terms of computational speed, I believe tessellation is faster than kernel density estimation for estimating level sets of multivariate data. Furthermore, constructing polygons from a tessellation is conceptually simple and intuitive. Coding and improving the algorithms, however, lies beyond statistical research (check books titled or keyworded, at least in part, with computational geometry). The good news is that, for computation and getting results, freely available software, packages, and modules exist in various forms.
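As a concrete starting point, here is a minimal sketch (my own illustration, not tied to any of the packages mentioned above) of Voronoi-based density estimation with scipy.spatial: the local density at each point is taken as the inverse area of its Voronoi cell, and cells that are unbounded at the edge of the data cloud are simply skipped.

```python
# Minimal sketch: inverse Voronoi cell area as a 2-D density estimate.
# The data, the normalization, and the handling of edge cells are all
# simplifying assumptions for illustration only.
import numpy as np
from scipy.spatial import Voronoi, ConvexHull

rng = np.random.default_rng(1)
pts = rng.standard_normal((500, 2))        # toy "catalog" of 2-D positions

vor = Voronoi(pts)
dens = np.full(len(pts), np.nan)
for i, reg_idx in enumerate(vor.point_region):
    region = vor.regions[reg_idx]
    if len(region) == 0 or -1 in region:   # unbounded cell at the boundary: skip
        continue
    area = ConvexHull(vor.vertices[region]).volume   # "volume" is area in 2-D
    dens[i] = 1.0 / (len(pts) * area)      # rough normalization, ignoring edge cells
```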

As a part of introducing nonparametric statistics, I wanted to write about applications of computational geometry from the perspective of nonparametric density estimation in two or three dimensions. The following article also came along just as I began to collect statistical applications in astronomy (my [ArXiv] series). This [arXiv] paper, in fact, prompted me to investigate Voronoi tessellations in astronomy in general.

[arxiv/astro-ph:0707.2877]
Voronoi Tessellations and the Cosmic Web: Spatial Patterns and Clustering across the Universe
by Rien van de Weygaert

Since then, quite some time has passed. In the meantime, I found more publications in astronomy that specifically use tessellation as a main tool for nonparametric density estimation and data analysis. Nonetheless, topics in spatial statistics in general tend to be unrecognized or almost ignored in analyzing astronomical spatial data (I mean data points with coordinate information). Many papers seem to utilize statistics only partially, or not at all. Some might want to know how often Voronoi tessellation is applied in astronomy. Here I list the results of my ADS search, restricted to papers with tessellation among the title keywords:

Then the topic was forgotten for a while, until this recent [arXiv] paper reminded me of my old intention to introduce tessellation for density estimation and for understanding large scale structures or clusters (astronomers’ jargon, not the term used in machine or statistical learning).

[arxiv:stat.ME:0910.1473] Moment Analysis of the Delaunay Tessellation Field Estimator
by M.N.M. van Lieshout

Looking at the plots in the papers by van de Weygaert or van Lieshout, without mathematical jargon and abstraction, one can immediately understand what Voronoi and Delaunay tessellations are (a Delaunay tessellation is also called a Delaunay triangulation (wiki); perhaps you want to check out wiki:Delaunay Tessellation Field Estimator as well). Voronoi tessellations have been adopted in many scientific/engineering fields to describe spatial distributions, and astronomy is no exception. The Voronoi tessellation has also been used for field interpolation.
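As a hedged illustration of tessellation-based field interpolation (a generic sketch, not a reconstruction of any particular astronomical pipeline), scipy's LinearNDInterpolator triangulates the scattered sample points with a Qhull Delaunay triangulation and interpolates linearly within each triangle:

```python
# Delaunay-based (piecewise linear) interpolation of a scattered scalar field.
# The field and the sampling positions below are synthetic toys.
import numpy as np
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(0)
xy = rng.uniform(-1, 1, size=(200, 2))            # scattered sample positions
field = np.exp(-(xy ** 2).sum(axis=1) / 0.2)      # toy scalar field sampled there

interp = LinearNDInterpolator(xy, field)          # builds a Delaunay triangulation
gx, gy = np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-1, 1, 50))
grid_field = interp(gx, gy)                       # NaN outside the convex hull
```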

van de Weygaert described Voronoi tessellations as follows:

  1. the asymptotic frame for the ultimate matter distribution,
  2. the skeleton of the cosmic matter distribution,
  3. a versatile and flexible mathematical model for weblike spatial pattern, and
  4. a natural asymptotic result of an evolution in which low-density expanding void regions dictate the spatial organization of the Megaparsec universe, while matter assembles in high-density filamentary and wall-like interstices between the voids.

van Lieshout derived explicit expressions for the mean and variance of the Delaunay Tessellation Field Estimator (DTFE) and showed that, for stationary Poisson processes, the DTFE is asymptotically unbiased with a variance proportional to the squared intensity.
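For readers who want to see what the DTFE does in practice, here is a minimal 2-D sketch under the usual convention that the density at a sample point is proportional to (d+1) divided by the total area of the Delaunay triangles sharing that point; the overall normalization and the treatment of boundary points are only indicative here, not taken from the paper.

```python
# Toy DTFE sketch in two dimensions with scipy.spatial.Delaunay.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(2)
pts = rng.standard_normal((500, 2))
tri = Delaunay(pts)

star_area = np.zeros(len(pts))
for simplex in tri.simplices:
    a, b, c = pts[simplex]
    # area of one Delaunay triangle via the 2-D determinant
    area = 0.5 * abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
    star_area[simplex] += area                 # credit it to each of its 3 vertices

d = 2
dtfe = (d + 1) / (len(pts) * star_area)        # indicative normalization only
```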

We have observed voids and filaments of cosmic matter whose patterns have no established theory yet. In general, those patterns are manifested via the observed galaxies, both directly and indirectly. Each observed object, I believe, can be matched to a point that generates a Voronoi polygon; the points represent the polygons, and investigating the distributional properties of these polygons helps in understanding the formation rules and theories behind those patterns. For that matter, various topics in stochastic geometry, not just Voronoi tessellation, can probably be adopted.

There is a plethora of information available on Voronoi tessellations, such as the website of the International Symposium on Voronoi Diagrams in Science and Engineering. Two recent meeting websites are ISVD09 and ISVD08. Also, the following review paper is interesting.

Centroidal Voronoi Tessellations: Applications and Algorithms (1999) Du, Faber, and Gunzburger in SIAM Review, vol. 41(4), pp. 637-676

By the way, you may have noticed my preference for the Voronoi tessellation over the Delaunay, owing to the characteristic of the centroidal Voronoi tessellation that each observation is the center of its own Voronoi cell, as opposed to the Delaunay triangulation, where multiple simplices are associated with one observation/point. However, from the perspective of understanding the distribution of the observations as a whole, both approaches offer summaries and insights in a nonparametric fashion, which is what I value most.

Scatter plots and ANCOVA http://hea-www.harvard.edu/AstroStat/slog/2009/scatter-plots-and-ancova/ Thu, 15 Oct 2009 hlee

Astronomers rely on scatter plots to illustrate correlations and trends among pairs of variables more than scientists in any other field[1]. Pages of scatter plots with regression lines are common, where the slope of the regression line and its error bars serve as indicators of the degree of correlation. Sometimes the sheer number of such scatter plots makes me think that, overall, the resources for drawing nice scatter plots and the pages of the papers where they are printed are wasted. Why not just compute the correlation coefficients and their errors, and publish the processed data used to compute them (not the full data), so that others can verify the computation for the sake of validation? A couple of scatter plots are fine, but when I see dozens of them I lose my focus. This is another cultural difference.

When there are many pairs of variables that would demand numerous scatter plots, one possibility is using parallel coordinates together with a matrix of correlation coefficients. If a Gaussian distribution is assumed, as seems to be the case almost always, particularly when parametrizing measurement errors or fitting physical models, then error bars for these coefficients can also be reported in matrix form. If one considers more complex relationships with multiple tiers of data sets, then one might want to check out ANCOVA (ANalysis of COVAriance) to find out how statisticians structure observations and their uncertainties into a model to extract useful information.
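As a sketch of the first suggestion (a generic illustration with toy data, not a prescription for any particular catalog), a whole page of scatter plots can be condensed into a correlation matrix plus approximate error bars from the textbook large-sample Fisher-z interval:

```python
# Correlation matrix with approximate 95% intervals via the Fisher z transform.
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0, 0],
                            [[1, 0.6, 0.2], [0.6, 1, 0.4], [0.2, 0.4, 1]],
                            size=200)              # toy table: 200 objects, 3 variables
n = X.shape[0]

r = np.corrcoef(X, rowvar=False)                   # p x p correlation matrix
r_off = r.copy()
np.fill_diagonal(r_off, np.nan)                    # diagonal entries are trivially 1
z = np.arctanh(r_off)                              # Fisher z transform
se = 1.0 / np.sqrt(n - 3)                          # large-sample standard error of z
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)   # approximate 95% intervals
```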

I'm not saying that the simple examples from Wikipedia, Wikiversity, or publicly available tutorials on ANCOVA are directly applicable to statistical modeling of astronomical data. Most likely they are not; astrophysics generally handles complicated nonlinear physical models. However, identifying dependent variables, independent variables, latent variables, covariates, response variables, and predictors (to name some jargon of statistical modeling) and defining their relationships in a comprehensive way, as done in ANCOVA, instead of pairing variables for scatter plots, would help to quantify relationships appropriately and to remove artificial correlations. Such spurious correlations appear frequently because of data projection. For example, data points on a circle in the XY plane of 3D space, centered at zero, look like a bar rather than a circle when viewed edge-on, producing an artificially perfect correlation.

As a matter of fact, astronomers are aware of removing these unnecessary correlations via corrections, for example by fitting a straight line or a second-order polynomial for extinction correction. However, I am rarely satisfied with such linear shifts of data with uncertainty, because of the changes they induce in the uncertainty structure. Consider what happens when background subtraction leads to negative values, an unrealistic consequence. Unless probabilistically guaranteed, linear operations require a lot of care. We cannot assume that the residuals y-E(Y|X=x) remain perfectly normal unless the μ and σ of the Gaussian density transform linearly along with the data (about the Gaussian distribution, please see the post why Gaussianity? and the reference therein). An alternative to subtraction is linear approximation or nonparametric model fitting, as we saw through applications of principal component analysis (PCA); PCA is used for whitening and for approximating nonlinear functional data (curves and images). Accounting properly for the sources of uncertainty and their hierarchical structure is not an easy problem, either astronomically or statistically. Nevertheless, identifying the properties of the observations from both physics and statistics and putting them into a comprehensive, structured model could help to establish causality[2] and the significance of correlations better than producing numerous scatter plots with lines from simple regression analysis.

In order to understand why statisticians studied ANCOVA, or ANOVA (ANalysis Of VAriance) in general, in addition to the material in wiki:ANCOVA, you might want to check this page[3] and feed your search engine keywords of interest on top of ANCOVA to narrow down the results.

From the linear model perspective, if a response is considered to be a function of redshift (z), then z becomes a covariate. The significance of this covariate, in addition to other factors in the model, can be tested later when one fully fits/analyzes the statistical model. Suppose one wants to design a model for, say, rotation speed (an indicator of dark matter occupation) as a function of redshift, the degree of spirality, and the number of companions; this is a very hypothetical proposal owing to my lack of knowledge in observational cosmology, and I only want to point out that the model fitting problem can be framed as statistical modeling like ANCOVA by identifying covariates and relationships. Because the covariate z is continuous, the degrees of spirality are a fixed effect (0 to 7, eight levels), and the number of companions is a random effect (cluster size is random), the comprehensive model could be described by ANCOVA. To my knowledge, scatter plots and simple linear regression marginalize over all the additional contributing factors and information, which can be the main contributors to a correlation, even though Y and X may appear highly correlated in the scatter plot. At some point we must marginalize over unknowns. Nonetheless, there remain nuisance parameters and latent variables that can be factored into the model, rather than ignored, to obtain deeper insights and knowledge from observations in many measures/dimensions.
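Purely as an illustration of how that hypothetical model could be written down (every column name below is invented, and a mixed model is only one of several reasonable ways to encode the random effect), an ANCOVA-style specification in statsmodels might look like this:

```python
# Hypothetical ANCOVA-style mixed model: continuous covariate z, fixed factor
# "spirality", and "companions" as a grouping (random-effect) variable.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({
    "speed":      rng.normal(200, 30, n),      # fake rotation speeds
    "z":          rng.uniform(0.0, 0.5, n),    # redshift, the continuous covariate
    "spirality":  rng.integers(0, 8, n),       # fixed factor with 8 levels (0-7)
    "companions": rng.integers(0, 5, n),       # random-effect grouping variable
})

model = smf.mixedlm("speed ~ z + C(spirality)", df, groups=df["companions"])
result = model.fit()
print(result.summary())    # includes a test of the covariate z and the factor levels
```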

Something that, I think, could be done with a small, ergonomic chart or table via hypothesis testing, multivariate regression, model selection, variable selection, dimension reduction, projection pursuit, or other state-of-the-art statistical methods is instead done in astronomy with numerous scatter plots, with colors, symbols, and lines used to account for all possible relationships within pairs whose correlations can be artificial. I also feel that trees, electricity, and effort could be saved from producing nice-looking scatter plots. Fitting and analyzing more comprehensive models cast in a statistical framework helps to identify the independent variables or covariates causing strong correlations, to find collinear variables, and to drop redundant or uncorrelated predictors. Bayes factors or p-values can then be used for comparing models, testing the significance of their variables, and computing error bars appropriately, rather than in the way the null hypothesis probability is usually interpreted.

Lastly, ANCOVA is a complete [MADS]. :)

  1. This is not an absolute statement but a personal impression formed after reading articles from various fields in addition to astronomy. My reading of other fields suggests that many rely on correlation statistics rather than on scatter plots with straight lines drawn through the data to impose relationships within variable pairs.
  2. the way that chi-square fitting is done and the goodness-of-fit test is carried out is understood through the notion that X causes Y and through the practice that the objective function, the sum of (Y-E[Y|X])^2/σ^2, is minimized
  3. It is a web page of Vassar College, which counted the pioneering female astronomer Maria Mitchell among its first faculty. It is said that the first building constructed there was the Vassar College Observatory, now a national historic landmark. This historical connection is my only reason for pointing to this website, to draw some astronomers' attention beyond statistics.
Goodness-of-fit tests http://hea-www.harvard.edu/AstroStat/slog/2009/goodness-of-fit-tests/ Tue, 06 Oct 2009 hlee

When it comes to applying statistics for measuring goodness-of-fit, the Pearson χ2 test is the dominant player in the race and the Kolmogorov-Smirnov test statistic trails far behind. Although the rest of the field seems almost invisible in this race, there are various other nonparametric statistics for testing goodness-of-fit and for comparing a sampling distribution to a reference distribution, legitimate race participants trained by many statisticians. Listing their names will probably be useful to some astronomers when they find that the underlying assumptions of the χ2 test do not match their data. Perhaps some astronomers want to try nonparametric test statistics other than the K-S test; I've seen such test statistics in astronomical journals from time to time. Depending on the data and their statistical properties, one test statistic may work better than another; therefore, it is worthwhile to keep in mind that there are other tests beyond the χ2 goodness-of-fit statistic.

This is the list I can think of at the moment, and each test is linked to Wikipedia for more details.

Before my updates, I welcome your comments that can grow this list. I'd also appreciate it if your comment includes an explanation of when the nonparametric test you recommend works better, along with a short description of your data's characteristics. And don't forget to look at the qq-plot before discussing the implications of p-values from these test statistics.
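For concreteness, here is a minimal sketch (with synthetic data) of running a few of these tests against a fully specified reference distribution and drawing the qq-plot first, as suggested above; note that scipy.stats.cramervonmises requires a reasonably recent SciPy (1.6 or later), and that estimating parameters from the same data would require extra calibration.

```python
# Nonparametric goodness-of-fit tests against a fully specified normal reference.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
x = rng.standard_t(df=5, size=300)          # toy data with heavier tails than Gaussian

print(stats.kstest(x, "norm"))              # Kolmogorov-Smirnov vs N(0, 1)
print(stats.cramervonmises(x, "norm"))      # Cramer-von Mises vs N(0, 1)
print(stats.anderson(x, dist="norm"))       # Anderson-Darling (returns critical values)

stats.probplot(x, dist="norm", plot=plt)    # qq-plot against the normal
plt.show()
```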

[MADS] data depth http://hea-www.harvard.edu/AstroStat/slog/2009/mads-data-depth/ Tue, 02 Jun 2009 hlee

How would you assign orders to multivariate data? If you have a strategy for this ordering task, I'd like to ask: is your strategy affine invariant, i.e., invariant under shifts and rotations?

Let's make the question a bit more concrete. Which would be the largest point in the scatter plot below? Thirty random bivariate data points were generated from a bivariate normal distribution (mu=[1,1], sigma=[[1,-.9],[-.9,1]]) [scatter plot in the original post]. If you have a rule for assigning ranks to some of the largest or smallest points, could you continue to apply it to all data points? Once you have done so, shift your data with respect to the mean (not mu; although I told you mu, the data do not tell us the true mu, but we can estimate mu and assign uncertainty to that estimate via various methods, beyond computing the sample standard deviation or traditional chi-square methods for more complex data and models; the mean, the solution of the chi-square or least-squares method for the hypothesis (model) mu=\bar{X}, is only one estimator among many) and rotate them clockwise by 90 degrees. Does each point keep the same rank when you apply your ranking method to the shifted and rotated data?

First, if you think about coordinate-wise ranking, you'll find that the two sets of ranks do not match (a toy check follows below). Additional questions I'd like to add: which axis would you consider first, X or Y? And how would you handle ties in one coordinate (for a bivariate normal, a continuous distribution, the chance of ties is zero, but real data often show ties)?
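Here is the toy check, a minimal sketch of the thought experiment above with a made-up seed and the same toy covariance: the coordinate-wise ranks are not preserved once the mean-centered data are rotated by 90 degrees.

```python
# Coordinate-wise ranks are not invariant under rotation.
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(6)
mu, sigma = [1, 1], [[1, -0.9], [-0.9, 1]]
X = rng.multivariate_normal(mu, sigma, size=30)

Xc = X - X.mean(axis=0)                       # shift w.r.t. the sample mean
R = np.array([[0, 1], [-1, 0]])               # clockwise rotation by 90 degrees
Xr = Xc @ R.T                                 # (x, y) -> (y, -x)

ranks_before = np.column_stack([rankdata(Xc[:, 0]), rankdata(Xc[:, 1])])
ranks_after = np.column_stack([rankdata(Xr[:, 0]), rankdata(Xr[:, 1])])
print((ranks_before == ranks_after).all())    # typically False: the ordering changed
```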

The notion of data depth is not employed in astronomy, or in many other fields, because the variables carry physical meaning: the multidimensional data cloud is projected onto the most important axis, from which luminosity functions or histograms are drawn to discover a trend, say a clump at a certain color or a certain redshift indicating an event, epoch, or episode in formation history. I wonder whether such clumps still look clustered once they are deprojected into the original multidimensional space. Remember the curse of dimensionality: even though the projected data show bumps, in the multidimensional space a hypothesis test may conclude complete spatial randomness, i.e. that there is no clump. Consider also transformations and distance metrics, or how nearest neighbors are defined. Simply consider computing mathematical distances between pairs of galaxies for statistical analysis: these galaxies have location measures (longitude, latitude, and redshift; 3D spatial data points) as well as scales such as magnitudes, group names, types, shapes, and classification labels, which cannot be plugged directly into a Euclidean distance. The choices of metric and transformation for assigning ranks to data look different from mathematical and algorithmic standpoints than from practical and astronomical viewpoints. The importance of ordering data lies in 1. identifying/removing outliers, 2. characterizing the data distribution quickly (for example, current contour plots can be accelerated), and 3. updating statistics efficiently, by looking only at neighbors instead of processing the whole data set.

The other reason that affine invariant ordering of multivariate data finds no use in astronomy can be attributed to its nonparametric nature. So far, my impression is that nonparametric statistics is relatively unknown territory to astronomers. Unless a nonparametric statistic is designed for hypothesis testing and interval estimation, it does not give the confidence levels/intervals that astronomers always like to report ("how confident is the result of ordering multivariate data?" or "what is the N-sigma uncertainty that this data point is the center, within plus or minus 2?"). The nonparametric bootstrap or jackknife is rarely used to compute uncertainties that would compensate for the fact that a nonparametric statistic only offers a point estimate (a best fit). Regardless of the lack of popularity of these robust point estimators from nonparametric statistics, the notion of data depth and nonparametric methods would, I believe, help astronomers to retrieve summary statistics quickly from the multidimensional, massive, and streaming data of future surveys. This is the primary reason for introducing and advertising data depth in this post.

Data depth is a statistic that provides a ranking of multivariate data. It assigns a numeric value to a vector (data point) based on its centrality relative to a data cloud. A statistical depth satisfies the following conditions (a small sketch of one such depth follows the list):

  1. affine invariance
  2. maximality at the center: the depth-wise median, i.e. the most central point, attains the highest depth/rank
  3. monotonicity: the depth decreases monotonically along any ray from the deepest point
  4. the depth goes to zero as the Euclidean distance of the datum goes to infinity

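As the promised sketch, one concrete depth satisfying these properties is the Mahalanobis depth, D(x) = 1 / (1 + (x - mu)' Sigma^{-1} (x - mu)); plugging in the sample mean and covariance, as below, is a simplifying choice (robust plug-ins are possible), and the data are the toy bivariate normal from above.

```python
# Mahalanobis depth: larger values correspond to more central points.
import numpy as np

def mahalanobis_depth(X):
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    md2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared Mahalanobis distances
    return 1.0 / (1.0 + md2)

rng = np.random.default_rng(7)
X = rng.multivariate_normal([1, 1], [[1, -0.9], [-0.9, 1]], size=30)
depth = mahalanobis_depth(X)
order = np.argsort(depth)[::-1]      # deepest (most central) points first
```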
Nowadays, observing in several bands is more typical than observing in just two or three. Structures such as the bull's eye or the fingers of god are not visible in a 1D projection if location coordinates are considered instead of colors. In order to understand such multidimensional data, as indicated above, this [MADS], data depth, might enhance the efficiency of data pipelining and of computing summary statistics. It is related to handling multivariate outliers and to storing or discarding unnecessary/unscientific data. Data points on the surface of the data cloud, or tiny parasite-like clumps accompanying the important data cloud, are neither visible nor identifiable when the whole data set is projected onto one or two dimensions. We can define a train of data summaries from data depths matching the traditional mean, standard deviation, skewness, and kurtosis measures. If a datum or a tiny data clump does not belong to that set, we can discard those outliers for statistical reasons and separate them from the main archive/catalog. I'd like to remind you that a star of modest color does not necessarily belong to the main sequence: with only a 1D histogram projected onto a color axis, you would not know whether the star is a pre-main-sequence star, a giant, or a white dwarf. This awareness of seeing a data set in multidimensional space and adopting suitable statistics for multivariate data analysis is critical when a query is executed on massive and streaming data; otherwise one obtains biased data due to coordinate-wise trimming. Also, efficient algorithms based on data depth could save wasted CPU time, storage, and the human labor of validation.

We often see data trimming based on univariate projections and N-sigma rules. Such trimming tends to remove precious data points. I want a strategy that removes the parasites, not my long tail: if the vertebral column is the principal axis, the tail and a long nose stick out farther than the parasites swimming near the belly, so coordinate-wise trimming with N-sigma rules will chop off the tail and the long nose first and may or may not remove the parasites later. Without verification by human eyes, virtual observers (people using virtual observatories for their scientific research) are likely to receive data with parts of the head and tail missing while the parasites remain. This analogy applies to field stars versus stellar clusters, dwarf galaxies versus their host, binaries versus single stars, population II versus population I stars, and so on.
Outliers are not necessarily bad guys; important discoveries have been made by looking into outliers. My point is that identifying outliers with respect to the multidimensional data distribution is what matters, rather than projecting the data onto the vertebral axis.

To learn more about data depth, I recommend these papers, long but crisp and delicious to read.

  • Multivariate analysis by data depth: Descriptive statistics, graphics and inference (with discussion)
    R. Liu, J. Parelius, and K. Singh (1999) Ann. Stat. vol. 27, pp. 783-858
    (It’s a comprehensive survey paper on data depth)
  • Mathematics and the Picturing of Data
    Tukey, J.W. (1974) Proceedings of the International Congress of Mathematicians, 2, 523-531
  • The Ordering of Multivariate Data
    Barnett, V. (1976) J. R. Stat. Soc. A, 139(3), 318-355
    (It answered, more than three decades ago, how to order multivariate data.)
  • You might want to check publications by R. Serfling (scroll down a bit).

As we have various distance metrics, there are various statistical depths. I already mentioned one in my [MADS] Mahalanobis distance post. Here is an incomplete list of data depths that I have heard of:

  • Mahalanobis depth
  • simplicial depth
  • halfspace depth (Tukey depth)
  • Oja depth
  • convex hull peeling depth
  • projection depth
  • majority depth

I do not know the definitions of all of these depths or how to compute each of them ("General Notions of Statistical Depth Function" by Zuo and Serfling (2000, Annals of Statistics, 28, pp. 461-482) is very useful for a theoretical understanding of statistical data depth). What I do know is that, except for the Mahalanobis depth, the computational burden of these depths is relatively light; the reason for excluding the Mahalanobis depth is that it requires estimating a large covariance matrix and inverting it. Ordering multivariate data with such depth measures, including assigning medians and quantiles, is statistically more robust than coordinate-wise means or additive errors. Also, removing outliers based on statistical depths makes more statistical sense while being less burdensome for computers and for the humans doing validation. Speaking of contouring, once each point has been assigned a numeric rank, collecting the multivariate data points at a given rank level leads to a quick contour.
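As an illustration of one of the depths listed above, here is a minimal sketch of convex hull peeling with scipy (toy 2-D data only): points on the outermost convex hull get peel level 1, the hull of the remaining points gets level 2, and so on, so the successive hulls act as quick contours of the data cloud.

```python
# Convex hull peeling depth: larger peel level = more central point.
import numpy as np
from scipy.spatial import ConvexHull

def convex_hull_peeling(X):
    level = np.zeros(len(X), dtype=int)
    remaining = np.arange(len(X))
    depth = 0
    while len(remaining) > 2:                 # need at least 3 points for a 2-D hull
        depth += 1
        hull = ConvexHull(X[remaining])
        on_hull = remaining[hull.vertices]
        level[on_hull] = depth                # peel off the current outer hull
        remaining = np.setdiff1d(remaining, on_hull)
    level[remaining] = depth + 1              # whatever is left is the deepest
    return level

rng = np.random.default_rng(8)
X = rng.standard_normal((200, 2))
peel = convex_hull_peeling(X)
```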

The story turned out to be longer than I expected, but the one take-home message is:

Q:How would you assign orders to multivariate data?
A:Statisticians have discussed the notion of data depth to answer the question.

———————————————————————
NOTE: The usages of "depth" in statistics and astronomy are different. In astronomy, depth is associated with the distance that a telescope can reach; in fact, it is a euphemism. To be more precise, the depth of a telescope is roughly defined by how dim and distant a source can be before the telescope can no longer catch its photons. Deep sky surveys imply that the telescope(s) in the survey project can see deeper, i.e. dimmer objects, than previous ones. In statistics, depth is more abstract and conceptual, based on its definition: data depth is a multivariate version of the quantile of univariate data, for which one can easily order data points from smallest to largest.

Robust Statistics http://hea-www.harvard.edu/AstroStat/slog/2009/robust-statistics/ Mon, 18 May 2009 hlee

My understanding of "robustness" from my education in statistics and my understanding from communicating with astronomers seem to share little common ground. Can anyone help me build a robust bridge across this abyss?

First, since it’s short, let’s quote a comment from an astronomer that might reflect the notion of robust statistics in astronomy.

Bayesian is robust.

Is every Bayesian method robust while its counterpart from classical statistics is not? I know that the popular statistics in astronomy are, generally speaking, not robust, and that those popular statistics were born before the notion of robustness in statistics was recognized and discussed.

I do understand why such a partisan comment was produced. Astronomers always report their data analysis results via best fits, error bars, probabilities, or significance levels (they don't explicitly say p-values, power, type I or type II errors, unbiased estimates, or other statistical jargon of inference problems), and the classical methods in frequent use have caused frustration because of their lack of robustness. By contrast, MCMC algorithms for estimating posterior distributions produce easily interpretable results for reporting a best fit (the mode) and an error bar (the HPD interval).

My understanding of robustness as a statistician does not draw a line between Bayesians and frequentists. The following is quoted from the website of the Katholieke Universiteit Leuven, whose mathematics department has a focus group on robust statistics.

Robust statistical methods and applications.
The goal of robust statistics is to develop data analytical methods which are resistant to outlying observations in the data, and hence which are also able to detect these outliers. Pioneering work in this area has been done by Huber (1981), Hampel et al. (1986) and Rousseeuw and Leroy (1987). In their work, estimators for location, scale, scatter and regression play a central role. They assume that the majority of the data follow a parametric model, whereas a minority (the contamination) can take arbitrary values. This approach leads to the concept of the influence function of an estimator which measures the influence of a small amount of contamination in one point. Other measures of robustness are the finite-sample and the asymptotic breakdown value of an estimator. They tell what the smallest amount of contamination is which can carry the estimates beyond all bounds.

Nowadays, robust estimators are being developed for many statistical models. Our research group is very active in investigating estimators of covariance and regression for high-dimensional data, with applications in chemometrics and bio-informatics. Recently, robust estimators have been developed for PCA (principal component analysis), PCR (principal component regression), PLS (partial least squares), classification, ICA (independent component analysis) and multi-way analysis. Also robust measures of skewness and tail weight have been introduced. We study robustness of kernel methods, and regression quantiles for censored data.

My understanding of "robustness" from my statistics education is all-encompassing: it covers both Bayesian and frequentist methods. Any method or model that is insensitive or immune to outliers is a robust method or statistic. For example, the median is more robust than the mean, since the breakdown point of the median is 1/2 while that of the mean is 0, asymptotically. Both the mean and the median are estimable by Bayesian or frequentist methods. Instead of the standard deviation, tactics like the lower and upper quartiles for indicating error bars, or Winsorization (or trimming) for obtaining a standard deviation for the error bar, are adopted regardless of whether one is Bayesian or frequentist. Instead of chi-square goodness-of-fit tests, which assume Gaussian residuals, nonparametric or distribution-free tests can be utilized.
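A tiny sketch of these tactics on contaminated toy data (the contamination fraction and cut-offs below are arbitrary choices for illustration, not recommendations):

```python
# Compare classical and robust summaries on data with a few gross outliers.
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(9)
x = np.concatenate([rng.normal(0, 1, 97), [15.0, 20.0, 30.0]])   # 3% gross outliers

print("mean        :", x.mean())                        # dragged away by the outliers
print("median      :", np.median(x))                    # breakdown point 1/2
print("trimmed mean:", stats.trim_mean(x, 0.1))         # 10% trimming on each side
print("winsor. std :", winsorize(x, limits=[0.05, 0.05]).std())  # Winsorized scale
print("quartiles   :", np.percentile(x, [25, 75]))      # quartile-based "error bar"
```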

The notion that frequentist methods are not robust may have developed from the frustration that the chi-square-related methods used in astronomy do not show robust behavior; the reason is that the data are prone to violations of the normality (Gaussianity) assumption. Also, the limited library of nonparametric methods in data analysis packages and software reinforces the impression that frequentist methods are not robust. An additional unattractive aspect of frequentist methods is that their descriptions seem too mathematical, too abstract, and too difficult to codify, full of unfriendly jargon, whereas Bayesian methods provide step-by-step modeling procedures with an explanation of why a particular likelihood and particular priors were chosen, based on external knowledge from physics and observations (the MCMC algorithms in astronomical papers are adaptations of already-proven algorithms from statistics and algorithms journals).

John Tukey said:

Robustness refers to the property of a procedure remaining effective even in the absence of usual assumptions such as normality and no incorrect data values. In simplest terms the idea is to improve upon the use of the simple arithmetic average in estimating the center of a distribution. As a simple case one can ask: Is it ever better to use the sample median than the sample mean, and if so, when?

I don't think the set of frequentist methods is the complement of the set of Bayesian ones. Personally, I feel quite embarrassed whenever I am told that frequentist methods are not robust compared to Bayesian methods. Bayesian methods become robust when a priori knowledge (subjective priors), combined with a general class of likelihoods, allows the results to be unaffected by outliers. Whether frequentist or Bayesian, statistics have been developed to be less sensitive to outliers and to make optimal inferences, i.e. to achieve the goal of robustness.

Ah… there are various other accounts of robust methods/statistics in astronomy, not limited to "Bayesian is robust." Just as I often get confused whenever I see "statistically rigorous" in the literature when the method is a simple chi-square minimization, I believe astronomers can educate me about the notion of robustness and the definition of robust statistics in various ways. I hope for some sequels to this post of limited view, from astronomers, about robustness and robust statistics.

[MADS] Semiparametric http://hea-www.harvard.edu/AstroStat/slog/2009/mads-semiparametric/ Mon, 09 Feb 2009 hlee

There were (only) four articles in ADS whose abstracts contain the word semiparametric (none in titles). Therefore, semiparametric is not exactly [MADS] but almost [MADS]; one would like to say it is virtually [MADS] or quasi-[MADS]. By introducing the term and providing the rare examples from astronomy, I hope this scarce term, semiparametric, will be used adequately, rather than misguiding astronomers into inappropriate usage in statistical inference with their data.

  • [2006MNRAS.369.1334S]: a semiparametric technique based on a maximum likelihood (ML) approach and Voronoi tessellation (VT). I also wonder whether the cluster detection algorithm of Section 3.3 works similarly to source detection algorithms in high energy astrophysics, where tight photon clusters indicate sources. By the way, what is the definition of a source? Depending on the definition, the right detection thresholds would change; it seems that (brute-force) Monte Carlo simulations, i.e. empirical approaches, are employed for setting the thresholds. Please note that my questions are tangential to this paper, which I enjoyed reading very much.
  • [2004MNRAS.347.1241S]: similar to the above because of the same methodology: ML, VT, and a color slide/filter for cluster detection
  • [2002AJ....123.1807G]: the cut-and-enhance (CE) cluster detection method. From the abstract: The method is semiparametric, since it uses minimal assumptions about cluster properties in order to minimize possible biases. No assumptions are made about the shape of clusters, their radial profile, or their luminosity function. I rather wish they had used "nonparametric," which seems more proper in a statistical sense than "semiparametric," judging from their methodology description.
  • [2002A%26A...383.1100N]: statistics-related keywords: time series; discrete Fourier transform; long range dependence; log-periodogram regression; ordinary least squares; generalized least squares. The semiparametric method section seems too short: detailed accounts are replaced by references to papers from the Annals of Statistics. Among the 31 references, 15 are from statistics journals, and without reading them, average readers will have no chance to understand the semiparametric approach.

You might want to check out wiki:Semiparametric for the statistics standpoint on semiparametric models.

The following are books related to semiparametric methods that I checked out from libraries some years back, from which you could get more information about semiparametric statistics. Unfortunately, the applications and examples in these books rely heavily on subjects such as public health (epidemiology), bioinformatics, and econometrics.

  • Ruppert, Wand, and Carroll (2003) Semiparametric Regression, Cambridge University Press
  • Härdle, Müller, Sperlich, and Werwatz (2004) Nonparametric and Semiparametric Models, Springer
  • Horowitz (1998) Semiparametric Methods in Econometrics (Lecture Notes in Statistics), Springer

There seem to be more recent publications, from 2007 and 2008, about semiparametric methods, targeting diverse but focused readerships, but I have had no opportunity to look at them. I just want to point out that on many occasions we find that fully parametrizing a model is not necessary, yet the nuisance parameters determine the shape of the sampling distribution needed for accurate statistical inference. The semiparametric methods described in the papers above are very limited from a statistics viewpoint. Astronomers could take far more advantage of the various semiparametric strategies. There is plenty of room for developing semiparametric approaches to astronomical data analysis and to inference about the parameters of interest; this territory is almost unexplored.

missing data http://hea-www.harvard.edu/AstroStat/slog/2008/missing-data/ Mon, 27 Oct 2008 hlee

The notions of missing data are quite different between the two communities. I tend to think that missing data carry as much information as observed data. As for astronomers, I am not sure how they think, but my impression so far is that when one attribute/variable of an object/observation/informant is missing, all the other attributes of that object become useless, because the object is dropped from the scientific data analysis or model evaluation process. For example, it is hard to find any discussion of imputation in astronomical publications, or any statistical justification of how missing data relate to the inference strategy. Instead, astronomers talk about incompleteness in individual variables. To turn this vague argument into a concrete example, consider a catalog with multiple magnitudes. To draw a color-magnitude diagram, one needs both a color and a magnitude. If one attribute is missing, that star will not appear in the color-magnitude diagram, and any inference from that diagram will not include that star. Nonetheless, one still tries to understand what proportions of stars are observed at different colors and magnitudes.

I guess this cultural difference originates from the quality and quantity of the data. Given the typical size of the data sets that statisticians handle, a child could count the number of data points. For astronomical data, only rounded numbers of stars in a catalog are discussed, and dropping some objects with missing values will not affect the final results.

Introducing how statisticians handle missing data may benefit astronomers who work with small catalogs due to the observational challenges of a survey. Such data with missing values can then be put through statistically rigorous data analysis processes instead of ad hoc procedures for obtaining complete cases, which risk throwing away many data points.

In statistics, utilizing the information in missing data strengthens the inference in the direction the method is trying to go. Even if they are larger, it is better to have error bars than nothing. My question is: what are the statistical proposals for astronomers to handle missing data? Since I cannot yet offer such a list, I instead give a few somewhat nontechnical papers that explain the missing data types defined in statistics, along with a few statistics books/articles that statisticians often cite.

  • Data Mining and the Impact of Missing Data by M. L. Brown and J. F. Kros, Industrial Management and Data Systems (2003) Vol. 103, No. 8, pp. 611-621
  • Missing Data: Our View of the State of the Art by J. L. Schafer and J. W. Graham, Psychological Methods (2002) Vol. 7, No. 2, pp. 147-177
  • Missing Data, Imputation, and the Bootstrap by B. Efron, JASA (1994) Vol. 89, No. 426, pp. 463-475, and D. B. Rubin's comment
  • The Multiple Imputation FAQ Page (web) by J. Schafer
  • Statistical Analysis with Missing Data by R. J. A. Little and D. B. Rubin (2002) 2nd ed. New York: Wiley
  • The Curse of the Missing Data (web) by Yong Kim
  • A Review of Methods for Missing Data by T. D. Pigott, Edu. Res. Eval. (2001) 7(4), pp. 353-383 (a survey of missing data analysis strategies, illustrated with "asthma data")

Pigott discusses missing data methods for a general audience in plain terms under the following categories: complete-case analysis, available-case analysis, single-value imputation, and the more recent model-based methods, maximum likelihood for multivariate normal data, and multiple imputation. Readers craving more information should see Schafer and Graham, or the books by Schafer (1997) and Little and Rubin (2002).

Most introductory articles begin with common assumptions like missing at random (MAR) or missing completely at random (MCAR), but these seem not to apply to typical astronomical data sets (I don't know exactly why yet, and I cannot provide counterexamples to prove it, but that is what I have observed and been told). Currently, I would like to find ways to link statistical thinking about missing data and its modeling to astronomical data with missing entries, by discovering commonalities in their missingness properties. I hope you can help me and others in such efforts. For your information, the following are short definitions of these assumptions:

  • data missing at random : missing for reasons related to completely observed variables in the data set
  • data missing completely at random : the complete cases are a random sample of the originally identified set of cases
  • non-ignorable missing data : the reasons for the missing observations depend on the values of those variables.
  • outliers treated as missing data
  • the assumption of an ignorable response mechanism.

Statistical research is traditionally conducted under circumstances where complete data are available, and the goal is to characterize the inferences produced by missing-data methods by comparing the results from the data with complete information against the results after dropping observations on the variables of interest. Simulations make it possible to emulate these different kinds of missingness. A practical astronomer may question such comparisons and the simulation of missing data; in real applications such a step is not necessary, but for the sake of statistical/theoretical validation and the approval of new missing data analysis methods, the comparison between results from complete data and from missing data is unavoidable.

Contrary to my belief that statistical analysis with missing data is applied universally, it seems so far that only regression-type strategies can cope with missing data, despite the diverse categories of missingness. In many multivariate data analyses in astronomy, the relationship between response variables and predictors is not clear; more frequently, responses do not exist and the joint distribution of the given variables is what matters. Without knowing the data-generating distribution/model, analyzing arbitrarily built models with missing data, for imputation and for estimation, seems biased. This gap in handling different data types is the motivation for introducing statistical missing data analysis to astronomers, even though the statistical strategies for handling missing data may seem very limited. I believe, however, that some "new" concepts in missing data analysis can be salvaged, such as the assumptions for analyzing data with an underlying multivariate normal distribution, favored by astronomers, many of whom apply principal component analysis (PCA) nowadays. Understanding the conditions for the multivariate normal distribution and for missing data more rigorously leads astronomers to project their data analysis onto the space of regression analysis, since numerous survey projects, in addition to the emergence of new catalogs, pose questions about relationships among observed variables or estimated parameters. The broad area of regression analysis embraces missing data in various ways; likewise, vast astronomical surveys and catalogs need to move forward in adopting proper data analysis tools that include missing data, since finding relationships among variables empirically, rather than from the laws of physics, is the scientific objective of surveys, and missing data are not ignorable. I think that tactics from missing data analysis will allow steps forward in astronomical data analysis and its statistical inference.

Statisticians, or other scientists who use statistics, might name the strategies of missing data analysis slightly differently; my way of summarizing the strategies described in the texts above is as follows:

  • complete case analysis (caveat: relatively few cases may be left for the analysis, and MCAR is assumed),
  • available case analysis (pairwise deletion, deleting selected variables; caveat: correlations are computed on different subsets of cases for different variable pairs),
  • single-value imputation (typically the mean value is imputed, causing biased results and an underestimated variance; not recommended),
  • maximum likelihood, and
  • multiple imputation (the last two rest on two assumptions: multivariate normality and an ignorable missing data mechanism)

and the following are imputation strategies:

  • mean substitution,
  • case substitution (scientific knowledge authorizes the substitution),
  • hot deck imputation (values drawn from the most similar case in the same data set, with the difficulty of defining what is "similar"),
  • cold deck imputation (values drawn from an external source, such as a previous survey),
  • regression imputation (prediction from the independent variables; mean imputation is a special case), and
  • multiple imputation

Some might prefer the following listing (adapted from Gelman and Hill's regression analysis book); a toy sketch of two of these imputations follows the list:

  • simple missing data approaches that retain all the data
    1. mean imputation
    2. last value carried forward
    3. using information from related observations
    4. indicator variables for missingness of categorical predictors
    5. indicator variables for missingness of continuous predictors
    6. imputation based on logical values
  • random imputation of a single variable
  • imputation of several missing variables
  • model-based imputation
  • combining inferences from multiple imputations
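Here is the toy sketch, contrasting complete-case analysis with mean imputation and regression imputation for a single magnitude column; the column names, the missingness fraction, and the linear relation are all invented for illustration and are not a recommendation for any particular catalog.

```python
# Complete-case analysis vs. mean imputation vs. regression imputation (toy data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
n = 500
g = rng.normal(20, 1, n)
r = g - rng.normal(0.5, 0.2, n)                      # correlated second band
r[rng.random(n) < 0.2] = np.nan                      # 20% of r-band entries missing
cat = pd.DataFrame({"g_mag": g, "r_mag": r})

complete = cat.dropna()                              # complete-case analysis

mean_imp = cat.copy()
mean_imp["r_mag"] = mean_imp["r_mag"].fillna(cat["r_mag"].mean())   # mean imputation

# regression imputation: predict the missing r_mag from g_mag using observed rows
obs = cat.dropna()
slope, intercept = np.polyfit(obs["g_mag"], obs["r_mag"], 1)
reg_imp = cat.copy()
miss = reg_imp["r_mag"].isna()
reg_imp.loc[miss, "r_mag"] = intercept + slope * reg_imp.loc[miss, "g_mag"]
```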

Explicit assumptions are acknowledged in statistical missing data analysis, in contrast to the subjective data processing used to obtain a complete data set. I often see discrepancies between plots in astronomical journals and the linked catalogs in which the missing data, including outliers, reside but which, through a subjective data cleaning step, never appear in the plots. Statistics, on the other hand, states its assumptions and conditions about missing data explicitly. However, I don't know what is proper or correct from a scientific viewpoint; such explication does not exist, and judgments about the assumptions on missing data and their processing are left to astronomers. Moreover, astronomers have advantages, like knowledge of the physics, for imputing data more suitably and subtly.

Schafer and Graham put it this way: with or without missing data, the goal of a statistical procedure should be to make valid and efficient inferences about a population of interest, not to estimate, predict, or recover missing observations, nor to obtain the same results that we would have seen with complete data.

The following quote from the above web link (Y. Kim) says more.

Dealing with missing data is a fact of life, and though the source of many headaches, developments in missing data algorithms for both prediction and parameter estimation purposes are providing some relief. Still, they are no substitute for critical planning. When it comes to missing data, prevention is the best medicine.

Missing entries in astronomical catalogs are unpreventable; therefore, statistically improved strategies are needed more than ever, because as the volume of surveys and catalogs increases, proportionally more missing data reside in them. Otherwise, the current practice of using only complete data (getting rid of all observations with at least one missing entry) would be the only way to go. There is more room left to discuss strategies case by case, which will come in future posts; this one is already too long.

[ArXiv] 2nd week, Nov. 2007 http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-2nd-week-nov-2007/ Fri, 09 Nov 2007 hlee

There should be at least one paper here that catches your attention. Various statistics topics appeared on astro-ph this week.

  • [astro-ph:0711.0330] Assessing statistical significance of periodogram peaks by R. V. Baluev
  • [astro-ph:0711.0435] Bayesian approach for g-mode detection, or how to restrict our imagination by T. Appourchaux
  • [astro-ph:0711.0500] A search for transiting extrasolar planet candidates in the OGLE-II microlens database of the galactic plane by I. Snellen et al.
  • [astro-ph:0711.0537] A new wavelet-based approach for the automated treatment of large sets of lunar occultation data by O. Fors et al.
  • [astro-ph:0711.0962] Photometric Redshift Error Estimators by H. Oyaizu et al.
  • [astro-ph:0711.0270] Two-point correlation properties of stochastic “cloud processes” by A. Gabrielli and M. Joyce
  • [stat.TH:0711.0883] Data-driven wavelet-Fisz methodology for nonparametric function estimation by P. Fryzlewicz
  • [stat.ME:0711.0458] Bayesian finite mixtures: a note on prior specification and posterior computation by A. Nobile
  • [astro-ph:0711.0989] Color Dependence in the Spatial Distribution of Satellite Galaxies by J. Chen
[ArXiv] An unbiased estimator, May 29, 2007 http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-unbiased-estimator/ Tue, 30 Oct 2007 hlee

From arxiv/astro-ph:0705.4199v1
In search of an unbiased temperature estimator for statistically poor X-ray spectra
A. Leccardi and S. Molendi

There was a delay in writing about this paper, which by accident was lying under a pile of papers irrelevant to astrostatistics. (It has been quite overwhelming to track papers from arxiv:astro-ph with various statistical applications, as well as papers with room left for statistical improvement.) Although there is already a posting about this paper (see Vinay's posting), I'd like to give it a shot. I was very excited because I had not seen any astronomical paper devoted solely to unbiased estimators.

In the same spirit as the authors' discussion of bias in the χ^2 method and the maximum likelihood estimator, we know that the χ^2 function is not always symmetric enough to justify applying Δχ^2 = 1 for a 68% confidence interval. The nominal-level interval from the Δχ^2 method does not always provide nominal coverage when the model being fitted does not satisfy the (regularity) conditions for the χ^2 approximation. The χ^2 best fit does not always converge (in probability or almost surely) to the true parameter, i.e. it is biased, so that the stated coverage level misrepresents the information about the true parameter. The authors illustrate the existence of bias in traditional estimators in high energy astronomy and then propose unbiased (temperature) estimators via (variable) transformation.

Transformation is one way of reducing bias (e.g. the Box-Cox or power transformation is common practice in introductory statistics to make residuals more homogeneous); a transformation can turn an asymmetric distribution into an (asymptotically) symmetric one. In contrast to the authors' comment that the parametric bootstrap achieved no improvement in bias reduction, reducing the bias of likelihood-based estimates (Cash statistics) can be attempted with statistical subsampling methods such as cross-validation, the jackknife, and the bootstrap, given a careful design of the subsampling scheme (instead of the parametric bootstrap, the nonparametric bootstrap could yield a different conclusion). Penalized likelihood, with an L_1-norm penalty instead of the L_2 norm (the χ^2 measure is L_2), may help to reduce bias as well.
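To make the subsampling remark concrete, here is a generic sketch of nonparametric-bootstrap bias correction for an arbitrary estimator; this is not the estimator, data, or procedure of Leccardi and Molendi, it only illustrates the idea.

```python
# Nonparametric bootstrap bias correction: theta_corrected = theta_hat - bias_hat,
# with bias_hat = mean(bootstrap replicates) - theta_hat.
import numpy as np

def bootstrap_bias_correct(data, estimator, n_boot=2000, seed=None):
    rng = np.random.default_rng(seed)
    theta_hat = estimator(data)
    boot = np.array([estimator(rng.choice(data, size=len(data), replace=True))
                     for _ in range(n_boot)])
    bias = boot.mean() - theta_hat
    return theta_hat - bias

# toy example: the ddof=0 sample standard deviation is biased downward
rng = np.random.default_rng(11)
x = rng.normal(0, 2, size=30)
print(x.std(ddof=0), bootstrap_bias_correct(x, lambda d: d.std(ddof=0)))
```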

One useful discussion about unbiased estimators is the comparison between the χ^2 best-fit method and the Cash statistic (the maximum Poisson likelihood estimator). Overall, the Cash statistic outperforms the χ^2 best-fit method, but neither method overcomes the bias caused by low counts, short exposure times, the background level, and the asymmetric pdf (probability density function) of T (temperature), their parameter of interest. Their last step toward an unbiased estimator was a nonparametric approach: constructing a mixture of three pdfs to estimate the uncertainty. They concluded that the results from the mixing distributions were excellent; such mixing reduces errors by averaging. Personally, their claim of "the only method that returns the expected temperature under very different situations" seems overstated. Either designing more efficient mixing distributions (unequal weights for the triplet instead of equal weights) or defining M-estimators informed by an understanding of the three EPIC instruments would yield a better degree of unbiasedness.

Note that the maximum likelihood estimator (MLE) is a consistent estimator (asymptotically unbiased) under milder regularity conditions than the χ^2 best-fit estimator. Instead of stating that the MLE can be biased, it would have been better to discuss whether those regularity conditions suit source models built on Poisson photon counts for estimating temperatures, and the XSPEC estimation procedures.

Last, I’d like to quote their question as it is:

What are the effects of pure statistical uncertainties in determining interesting parameters of highly non linear models (e.g. the temperature of the ICM), when we analyze spectra accumulated from low surface brightness regions using current X-ray experiments?

Although the authors tried to answer this question, my personal opinion is that they were not able to answer it fully, and that they left ample room for estimating statistical uncertainty and bias rigorously in high energy astrophysics with more statistical care (e.g. instead of the MLE or the Cash statistic, we could develop more robust yet unbiased M-estimators).

Astrostatistics: Goodness-of-Fit and All That! http://hea-www.harvard.edu/AstroStat/slog/2007/astrostatistics-goodness-of-fit-and-all-that/ Wed, 15 Aug 2007 hlee

During the International X-ray Summer School, as a project presentation, I tried to explain the inadequate practice of χ^2 statistics in astronomy. If your best fit is biased (any misidentification of the model easily causes such bias), do not use χ^2 statistics to obtain a 1σ error bar with a 68% chance of capturing the true parameter.

Later, I decided to investigate the subject further, and this paper came along: Astrostatistics: Goodness-of-Fit and All That! by Babu and Feigelson.

First, the authors point out that the χ^2 method 1) is inappropriate when errors are non-Gaussian, 2) does not provide clear decision procedures between models with different numbers of parameters or between acceptable models, and 3) may make it difficult to obtain confidence intervals on parameters when complex correlations between the parameters are present. As remedies, they introduce distribution-free tests such as the Kolmogorov-Smirnov (K-S), Cramér-von Mises (C-vM), and Anderson-Darling (A-D) tests. Among these distribution-free tests, the K-S test is well known to astronomers, but it is often ignored that the results of these tests become unreliable when the data come from a multivariate distribution. Furthermore, the K-S test fails when the same data set is used both for estimating parameters and for computing the empirical distribution function.

The authors propose resampling schemes to overcome the above shortcomings, demonstrating both parametric and nonparametric bootstrap methods, and then advance to model comparison, particularly for models that are not nested. The best-fit model can be chosen among the candidate models based on their distances (e.g. the Kullback-Leibler distance) to the unknown, hypothetical true model.
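As a minimal sketch of the parametric-bootstrap remedy for the last K-S caveat (a generic, Lilliefors-type calibration on toy Gaussian data, not the specific recipes of Babu and Feigelson), the reference distribution's parameters are refitted inside every bootstrap replicate so that the p-value accounts for the estimation step:

```python
# Parametric bootstrap calibration of the K-S statistic when the reference
# distribution's parameters are estimated from the same data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
x = rng.normal(3.0, 1.5, size=200)                 # toy data

mu, sig = x.mean(), x.std(ddof=1)                  # parameters fitted to the data
d_obs = stats.kstest(x, "norm", args=(mu, sig)).statistic

n_boot, d_boot = 1000, []
for _ in range(n_boot):
    xb = rng.normal(mu, sig, size=len(x))          # simulate from the fitted model
    mub, sigb = xb.mean(), xb.std(ddof=1)          # refit inside the replicate
    d_boot.append(stats.kstest(xb, "norm", args=(mub, sigb)).statistic)

p_value = np.mean(np.array(d_boot) >= d_obs)       # calibrated p-value
```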
