The AstroStat Slog » MADS
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

Scatter plots and ANCOVA
http://hea-www.harvard.edu/AstroStat/slog/2009/scatter-plots-and-ancova/
Thu, 15 Oct 2009 23:46:14 +0000 hlee

Astronomers rely on scatter plots to illustrate correlations and trends among pairs of variables more than scientists in any other field[1]. Pages of scatter plots with regression lines are common, where the slope of the regression line and its error bars serve as indicators of the degree of correlation. Sometimes there are so many of these scatter plots that I wonder whether the resources spent on drawing them, and the journal pages on which they are printed, are wasted. Why not just compute correlation coefficients and their errors and publish the processed data used for computing them, not the full data, so that others can verify the results? A couple of scatter plots are fine, but when I see dozens of them I lose my focus. This is another cultural difference.

When there are many pairs of variables that demand numerous scatter plots, one possibility is using parallel coordinates and a matrix of correlation coefficients. If a Gaussian distribution is assumed, which seems to be the case almost always, particularly when parametrizing measurement errors or fitting physical models, then the error bars of these coefficients can also be reported in matrix form. If one considers more complex relationships with multiple tiers of data sets, then one might want to check ANCOVA (ANalysis of COVAriance) to find out how statisticians structure observations and their uncertainties into a model to extract useful information.

I’m not saying those simple examples from Wikipedia, Wikiversity, or publicly available tutorials on ANCOVA are directly applicable to statistical modeling for astronomical data. Most likely they are not: astrophysics generally handles complicated nonlinear physical models. However, identifying dependent variables, independent variables, latent variables, covariates, response variables, and predictors (to name some statistical modeling jargon), and defining their relationships in a rather comprehensive way as ANCOVA does, instead of pairing variables for scatter plots, would help to quantify relationships appropriately and to remove artificial correlations. Such spurious correlations appear frequently because of data projection. For example, data points on a circle in the XY plane of 3D space, centered at the origin, look as if they form a bar rather than a circle when viewed edge-on, which can produce an artificially perfect correlation.

As a matter of fact, astronomers are aware of the need to remove these unnecessary correlations via corrections, for example fitting a straight line or a second-order polynomial for extinction correction. However, I am rarely satisfied with such linear shifts of data with uncertainty, because they change the uncertainty structure. Consider what happens when background subtraction leads to negative values, an unrealistic consequence. Unless it is probabilistically guaranteed, a linear operation requires a lot of care: residuals y-E[Y|X=x] are guaranteed to remain normal only if the μ and σ of the Gaussian density can be operated on linearly (about the Gaussian distribution, please see the post why Gaussianity? and the references therein). An alternative to subtraction is linear approximation or nonparametric model fitting, as we saw in applications of principal component analysis (PCA); PCA is used for whitening and for approximating nonlinear functional data (curves and images). Treating the sources of uncertainty and their hierarchical structure properly is not an easy problem, astronomically or statistically. Nevertheless, identifying the properties of the observations from both physics and statistics and putting them into a comprehensive, structured model could reveal causality[2] and the significance of correlations better than presenting numerous scatter plots with lines from simple regression analysis.
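As a concrete illustration of the whitening step mentioned above, here is a minimal R sketch using the base function prcomp() on simulated two-band data; the variable names and numbers are invented for the example, not taken from any survey.

  # simulate two correlated "magnitudes" (hypothetical data)
  set.seed(1)
  n  <- 500
  x1 <- rnorm(n)
  x2 <- 0.8 * x1 + rnorm(n, sd = 0.3)
  X  <- cbind(mag1 = x1, mag2 = x2)

  # PCA via prcomp(); dividing the scores by the component standard deviations whitens the data
  pc <- prcomp(X, center = TRUE, scale. = FALSE)
  whitened <- sweep(pc$x, 2, pc$sdev, "/")   # unit variance, zero correlation

  round(cor(X), 2)         # strong correlation between the raw bands
  round(cor(whitened), 2)  # approximately the identity matrix after whitening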

In order to understand why statisticians developed ANCOVA and, more generally, ANOVA (ANalysis Of VAriance), in addition to the material in wiki:ANCOVA you might want to check this page[3] and to feed your search engine keywords of interest on top of ANCOVA to narrow down the results.

From the linear model perspective, if a response is considered to be a function of redshift (z), then z becomes a covariate. The significance of this covariate, in addition to other factors in the model, can be tested later when one fully fits/analyzes the statistical model. Suppose one wants to model, say, rotation speed (an indicator of dark matter content) as a function of redshift, degree of spirality, and number of companions; this is a very hypothetical proposal, owing to my lack of knowledge in observational cosmology, and I only want to point out that the fitting problem can be viewed as statistical modeling in the spirit of ANCOVA by identifying covariates and relationships. Because the covariate z is continuous, the degree of spirality is a fixed effect (levels 0 to 7, i.e. eight levels), and the number of companions is a random effect (the cluster size is random), the comprehensive model could be described by ANCOVA, as in the sketch below. To my knowledge, scatter plots and simple linear regression marginalize over all the additional contributing factors and information, which can be the main drivers of the correlation, even though Y and X may appear highly correlated in the scatter plot. At some point we must marginalize over unknowns. Nonetheless, there remain nuisance parameters and latent variables that can be factored into the model, rather than ignored, to obtain deeper insight and knowledge from observations in many measures/dimensions.
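To make the ANCOVA-style formulation concrete, here is a minimal R sketch of such a model on a made-up data frame; the column names, the simulated values, and the choice of the lme4 package are my assumptions for illustration, not part of the original proposal.

  # hypothetical catalog: rotation speed vs. redshift, spirality class, companions
  # install.packages("lme4")   # assumed package for mixed models
  library(lme4)

  set.seed(101)
  gal <- data.frame(
    speed     = rnorm(200, mean = 200, sd = 30),            # km/s, fake values
    z         = runif(200, 0, 0.1),                         # continuous covariate
    spirality = factor(sample(0:7, 200, replace = TRUE)),   # fixed effect, eight levels
    ncomp     = factor(sample(1:10, 200, replace = TRUE))   # random effect (grouping)
  )

  # ANCOVA-like mixed model: continuous covariate + fixed effect + random intercept
  fit <- lmer(speed ~ z + spirality + (1 | ncomp), data = gal)
  summary(fit)   # estimates for z and spirality, variance of the random effect
  anova(fit)     # ANOVA table for the fixed effects

Since the toy data carry no real group structure, the random-effect variance will come out near zero here; the point is only the model formula.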

What, I think, could be done with a small, ergonomic chart or table via hypothesis testing, multivariate regression, model selection, variable selection, dimension reduction, projection pursuit, or other state-of-the-art statistical methods is instead done in astronomy with numerous scatter plots, with colors, symbols, and lines, to account for all possible pairwise relationships whose correlations can be artificial. I also feel that trees, electricity, and effort could be saved by not producing so many nice-looking scatter plots. Fitting/analyzing more comprehensive models cast in a statistical fashion helps to identify the independent variables or covariates causing strong correlations, to find collinear variables, and to drop redundant or uncorrelated predictors. Bayes factors or p-values can then be assessed for comparing models, testing the significance of their variables, and computing error bars appropriately, not in the way that the null hypothesis probability is usually interpreted.

Lastly, ANCOVA is a complete [MADS]. :)

  1. This is not an absolute statement but a personal impression after reading articles from various fields in addition to astronomy. My reading of other fields suggests that many rely on correlation statistics rather than on scatter plots with straight lines drawn through the data to impose relationships within variable pairs.
  2. The way that chi-square fitting is done and the goodness-of-fit test is carried out reflects the notion that X causes Y and the practice of minimizing the objective function, the sum of (Y-E[Y|X])^2/σ^2.
  3. It is the website of Vassar College, which had the first female faculty member in astronomy, Maria Mitchell. It is said that the first building constructed on campus was the Vassar College Observatory, now a national historic landmark. This historical note is the only reason for pointing to this website, to draw some astronomers' attention beyond statistics.
[MADS] logistic regression
http://hea-www.harvard.edu/AstroStat/slog/2009/mads-logistic-regression/
Tue, 13 Oct 2009 20:15:08 +0000 hlee

Although some time has elapsed since my post on space weather, which noted that logistic regression is used there for prediction, it still appears that logistic regression is rarely used in astronomy. Otherwise it could have been used for similar purposes, perhaps not under the same statistical jargon but within Bayesian modeling procedures.

Maybe some astronomers want to check out this versatile statistical method, wiki:logistic regression, to see whether their data can be fit with it in order to model/predict observation rates, unobserved rates, undetected rates, detected rates, absorbed rates, and so on in terms of what is observed plus additional/external observations, knowledge, and theories. I wonder what it would be like if the following were fit using logistic regression: detection limits, Eddington bias, incompleteness, absorption, differential emission measures, opacity, etc., plus brute-force Monte Carlo simulations emulating the likely data to be fit. The responses would then be the probability of observed vs. not observed as a function of redshift, magnitudes, counts, flux, wavelength/frequency, and other measurable or latent variables.

My simple reasoning, that astronomers observe partially and will never have a complete sample, had led me to expect that logistic regression would appear in the astronomical literature rather frequently. Against my bet, it was [MADS]. All statistical software has packages or modules for logistic regression; once you have a data set, the application is very straightforward.

———————————[added]
Although logistic regression models are covered in many good tutorials, papers, and websites, it might be useful to give a simple but intuitive form of logistic regression for sloggers.

When you have binary responses, e.g. metal poor star (Y=1) vs. metal rich star (Y=0), and predictors such as colors, distance, parallax, precision, and other columns in catalogs (X is a matrix comprised of these variables), the model is
logit(Pr(Y=1|X)) = \log \frac{Pr(Y=1|X)}{1-Pr(Y=1|X)} = \beta_0 + \mathbf{X}^T \mathbf{\beta}.
As astronomers fit a linear regression model to get the intercept and slope, the same approach is applied to get the intercept and coefficients of a logistic regression model.
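As a minimal sketch in R (with simulated data and made-up variable names, not from any real catalog), such a model is fit with glm() and a binomial family:

  # simulate a toy catalog: metal-poor (Y = 1) vs. metal-rich (Y = 0) stars
  set.seed(42)
  n     <- 300
  color <- rnorm(n)                        # hypothetical color index
  plx   <- rexp(n)                         # hypothetical parallax
  eta   <- -0.5 + 1.2 * color - 0.8 * plx  # true linear predictor (assumed)
  y     <- rbinom(n, 1, plogis(eta))       # binary response

  # logistic regression: logit(Pr(Y = 1 | X)) = beta_0 + X^T beta
  fit <- glm(y ~ color + plx, family = binomial(link = "logit"))
  summary(fit)                           # intercept and coefficients with standard errors
  head(predict(fit, type = "response"))  # fitted probabilities Pr(Y = 1 | X)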

[MADS] Kalman Filter
http://hea-www.harvard.edu/AstroStat/slog/2009/mads-kalman-filter/
Fri, 02 Oct 2009 03:18:32 +0000 hlee

I decided a while ago to discuss the Kalman filter on the slog after finding out that this popular methodology is rather underrepresented in astronomy. However, it is not completely missing from ADS: a full-text search across all bibliographic sources shows more results. Their use of the Kalman filter, though, looked similar to the usage of “genetic algorithms” or “Bayes theorem.” Probably the broad notion of the Kalman filter makes it difficult to find its applications in astronomy by name, since wheels are often reinvented (algorithms under different names share the same objective).

When I learned about the “Kalman filter” for the first time, I was not sure how to distinguish it from the “Yule-Walker equations” (time series), the “Padé approximant” (unfortunately, the wiki page does not give its matrix form), the “Wiener filter” (signal processing), etc. Here are the publications that specifically mention the name Kalman filter in their abstracts, found from ADS.

My motivation for introducing the Kalman filter, even though it is a very well known term, is the recent Fisher Lecture given by Noel Cressie at the JSM 2009. He is a leading expert in spatial statistics and the author of a very famous book on the subject. During his presentation, he described challenges from satellite data and how the Kalman filter accelerated the computation of a gigantic covariance matrix in kriging. Satellite data in meteorology and the geosciences may not exactly match astronomical satellite data, but from a statistical modeling perspective the challenges are similar: massive data, streaming data, multiple dimensions, temporal structure, missing observations in certain areas, different exposure times, estimation and prediction, interpolation and extrapolation, large image sizes, and so on. It is not just about denoising/cleaning images. Statisticians want to find the driving force behind certain features by modeling and to perform statistical inference. (They do not mind parametrizing the metric/measure/quantity of interest for modeling, or they approach the problem in a nonparametric fashion.) I understood the use of the Kalman filter as a fast solution to inverse problems for inference.
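For readers who have never seen one written down, here is a minimal one-dimensional Kalman filter in base R, tracking a slowly drifting signal observed with noise; the state-space model and every parameter value are invented purely for illustration.

  # toy state-space model: x_t = x_{t-1} + w_t,  y_t = x_t + v_t
  set.seed(7)
  n <- 100
  q <- 0.01   # process noise variance (assumed)
  r <- 1.0    # measurement noise variance (assumed)
  x <- cumsum(rnorm(n, sd = sqrt(q)))   # true hidden state
  y <- x + rnorm(n, sd = sqrt(r))       # noisy observations

  xhat <- numeric(n); P <- numeric(n)
  xprev <- 0; Pprev <- 1                # initial state estimate and variance
  for (t in 1:n) {
    # predict
    xpred <- xprev
    Ppred <- Pprev + q
    # update with observation y[t]
    K       <- Ppred / (Ppred + r)      # Kalman gain
    xhat[t] <- xpred + K * (y[t] - xpred)
    P[t]    <- (1 - K) * Ppred
    xprev <- xhat[t]; Pprev <- P[t]
  }

  plot(y, col = "grey", pch = 20)       # data
  lines(x, lwd = 2)                     # truth
  lines(xhat, col = 2, lwd = 2)         # filtered estimate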

[MADS] compressed sensing
http://hea-www.harvard.edu/AstroStat/slog/2009/mads-compressed-sensing/
Fri, 11 Sep 2009 04:20:54 +0000 hlee

Soon it will no longer qualify as [MADS], because I have seen some abstracts containing the phrase “compressed sensing” on arxiv.org. Nonetheless, there is only one refereed article in ADS so far.

http://adsabs.harvard.edu/abs/2009MNRAS.395.1733W.
Title:Compressed sensing imaging techniques for radio interferometry
Authors: Wiaux, Y. et al.
Abstract: Radio interferometry probes astrophysical signals through incomplete and noisy Fourier measurements. The theory of compressed sensing demonstrates that such measurements may actually suffice for accurate reconstruction of sparse or compressible signals. We propose new generic imaging techniques based on convex optimization for global minimization problems defined in this context. The versatility of the framework notably allows introduction of specific prior information on the signals, which offers the possibility of significant improvements of reconstruction relative to the standard local matching pursuit algorithm CLEAN used in radio astronomy. We illustrate the potential of the approach by studying reconstruction performances on simulations of two different kinds of signals observed with very generic interferometric configurations. The first kind is an intensity field of compact astrophysical objects. The second kind is the imprint of cosmic strings in the temperature field of the cosmic microwave background radiation, of particular interest for cosmology.

As discussed, reconstructing images from noisy observations is typically considered an ill-posed, or inverse, problem. Owing to my personal lack of understanding of image reconstruction for radio interferometry, which samples Fourier space and applies an inverse Fourier transform, I cannot judge how good this new adaptation of compressed sensing to radio astronomical imaging is. I think, however, that compressed sensing will take over many traditional image reconstruction tools because of their shortcomings in handling sparsely represented, large data sets and images.
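To give a flavor of the idea without any radio interferometry specifics, the following R sketch recovers a sparse signal from far fewer random linear measurements than unknowns using plain iterative soft-thresholding; it is a generic toy example of compressed sensing, not the convex optimization framework of Wiaux et al., and all sizes and parameters are arbitrary.

  # sparse recovery toy: n unknowns, m << n random Gaussian measurements
  set.seed(3)
  n <- 200; m <- 60; k <- 5
  x0 <- numeric(n); x0[sample(n, k)] <- rnorm(k, sd = 3)   # k-sparse truth
  A  <- matrix(rnorm(m * n), m, n) / sqrt(m)               # measurement matrix
  y  <- A %*% x0 + rnorm(m, sd = 0.01)                     # noisy measurements

  # iterative soft-thresholding (ISTA) for l1-penalized least squares
  soft   <- function(z, t) sign(z) * pmax(abs(z) - t, 0)
  lambda <- 0.05
  L      <- max(eigen(crossprod(A), only.values = TRUE)$values)  # step size 1/L
  x      <- numeric(n)
  for (it in 1:500) {
    grad <- crossprod(A, A %*% x - y)        # gradient of the quadratic term
    x    <- soft(x - grad / L, lambda / L)   # gradient step + shrinkage
  }

  max(abs(x - x0))   # reconstruction error of the sparse vector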

Please check my old post on compressed sensing for more references on the subject, such as the Rice University repository, in addition to the references in Wiaux et al. It is an exciting new field with countless applications, already enjoying wide popularity in many scientific and engineering fields. My thought is that well developed compressed sensing algorithms might resolve bandwidth issues in satellite observations/communication by transmitting more images within fractional time intervals for improved image reconstruction.

[MADS] ARCH
http://hea-www.harvard.edu/AstroStat/slog/2009/mads-arch/
Fri, 04 Sep 2009 18:30:27 +0000 hlee

ARCH (autoregressive conditional heteroscedasticity) is a statistical model in which the variance of the current error term is a function of the (squared) sizes of the error terms in previous time periods. I heard that this model earned Prof. Engle a Nobel prize.

Heteroscedasticity in regression problems has been discussed in astronomy for various kinds of observations, since astronomers are interested in correlations and causality between variables in their data sets. Perhaps the heteroscedasticity in financial volatility has no obvious counterpart in astronomical time series data, which could be the primary reason this celebrated model does not appear in ADS. However, there are various time series models branched from ARCH that account for heteroscedastic errors, and I hope some of them can be useful to astronomers for analyzing time series data with inhomogeneous errors; a toy simulation is sketched below.
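As a minimal sketch of what the model produces, here is an ARCH(1) simulation in base R with invented parameter values; dedicated packages exist for fitting such models, but the defining recursion is short enough to write out directly.

  # ARCH(1): e_t = sigma_t * z_t,  sigma_t^2 = a0 + a1 * e_{t-1}^2
  set.seed(11)
  n  <- 1000
  a0 <- 0.1; a1 <- 0.6            # assumed parameters (a1 < 1 for stationarity)
  e  <- numeric(n); sigma2 <- numeric(n)
  sigma2[1] <- a0 / (1 - a1)      # unconditional variance as a starting value
  e[1] <- sqrt(sigma2[1]) * rnorm(1)
  for (t in 2:n) {
    sigma2[t] <- a0 + a1 * e[t - 1]^2        # conditional variance
    e[t]      <- sqrt(sigma2[t]) * rnorm(1)  # heteroscedastic error
  }

  op <- par(mfrow = c(2, 1))
  plot(e, type = "l", main = "ARCH(1) errors: volatility clustering")
  acf(e^2, main = "ACF of squared errors")   # squared errors are autocorrelated
  par(op)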

[Note] All links lead to wikipedia.org

[MADS] Kriging
http://hea-www.harvard.edu/AstroStat/slog/2009/mads-kriging/
Wed, 26 Aug 2009 02:19:26 +0000 hlee

Kriging is the first thing one learns in a spatial statistics course. If an astronomer sees its definition and applications, almost every one will say, “Oh, I know this! It is like the 2pt correlation function!!” At least this was my first impression when I met kriging.

There are three distinct subjects in spatial statistics: geostatistics, lattice data analysis, and spatial point pattern analysis. Because of the resemblance between the spatial distribution of observations in coordinates and the notion of spatially random points, spatial statistics in astronomy has leaned more toward spatial point pattern analysis than toward the other subjects. In other fields, from immunology to forestry to geology, whose data are associated with the spatial coordinates of underlying geometric structures or were sampled on lattices, observations depend on these spatial structures, and scientists enjoy various applications of geostatistics and lattice data analysis. In particular, kriging is the fundamental notion of geostatistics, and its applications are found in many fields.

Hitherto, I had expected that the term kriging could be found rather frequently in analyses of cosmic microwave background (CMB) data or of large extended sources, wide enough to warrant statistical models of the expected geometric structure and its uncertainty (or interpolating observations via BLUP, best linear unbiased prediction). Against my anticipation, only one refereed paper emerged from ADS:

Topography of the Galactic disk – Z-structure and large-scale star formation
by Alfaro, E. J., Cabrera-Cano, J., and Delgado (1991)
in ApJ, 378, pp. 106-118

I attribute this shortage of kriging applications in astronomy to missing data and differential exposure times across the sky. Both require underlying modeling, to fill the gaps or to convolve with the observed data, in order to compensate for the unequal sky coverage. Traditionally, kriging is applied only to localized geological areas where missing values and unequal coverage are of little concern. As many surveys and probing missions aim for wide sky coverage, we always see gaps and selection biases in telescope pointing directions. Once these characteristics of missingness are understood and incorporated into models of spatial statistics, I believe statistical methods for spatial data could reveal more information about our Galaxy and the universe. A bare-bones sketch of the kriging computation is given below.
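To show what the BLUP computation actually involves, here is a bare-bones ordinary kriging sketch in base R with an exponential covariance whose parameters are simply assumed rather than estimated; in practice one would first fit a variogram/covariance model, typically with a dedicated geostatistics package, and the simulated field here is entirely artificial.

  # ordinary kriging at one target location, exponential covariance assumed
  set.seed(5)
  n  <- 30
  s  <- cbind(runif(n), runif(n))                                  # observation locations
  z  <- sin(4 * s[, 1]) + cos(4 * s[, 2]) + rnorm(n, sd = 0.1)     # observed field
  s0 <- c(0.5, 0.5)                                                # prediction location

  covfun <- function(h, sill = 1, rng = 0.3) sill * exp(-h / rng)  # assumed covariance
  D  <- as.matrix(dist(s))                  # pairwise distances between observations
  d0 <- sqrt(colSums((t(s) - s0)^2))        # distances from observations to the target

  # ordinary kriging system: [C 1; 1' 0] [w; mu] = [c0; 1]
  A   <- rbind(cbind(covfun(D), 1), c(rep(1, n), 0))
  b   <- c(covfun(d0), 1)
  sol <- solve(A, b)
  w   <- sol[1:n]                           # kriging weights (they sum to one)

  zhat  <- sum(w * z)                       # BLUP of the field at s0
  krvar <- covfun(0) - sum(sol * b)         # kriging variance at s0
  c(prediction = zhat, variance = krvar)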

Good news for astronomers is that nowadays more statisticians and geoscientists are working on spatial data, particularly from satellites. These data are not very different from traditional astronomical data, except for the direction in which the satellite aims (inward or outward). Therefore, their data have the typical properties of astronomical data: missingness, unequal sky coverage or exposure, and sparse but gigantic images. Thanks to growing computational power and developments in hierarchical modeling, techniques in geostatistics are being developed to handle these massive but sparse images for statistical inference. Beyond denoising images, they also aim to produce a measure of uncertainty associated with complex spatial data.

For those who are interested in what spatial statistics does, there are a few books I’d like to recommend.

  • Cressie, N. (1993) Statistics for Spatial Data
    (the bible of spatial statistics)
  • Stein, M. L. (2002) Interpolation of Spatial Data: Some Theory for Kriging
    (it is about kriging and written by one of the scholarly pinnacles of spatial statistics)
  • Banerjee, Carlin, and Gelfand (2004) Hierarchical Modeling and Analysis for Spatial Data
    (Bayesian hierarchical modeling is explained; very pragmatic, but it could give the impression that it is somewhat limited for applications in astronomy)
  • Illian et al. (2008) Statistical Analysis and Modelling of Spatial Point Patterns
    (well, I still think spatial point pattern analysis is more dominant in astronomy than geostatistics, so I feel obliged to include a book on it; in that case I must mention Peter Diggle's books too)
  • Diggle (2004) Statistical Analysis of Spatial Point Patterns
  • Diggle and Ribeiro (2007) Model-based Geostatistics
[MADS] Adaptive filter
http://hea-www.harvard.edu/AstroStat/slog/2009/mads-adaptive-filter/
Thu, 04 Jun 2009 21:36:50 +0000 hlee

Please do not confuse the adaptive filter (hereafter, AF) with adaptive optics (hereafter, AO). I have no expertise in either field but enough small experiences to tell you the difference. Simply put, AF is comparable to software, whereas AO is hardware, built into telescopes to collect sharper data and to minimize time-varying atmospheric blurring. When you search for adaptive filter in ADS, you will more likely come across adaptive optics and notch filters.

Beyond the hardware part, I was interested in how astronomers have applied the theories and algorithms of AF to acquire cleaner data, processed with speed and efficiency. AF is used for analyzing data as well as for optimizing filters, analytically and statistically. My impression of AF textbooks, at least, is that they read like special topics in applied statistics or optimization. Long story short, against my expectation of frequent mentions in ADS, the adaptive filter turned out to be almost [MADS].

2005AJ....130.2916P
Title:Programmable Real-Time Cancellation of GLONASS Interference with the Green Bank Telescope
Authors:Poulsen, A. J.; Jeffs, B. D.; Warnick, K. F.; Fisher, J. R.

I have not investigated the usage of AF in astronomy in a general manner. Therefore, it is possible that the principles and algorithms discussed in books like Adaptive Filtering Primer with MATLAB (A. D. Poularikas and Z. M. Ramadan), Adaptive Filters (A. H. Sayed), and Adaptive Filter Theory (S. Haykin) are already presented in the astronomy literature under different names. If not, then with AF algorithms assisting data transmission, compression, evaluation, coding, decoding, transformation, or processing, whatever it is called depending on the scientist's specialty, one can achieve faster and more efficient signal and image processing with almost no loss of information. A minimal example is sketched below.
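For concreteness, here is a least-mean-squares (LMS) adaptive filter sketch in base R, cancelling a narrow-band interference from a noisy signal; LMS is only one of many AF algorithms covered in the books above, and the scenario and all parameter values here are invented.

  # LMS adaptive interference cancellation (toy example)
  set.seed(9)
  n      <- 2000
  tt     <- 1:n
  signal <- sin(2 * pi * tt / 200)                  # slowly varying signal of interest
  interf <- 0.8 * sin(2 * pi * tt / 13 + 0.4)       # narrow-band interference
  d      <- signal + interf + rnorm(n, sd = 0.05)   # primary input (signal + interference)
  x      <- sin(2 * pi * tt / 13)                   # reference input correlated with interference

  M  <- 8               # filter order (assumed)
  mu <- 0.05            # LMS step size (assumed)
  w  <- numeric(M)      # adaptive filter weights
  e  <- numeric(n)      # error signal = cleaned output
  for (k in M:n) {
    xk   <- x[k:(k - M + 1)]        # most recent M reference samples
    yhat <- sum(w * xk)             # filter output (interference estimate)
    e[k] <- d[k] - yhat             # error approximates the clean signal
    w    <- w + mu * e[k] * xk      # LMS weight update
  }

  plot(d[1:600], type = "l", col = "grey")   # raw input
  lines(e[1:600], col = 2)                   # adaptively cleaned output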

[MADS] data depth
http://hea-www.harvard.edu/AstroStat/slog/2009/mads-data-depth/
Tue, 02 Jun 2009 02:51:29 +0000 hlee

How would you assign an order to multivariate data? If you have a strategy to achieve this ordering task, I would like to ask, “is your strategy affine invariant?”, meaning shift and rotation invariant.

Let's make the question a bit more concrete. Which would be the largest point in the scatter plot below? Thirty random bivariate data points were generated from a bivariate normal distribution (mu=[1,1], sigma=[[1,-.9],[-.9,1]]). [Figure: scatter plot of the 30 simulated points.] If you have applied your rule for assigning ranks to some largest or smallest points, could you continue to apply it to all the data points? Once you have done so, shift your data with respect to the sample mean (not mu; although I told you mu, the data do not tell you the true mu, but we can estimate mu and assign an uncertainty to that estimate via various methods, beyond computing the sample standard deviation or traditional chi-square methods for more complex data and models; the mean, a solution of the chi-square or least squares method for the hypothesis (model) mu=\bar{X}, is only one estimator among many) and rotate them clockwise by 90 degrees. Does each point preserve the same rank when you apply your ranking method to the shifted and rotated multivariate data?

First, if you think about coordinate-wise ranking, you will find that the two sets of ranks do not match. Additional questions I would like to add are: which axis would you consider first, X or Y? And how do you handle ties in one coordinate? (Since the bivariate normal is a continuous distribution, the chance of ties is zero, but real data often show ties.)

The notion of data depth is not employed in astronomy, nor much in other fields, because variables have physical meaning: the multidimensional data cloud is projected onto the most important axis, and from that projection luminosity functions/histograms are drawn to discover a trend, say a clump at a certain color or redshift indicating an event, epoch, or episode in formation history. I wonder whether such clumps still look distinct when they are deprojected into the original multidimensional space. Remember the curse of dimensionality: even though the projected data show bumps, a hypothesis test in the full multidimensional space may conclude complete spatial randomness, i.e. that there is no clump. Consider also transformations and distance metrics, or how nearest neighbors are defined. For instance, to compute distances between pairs of galaxies for a statistical analysis, the galaxies have location metrics (longitude, latitude, and redshift, i.e. 3D spatial data) as well as scales such as magnitudes, group names, types, shapes, and classification labels, which cannot be fed directly into a Euclidean distance. Choices of metrics and transformations for ranking data differ between mathematical/algorithmic standpoints and practical/astronomical viewpoints. The importance of ordering data lies in 1. identifying/removing outliers, 2. characterizing the data distribution quickly (for example, contour plots can be accelerated), and 3. updating statistics efficiently, by looking only at neighbors instead of processing the whole data set.

The other reason that affine-invariant ordering of multivariate data finds no use in astronomy can be attributed to its nonparametric nature. So far, my impression is that nonparametric statistics is relatively unknown territory for astronomers. Unless a nonparametric statistic is designed for hypothesis testing and interval estimation, it does not give the confidence levels/intervals that astronomers always like to report ("how confident is this ordering of multivariate data?" or "what is the N sigma uncertainty that this data point is the center, within plus or minus 2?"). Nonparametric bootstrap or jackknife methods are rarely used to compute uncertainties to compensate for the fact that a nonparametric statistic alone offers only a point estimate (best fit). Regardless of the lack of popularity of these robust point estimators from nonparametric statistics, the notion of data depth and nonparametric methods, I believe, would help astronomers retrieve summary statistics quickly from the multidimensional, massive, streaming data of future surveys. This is the primary reason to introduce and advertise data depth in this post.

Data depth is a statistic that measures the rank of multivariate data. It assigns a numeric value to a vector (data point) based on its centrality relative to the data cloud. A statistical depth function satisfies the following conditions:

  1. affine invariance
  2. maximality at the center: the deepest point, the depth-based median, attains the highest rank and plays the role that the median plays for univariate data
  3. monotonicity: depth decreases monotonically along any ray from the deepest point
  4. the depth goes to zero as the Euclidean distance of the datum goes to infinity

Nowadays, observing in several bands is more typical than observing in only two or three. Structures such as a bull's eye or the fingers of god are not visible in a 1D projection if location coordinates are considered instead of colors. To understand such multidimensional data, as indicated above, this [MADS], data depth, might enhance the efficiency of data pipelines and of computing summary statistics. It is related to handling multivariate outliers and to storing/discarding unnecessary or unscientific data. Data points on the surface of the data cloud, or tiny parasite-like clumps accompanying the important data cloud, are neither visible nor identifiable when the whole data set is projected onto one or two dimensions. We can define a train of data summaries from data depths matching the traditional mean, standard deviation, skewness, and kurtosis measures. If a datum or a tiny clump does not belong to that set, we can discard those outliers with statistical justification and separate them from the main archive/catalog. I'd like to remind you that a star of modest color does not necessarily belong to the main sequence: using a 1D histogram projected onto a color axis, you would not know whether the star is a pre-main-sequence star, a giant, or a white dwarf. This awareness of seeing a data set in multidimensional space, and of adopting suitable multivariate statistics, is critical when a query is executed on massive, streaming data. Otherwise, one gets biased data due to coordinate-wise trimming. Also, data-depth-based, algorithmically efficient methods could save CPU time, storage, and the human labor of validation.

We often see data trimming based on univariate projections and N sigma rules. Such trimming tends to remove precious data points. I want a strategy that removes the parasites, not my long tail: if the vertebral column is the principal axis, the tail and the long nose lie farther out than the parasites swimming near the belly. Coordinate-wise data trimming with N sigma rules will chop off the tail and the nose first, and only later may (or may not) remove the parasites. Without verification by human eyes, virtual observers (people using virtual observatories for their research) are likely to receive data with parts of the head and tail missing, while the parasites are not removed. This analogy applies to field stars vs. stellar clusters, dwarf galaxies vs. the main galaxy, binaries vs. single stars, population II vs. population I stars, and so on. Outliers are not necessarily bad guys; important discoveries have been made by looking into outliers. My point is that it is important to identify outliers with respect to the multidimensional data distribution, instead of projecting the data onto the axis of the vertebral column.

In order to learn more about data depth, I recommend these papers: long, but crisp and delicious to read.

  • Multivariate analysis by data depth: Descriptive statistics, graphics and inference (with discussion)
    Liu, R., Parelius, J., and Singh, K. (1999) Ann. Stat., 27, 783-858
    (a comprehensive survey paper on data depth)
  • Mathematics and the Picturing of Data
    Tukey, J.W. (1974) Proceedings of the International Congress of Mathematicians, 2, 523-531
  • The Ordering of Multivariate Data
    Barnett, V. (1976) J. R. Stat. Soc. A, 139(3), 318-355
    (it answered, more than three decades ago, how to order multivariate data)
  • You might also want to check the publications by R. Serfling.

As we have various distance metrics, there are various statistical depths; I already mentioned one in my [MADS] post on the Mahalanobis distance. Here is an incomplete list of data depths that I have heard of:

  • Mahalanobis depth
  • simplicial depth
  • half-space depth (Tukey depth)
  • Oja depth
  • convex hull peeling depth
  • projection depth
  • majority depth

I do not know the definitions of all these depths or how to compute them (General notions of statistical depth function, Zuo and Serfling, 2000, Annals of Statistics, 28, pp. 461-482, is very useful for a theoretical understanding of statistical data depth). What I do know is that, except for the Mahalanobis depth, the computation of these depths is relatively simple; the reason for excluding the Mahalanobis depth is that it requires estimating a large covariance matrix and inverting it. Ordering multivariate data with these depth measures, including defining medians and quantiles, is statistically more robust than coordinate-wise means or additive errors. Also, removing outliers based on statistical depths makes more statistical sense while being less burdensome for computers and human labor. Speaking of contouring, once a numeric rank has been assigned to each point, collecting the multivariate data points at a given rank level leads quickly to a contour. A small sketch of the idea is given below.
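As a minimal sketch of depth-based ordering (using the Mahalanobis depth only because base R provides mahalanobis(); the other depths require dedicated packages), the following ranks a bivariate sample like the one described above:

  # Mahalanobis depth: D(x) = 1 / (1 + (x - mu)' S^{-1} (x - mu))
  set.seed(30)
  library(MASS)   # for mvrnorm()
  Sigma <- matrix(c(1, -0.9, -0.9, 1), 2, 2)
  X     <- mvrnorm(30, mu = c(1, 1), Sigma = Sigma)    # 30 bivariate points

  md    <- mahalanobis(X, center = colMeans(X), cov = cov(X))  # squared distances
  depth <- 1 / (1 + md)                  # larger depth = more central point

  ord <- order(depth, decreasing = TRUE) # depth-based ordering of the sample
  X[ord[1], ]    # the deepest point, a depth-based multivariate median
  X[ord[30], ]   # the shallowest point, an outlier candidate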

The story turned out to be longer than I expected, but the one take-home message is:

Q: How would you assign an order to multivariate data?
A: Statisticians have developed the notion of data depth to answer that question.

———————————————————————
NOTE: The usages of depth in statistics and astronomy are different. In astronomy, depth is associated with the distance a telescope can reach; in fact, it is a euphemism. To be more precise, the depth of a telescope is roughly defined by how dim a source can be before the telescope can no longer catch its photons. Deep sky surveys imply that the telescope(s) in the survey projects can see deeper, i.e. dimmer objects, than previous ones. In statistics, depth is more abstract and conceptual, based on its definition: data depth is a multivariate version of the quantile for univariate data, for which one can easily order data points from smallest to largest.

[MADS] Law of Total Variance
http://hea-www.harvard.edu/AstroStat/slog/2009/mads-law-of-total-variance/
Fri, 29 May 2009 04:54:20 +0000 hlee

This simple law, despite my attempt at a full-text search, does not show up in ADS. As discussed in the post on systematic errors, astronomers, like physicists, report their errors in two additive components: statistical error + systematic error. To explain such a decomposition and to make error analysis statistically rigorous, the law of total variance (LTV) seems indispensable.

V[X] = V[E[X|Y]] + E[V[X|Y]]

(X and Y are random variables, and X denotes the observed data. V and E stand for variance and expectation, respectively. Instead of X, f(X_1,…,X_n) can be plugged in to represent a best fit; in other words, a best fit is a solution of the chi-square minimization, which is a function of the data.) For a Bayesian, the uncertainty of theta, the parameter of interest, is

V[theta]=V[E[theta|Y]] + E[V[theta|Y]]

Suppose Y is related to the systematics. E[theta|Y] is a function of Y, so V[E[theta|Y]] indicates the systematic error. V[theta|Y] is the statistical error given Y, and its average E[V[theta|Y]] is the overall statistical error; this reflects the fact that unless the parameter of interest and the systematics are independent, the statistical error cannot be quantified as a single factor attached to a best fit. If the parameter of interest theta is independent of Y and Y is fixed, then the uncertainty in theta comes solely from the statistical uncertainty (let us not consider “model uncertainty” for the time being). A small numerical illustration is given below.
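Here is a tiny Monte Carlo check of the law of total variance in R, with an invented toy model in which Y plays the role of a systematic (for example an unknown calibration offset) and X is the measurement:

  # toy model: Y ~ N(0, tau^2) is a systematic offset, X | Y ~ N(mu + Y, sigma^2)
  set.seed(2)
  mu <- 10; tau <- 0.5; sigma <- 1.0
  N  <- 1e6
  Y  <- rnorm(N, 0, tau)           # systematic component
  X  <- rnorm(N, mu + Y, sigma)    # observed data given the systematic

  var(X)           # total variance V[X], estimated by simulation
  tau^2 + sigma^2  # V[E[X|Y]] + E[V[X|Y]] = tau^2 + sigma^2, matching up to Monte Carlo error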

In conjunction with astronomers' decomposition into systematic and statistical errors, or representing uncertainties in quadrature (error_total^2 = error_stat^2 + error_sys^2), statisticians use the mean squared error (MSE) as the total error, in which the variance corresponds to the statistical error and the squared bias to the systematic error.

MSE = Variance+ bias^2

Now a question arises: is systematic error bias? Methods based on quadrature, or on parametrizing the systematics for marginalization, treat systematic error as bias, although no account says so explicitly. According to the law of total variance, unless the components are orthogonal/independent, quadrature is not the proper way to handle the systematic uncertainties prevailing in all instruments. Generally, parameters (data) and systematics are nonlinearly correlated and hard to factorize (instrument-specific empirical studies exist that offer correction factors for systematics; however, such factors work only in specific cases, and the process of defining them is hard to generalize). Because of the varying nature of systematics over the parameter space, instead of the MSE,

MISE = E \int [\hat f(x) - f(x)]^2 dx

or mean integrated squared error might be of use. The estimator \hat f(x) of f(x) can be obtained either parametrically or nonparametrically while incorporating systematics and their correlation structure with the statistical errors as functions of some domain variable x. MISE can be viewed as a robust version of the chi-square method, but the details have not been explored to account for the following identity.

MISE=\int [E\hat f(x) -f(x)]^2 dx + \int Var \hat f(x) dx

This equation may or may not look simple. Perhaps the expansion of the above identity explains the error decomposition better.

MISE(\hat f) = E \int [\hat f(x) - f(x)]^2 dx = \int E[(\hat f(x) - f(x))^2] dx = \int MSE_x(\hat f) dx
= \int [E\hat f(x) - f(x)]^2 dx + \int Var[\hat f(x)] dx
= integrated squared bias + integrated variance (overall systematic error + overall statistical error)

Furthermore, it robustly characterizes uncertainties from systematics, i.e. calibration uncertainties, in data analysis. Note that estimating f(x) by \hat f(x) can reflect complex structures in the uncertainty analysis, whereas the chi-square minimization estimates f(x) via piecewise horizontal straight lines, assumes a homogeneous error in each piece (bin), forces statistical and systematic errors to be orthogonal, and, as a consequence, inflates the size of the error or produces biased best fits.

Whether one uses the LTV, MSE, or MISE, we usually do not know the true model f(x); if it is unknown, assessing statistical results such as confidence levels/intervals may not be feasible. The reason chi-square methods offer best fits and N sigma error bars is the assumption that the true model is Gaussian, i.e. N(f(x),\sigma^2), or E[Y|X]=f(X)+\epsilon with V[\epsilon]=\sigma^2, where f(x) is the source model. On the other hand, Monte Carlo simulations, resampling methods like the bootstrap, and posterior predictive probabilities (ppp) allow one to infer the truth and to evaluate a p-value indicating one's confidence in the fitting result in a nonparametric fashion. Setting up proper models for \hat f(x) or theta|Y would therefore help assess the total error in a more realistic manner than chi-square minimization, additive errors, gaussian quadrature, or subjective expertise on systematics. The underlying notions and related theoretical statistics methodologies of the LTV, MSE, and MISE could clarify questions such as how to quantify systematic errors and how systematic uncertainties are related to statistical uncertainties. Well, nothing would make me and astronomers happier than if those errors were independent and additive. Even more exuberant, the systematic uncertainty could then be factorized.

[MADS] Chernoff face
http://hea-www.harvard.edu/AstroStat/slog/2009/mads-chernoff-face/
Thu, 02 Apr 2009 16:00:41 +0000 hlee

I cannot remember when I first met the Chernoff face, but it hooked me instantly. I always hoped to confront multivariate data from astronomy to which this charming EDA method could be applied. Then that eagerness somehow faded without my realizing what was happening. Tragically, this was mainly due to my absent-mindedness.

After meeting Prof. Herman Chernoff unexpectedly (I didn't know he is Professor Emeritus at Harvard), the urge revived, but I still didn't have data. Alas, more absent-mindedness: I don't understand why I didn't realize until today that I already had the data, XAtlas, for trying Chernoff faces. The data and their full description can be found on the XAtlas website (click). For Chernoff faces, the references suggested in Wiki:Chernoff face are good. I believe some folks are already familiar with Chernoff faces from a New York Times article last year, listed in the Wiki entry (or a subset characterized by baseball lovers?).

Capella is an X-ray bright star observed multiple times for Chandra calibration. In the figures below, I list 16 ObsIDs at each face, out of the 18+ Capella observations (the last time I checked the Chandra Data Archive, 18 Capella observations were available). These 16 are high-resolution observations from which various metrics, like interesting line ratios and line-to-continuum ratios, can be extracted. I was told that optically it is hard to find any evidence that Capella experienced catastrophic changes during the Chandra mission (about 10 years in orbit), but the story in X-rays could be different. In a dismally short time period (10 years is a flash or less for a star), Capella could have revealed short-time-scale high-energy activities via Chandra. I just wanted to illustrate that Chernoff faces could help visualize such changes or any peculiarities through interpretation-friendly facial expressions (studies have confirmed babies' ability to recognize facial expressions). So, what do you think? Do the faces look similar or different to you? Can you offer me astronomical reasons why a certain face (ObsID) differs from the rest?

[Figures: Chernoff faces for 16 Capella ObsIDs, drawn with faces() (left) and faces2() (right).]

p.s. In order to draw these Chernoff faces, check the descriptions of the R functions faces() (which yields the left figure) and faces2() (which yields the right figure) by clicking on the function of your interest. There are other variations, and other data analysis systems offer differently fashioned tools for drawing Chernoff faces to explore multivariate data. Requests for plots in pdf are welcome; these jpeg files look too coarse on my screen.
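For readers who want to reproduce this kind of plot, here is a minimal usage sketch with random numbers standing in for the XAtlas line ratios; I am assuming that faces() and faces2() are the ones provided by the aplpack and TeachingDemos packages, respectively, which is where I have seen them, not something stated in this post.

  # toy stand-in for the XAtlas metrics: 16 observations x 6 line-ratio-like variables
  set.seed(1199)
  X <- matrix(rnorm(16 * 6), nrow = 16,
              dimnames = list(paste0("ObsID_", 1:16), paste0("ratio", 1:6)))

  # install.packages(c("aplpack", "TeachingDemos"))   # assumed packages
  library(aplpack)
  faces(X)          # one Chernoff face per row, as in the left figure

  library(TeachingDemos)
  faces2(X)         # alternative rendering, as in the right figure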

p.p.s. The variables used for these faces are line ratios and line-to-continuum ratios. The order of these input variables could change a countenance, but the impressions from the faces will not change (a face with distinctive shapes will look different from the other faces even after the order of the metrics/variables is scrambled or a different Chernoff face drawing tool is used). The mapping, say from an astronomical metric to the length of the lips, was not studied in this post.

p.p.p.s. Some data points are statistical outliers, and I am not sure how to explain the strange numbers (unrealistic values for line ratios). I hope astronomers can help me interpret those peculiar numbers in the line/continuum ratios. My role is to show that statistics can motivate astronomers toward new discoveries and to offer different graphics tools for enhancing visualization. I hope these faces motivate some astronomers to look into Capella in XAtlas (and beyond) in detail with different spectacles, and to find the reasons for the different facial expressions in the Capella X-ray observations. In particular, ObsID 1199 is the most questionable to me.
