The AstroStat Slog » bootstrap

missing data

hlee — Mon, 27 Oct 2008 13:24:22 +0000

The notion of missing data differs between the two communities. I tend to think that missing data carry as much information as observed data. Astronomers… I’m not sure how they think, but my impression so far is that when one attribute/variable of an object/observation/informant is missing, all other attributes of that object become useless, because the object is excluded from the scientific data analysis or model evaluation process. For example, it is hard to find any discussion of imputation in astronomical publications, or any statistical justification of how missing data are treated with respect to inference strategies. Instead, astronomers talk about incompleteness within different variables. To replace this vague argument with a concrete example, consider a catalog of multiple magnitudes. To draw a color magnitude diagram, one needs both color and magnitude. If one attribute is missing, that star will not appear in the color magnitude diagram, and no inference method based on that diagram will include that star. Nonetheless, one will try to understand how the proportions of observed stars vary with color and magnitude.

I guess this cultural difference originates from the nature of the data. The data sets that statisticians typically handle are small enough that a child could count the data points. Astronomical data sets are so large that only rounded numbers of stars in a catalog are discussed, and dropping some missing data won’t affect the final results.

Introducing how statisticians handle missing data may benefit astronomers who handle small catalogs because of observational challenges in the survey. Such data with missing values can be put through statistically rigorous data analysis processes instead of ad hoc procedures for obtaining complete cases, which risk throwing away many data points.

In statistics, utilizing the information in missing data adds information in the direction that the inference method tries to retrieve. Even if they are larger, it’s better to have error bars than nothing. My question is: what are the statistical proposals for astronomers to handle missing data? Although I would like to find such a list, I instead give a few somewhat nontechnical papers that explain the missing data types described below, and a few statistics books/articles that statisticians often cite.

Data mining and the impact of missing data by M.L. Brown and J.F.Kros, Industrial Management and Data Systems (2003) Vol. 103, No. 8, pp.611-621
Missing Data: Our View of the State of the Art by J.L.Schafer and J.W.Graham, Psychological Methods (2002) Vol.7, No. 2, pp. 147-177
Missing Data, Imputation, and the Bootstrap by B. Efron, JASA (1994) 89 426 p. 463- and D.B.Rubin’s comment
The multiple imputation FAQ page (web) by J. Schafer
Statistical Analysis with Missing Data by R.J.A. Little and D.B.Rubin (2002) 2nd ed. New York: Wiley.
The Curse of the Missing Data (web) by Yong Kim
A Review of Methods for Missing Data by T.D.Pigott, Edu. Res. Eval. (2001) 7(4),pp.353-383 (survey of missing data analysis strategies and illustration with “asthma data”)

Pigott discusses missing data methods for a general audience in plain terms under the following categories: complete cases, available cases, single-value imputation, and more recent model-based methods, namely maximum likelihood for multivariate normal data and multiple imputation. Readers craving more information should see Schafer and Graham or the books by Schafer (1997) and Little and Rubin (2002).

Most introductory articles begin with common assumptions like missing at random (MAR) or missing completely at random (MCAR), but these seem not to apply to typical astronomical data sets (I don’t know exactly why yet, and I cannot provide counterexamples to prove it, but that’s what I have observed and was told). Currently, I would like to find ways to link statistical thinking about missing data and modeling to astronomical data with missing values by discovering commonalities in their missing-data properties. I hope you can help me and others in such efforts. For your information, the following are short definitions of these assumptions:

data missing at random : missing for reasons related to completely observed variables in the data set
data missing completely at random : the complete cases are a random sample of the originally identified set of cases
non-ignorable missing data : the reasons for the missing observations depend on the values of those variables.
outliers treated as missing data
the assumption of an ignorable response mechanism.
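The first three definitions can be made concrete with a small simulation. The sketch below is only an illustration under assumed settings: the toy two-column catalog, the drop probabilities, and the function names (`make_catalog`, `y_is_missing`, `estimate_mean_y`) are all hypothetical and come from none of the cited papers. It drops values of y under each mechanism and compares the complete-case mean of y with a regression-adjusted mean that uses the always-observed x. The true mean of y is 0.

```python
import random
import statistics


def make_catalog(n=20000, seed=0):
    """Toy catalog: x is always observed, y = x + noise may go missing."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def y_is_missing(x, y, mechanism, rng):
    """Decide whether y is dropped under the named (hypothetical) mechanism."""
    if mechanism == "MCAR":   # unrelated to any value in the catalog
        return rng.random() < 0.5
    if mechanism == "MAR":    # depends only on the fully observed x
        return x > 0 and rng.random() < 0.9
    if mechanism == "MNAR":   # non-ignorable: depends on the unseen y itself
        return y > 0 and rng.random() < 0.9
    raise ValueError(mechanism)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def estimate_mean_y(xs, ys, mechanism, seed=1):
    """Return (complete-case mean of y, regression-adjusted mean of y)."""
    rng = random.Random(seed)
    kept = [(x, y) for x, y in zip(xs, ys)
            if not y_is_missing(x, y, mechanism, rng)]
    kx = [x for x, _ in kept]
    ky = [y for _, y in kept]
    slope, intercept = fit_line(kx, ky)
    # Predict y for every object from its observed x, then average.
    adjusted = statistics.fmean([intercept + slope * x for x in xs])
    return statistics.fmean(ky), adjusted


xs, ys = make_catalog()
for mechanism in ("MCAR", "MAR", "MNAR"):
    complete_case, adjusted = estimate_mean_y(xs, ys, mechanism)
    print(f"{mechanism}: complete-case {complete_case:+.2f}, "
          f"adjusted {adjusted:+.2f}")
```

In this toy setting, dropping incomplete objects is harmless only under MCAR. Under MAR the complete-case mean is biased but the observed x is enough to repair it, while under the non-ignorable mechanism neither estimate recovers the true mean.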

Statistical research on missing data is traditionally conducted in settings where complete data are available. The goal is to characterize the inference results of missing data analysis methods by comparing results from the complete data with results obtained after dropping observations of the variables of interest. Simulations make it possible to emulate these different kinds of missingness. A practical astronomer may question such comparisons and the simulation of missing data. In real applications such a step is not necessary, but for the statistical/theoretical validation and approval of new missing data analysis methods, the comparison between results from complete data and from data with missing values is unavoidable.

Contrary to my belief that statistical analysis with missing data applies universally, it seems so far that only regression-type strategies can cope with missing data, despite the diverse categories of missing data. In multivariate data analysis in astronomy, the relationship between response variables and predictors is often unclear. More frequently, responses do not exist and the joint distribution of the given variables is of greater interest. Without knowing the data generating distribution/model, analyzing arbitrarily built models with missing data, both for imputation and for estimation, seems biased. This gap between the data types handled is the motivation for introducing statistical missing data analysis to astronomers, although the statistical strategies for handling missing data may appear very limited. I believe, however, that some “new” concepts in missing data analysis can be salvaged, such as the assumptions for analyzing data with an underlying multivariate normal distribution, which is favored by astronomers, many of whom apply principal component analysis (PCA) nowadays. Understanding the conditions for the multivariate normal distribution and missing data more rigorously would lead astronomers to project their data analysis onto the regression analysis space, since numerous survey projects and newly emerging catalogs pose questions about relationships among observed variables or estimated parameters. The broad area of regression analysis embraces missing data in various ways. Likewise, vast astronomical surveys and catalogs need to move forward by adopting proper data analysis tools that include missing data, because finding relationships among variables empirically, rather than laws of physics, is the scientific objective of surveys, and missing data are not ignorable. I think that tactics from missing data analysis will allow steps forward in astronomical data analysis and its statistical inference.

Statisticians and other scientists who use statistics might name the strategies of missing data analysis slightly differently; my way of listing the strategies described in the texts above is as follows:

complete case analysis (caveat: relatively few cases may be left for the analysis and MCAR is assumed),
available case analysis (pairwise deletion, delete selected variables. caveat: correlations in variable pairs)
single-value imputation (typically mean value is imputed, causing biased results and underestimated variance, not recommended. )
maximum likelihood, and
multiple imputation (the last two are based on two assumptions: multivariate normal and ignorable missing data mechanism)

and the following are imputation strategies:

mean substitution,
case substitution (scientific knowledge authorizes substitution),
hot deck imputation (values drawn from the most similar case, but it is difficult to define what is “similar”),
cold deck imputation (values imputed from external sources),
regression imputation (prediction with independent variables and mean imputation is a special case) and
multiple imputation
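The caveat about mean substitution can be checked numerically. The following is a minimal sketch on invented data (a linear relation with 40% of y missing completely at random); `impute_y` and `fit_line` are hypothetical helpers, and the "stochastic" variant, which adds residual noise to the regression prediction, is an addition of mine that is not in the list above. It is shown because repeating such a random draw is the basic idea behind multiple imputation, although proper multiple imputation also accounts for uncertainty in the fitted parameters.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def impute_y(xs, ys, missing, method, rng):
    """Fill missing y by 'mean', 'regression', or 'stochastic' imputation."""
    ox = [x for x, m in zip(xs, missing) if not m]
    oy = [y for y, m in zip(ys, missing) if not m]
    mean_y = statistics.fmean(oy)
    slope, intercept = fit_line(ox, oy)
    # Spread of the observed points around the fitted line.
    resid_sd = statistics.stdev(
        [y - (intercept + slope * x) for x, y in zip(ox, oy)])
    filled = []
    for x, y, m in zip(xs, ys, missing):
        if not m:
            filled.append(y)
        elif method == "mean":
            filled.append(mean_y)
        elif method == "regression":
            filled.append(intercept + slope * x)
        elif method == "stochastic":
            filled.append(intercept + slope * x + rng.gauss(0.0, resid_sd))
        else:
            raise ValueError(method)
    return filled


rng = random.Random(42)
n = 5000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]   # true variance of y is 5
missing = [rng.random() < 0.4 for _ in range(n)]    # 40% of y dropped (MCAR)

variances = {"complete data": statistics.variance(ys)}
for method in ("mean", "regression", "stochastic"):
    filled = impute_y(xs, ys, missing, method, rng)
    variances[method] = statistics.variance(filled)
for name, value in variances.items():
    print(f"{name:>13}: variance of y = {value:.2f}")
```

Mean substitution shrinks the variance of y the most, plain regression imputation shrinks it less, and only the version with added noise stays close to the variance of the complete data.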

Some might prefer the following listing (adapted from Gelman and Hill’s regression analysis book):

simple missing data approaches that retain all the data

-mean imputation
-last value carried forward
-using information from related observation
-indicator variables for missingness of categorical predictors
-indicator variables for missingness of continuous predictors
-imputation based on logical values

random imputation of a single variable
imputation of several missing variables
model based imputation
combining inferences from multiple imputation

Statistical missing data analysis acknowledges its assumptions explicitly, in contrast to subjective data processing aimed at a complete data set. I often see discrepancies between plots in astronomical journals and the linked catalogs: missing data, including outliers, reside in the catalogs, but after a subjective data cleaning step they do not appear in the plots. Statistics, on the other hand, explicitly states the assumptions and conditions on missing data. However, I don’t know what is proper or correct from a scientific viewpoint. No such explicit account exists, and judgments about assumptions on missing data and how to process them are left to astronomers. Moreover, astronomers have advantages, such as knowledge of physics, for imputing data more suitably and subtly.

Schafer and Graham described, with or without missing data, the goal of a statistical procedure should be to make valid and efficient inferences about a population of interest — not to estimate, predict, or recover missing observations nor to obtain the same results that we would have seen with complete data.

The following quote from the above web link (Y. Kim) says more.

Dealing with missing data is a fact of life, and though the source of many headaches, developments in missing data algorithms for both prediction and parameter estimation purposes are providing some relief. Still, they are no substitute for critical planning. When it comes to missing data, prevention is the best medicine.

Missing entries in astronomical catalogs are unpreventable; therefore, one needs statistically improved strategies more than ever, because as the volume of surveys and catalogs increases, the number of missing entries grows proportionally. Or the current methods using complete data (getting rid of all observations with at least one missing entry) could be the only way to go. There is more room to discuss strategies case by case, which will come in future posts. This one is already too long.

Parametric Bootstrap vs. Nonparametric Bootstrap

hlee — Thu, 11 Sep 2008 02:46:13 +0000

The following footnotes are from one of Prof. Babu’s slides, but I do not recall on which occasion he presented them.

[1] In the XSPEC package, the parametric bootstrap is the command FAKEIT, which makes a Monte Carlo simulation of a specified spectral model.
[2] XSPEC does not provide a nonparametric bootstrap capability.

Parametric Bootstrap: $$X_1^*,\ldots,X_n^* \sim F(\cdot;\theta_n)$$
Both $$\sqrt{n} \sup_x |F_n(x)-F(x;\theta_n)|$$ and $$\sqrt{n} \sup_x |F_n^*(x)-F(x;\theta_n^*)|$$ have the same limiting distribution.^[1]

Nonparametric Bootstrap: $$X_1^*,\ldots,X_n^* \sim F_n.$$
A bias correction $$B_n(x)=F_n(x)-F(x;\theta_n)$$ is needed.
Both $$\sqrt{n} \sup_x |F_n(x)-F(x;\theta_n)|$$ and $$\sqrt{n} \sup_x |F_n^*(x)-F(x;\theta_n^*)-B_n(x)|$$ have the same limiting distribution.^[2]
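The two resampling schemes can be sketched in a few lines. This is only an illustration under my own assumptions: the model F(·;θ) is taken to be a normal distribution with θ = (mean, standard deviation), the data are an invented skewed sample, and all function names are hypothetical. For the bias-corrected statistic the supremum is evaluated only just before and at the original data points, which approximates the true supremum. The code does not use XSPEC or FAKEIT.

```python
import bisect
import math
import random
import statistics


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def fit_normal(sample):
    """Maximum likelihood estimate theta = (mean, standard deviation)."""
    return statistics.fmean(sample), statistics.pstdev(sample)


def edf_minus_model(sample, grid, theta):
    """F_n(x) - F(x; theta) just before and at each grid point."""
    s = sorted(sample)
    n = len(s)
    before, at = [], []
    for g in grid:
        model = normal_cdf(g, *theta)
        before.append(bisect.bisect_left(s, g) / n - model)
        at.append(bisect.bisect_right(s, g) / n - model)
    return before, at


def sup_statistic(sample, grid, theta, bias=None):
    """sqrt(n) * max over the grid of |F_n - F(.; theta) - bias|."""
    before, at = edf_minus_model(sample, grid, theta)
    if bias is None:
        bias = ([0.0] * len(grid), [0.0] * len(grid))
    gaps = [abs(d - b) for d, b in zip(before + at, bias[0] + bias[1])]
    return math.sqrt(len(sample)) * max(gaps)


def bootstrap_p_values(data, n_boot=200, seed=0):
    """Observed statistic plus parametric and nonparametric p-values."""
    rng = random.Random(seed)
    n = len(data)
    grid = sorted(data)
    theta_n = fit_normal(data)
    observed = sup_statistic(data, grid, theta_n)
    bias_n = edf_minus_model(data, grid, theta_n)   # B_n(x) on the data grid
    parametric, nonparametric = [], []
    for _ in range(n_boot):
        x_par = [rng.gauss(*theta_n) for _ in range(n)]   # X* ~ F(.; theta_n)
        parametric.append(sup_statistic(x_par, x_par, fit_normal(x_par)))
        x_non = rng.choices(data, k=n)                    # X* ~ F_n
        nonparametric.append(
            sup_statistic(x_non, grid, fit_normal(x_non), bias_n))
    p_par = sum(s >= observed for s in parametric) / n_boot
    p_non = sum(s >= observed for s in nonparametric) / n_boot
    return observed, p_par, p_non


rng = random.Random(1)
skewed = [rng.expovariate(1.0) for _ in range(400)]   # clearly not normal
observed, p_par, p_non = bootstrap_p_values(skewed)
print(f"statistic {observed:.2f}, parametric p = {p_par:.3f}, "
      f"nonparametric p = {p_non:.3f}")
```

In both schemes θ is re-estimated on every resample, so the reference distribution reflects the fact that the parameters were fitted to the same data.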


[ArXiv] 4th week, May 2008

hlee — Sun, 01 Jun 2008 03:59:15 +0000

Eight astro-ph papers and two statistics papers are listed this week. One statistics paper discusses detecting filaments and the other talks about maximum likelihood estimation of satellite images (clouds).

[astro-ph:0805.3532] Balan and Lahav
ExoFit: Orbital Parameters of Extra-solar Planets from Radial Velocities (MCMC)
[astro-ph:0805.3983] R. G. Carlberg et al.
Clustering of supernova Ia host galaxies (Jackknife method is utilized).
[astro-ph:0805.4005] Kurek, Hrycyna, & Szydlowski
From model dynamics to oscillating dark energy parametrisation (Bayes factor)
[astro-ph:0805.4136] C. Genovese et al.
Inference for the Dark Energy Equation of State Using Type Ia Supernova Data
[math.ST:0805.4141] C. Genovese et al.
On the path density of a gradient field (detecting filaments via kernel density estimation, KDE)
[astro-ph:0805.4342] C. Espaillat et al.
Wavelet Analysis of AGN X-Ray Time Series: A QPO in 3C 273?
[astro-ph:0805.4414] Tegmark and Zaldarriaga
The Fast Fourier Transform Telescope
[astro-ph:0805.4417] A. Georgakakis et al.
A new method for determining the sensitivity of X-ray imaging observations and the X-ray number counts
[stat.AP:0805.4598] E. Anderes et al.
Maximum Likelihood Estimation of Cloud Height from Multi-Angle Satellite Imagery

[ArXiv] 2nd week, May 2008

hlee — Mon, 19 May 2008 14:42:56 +0000

There’s no particular opening remark this week. I only have a profound curiosity about the jackknife tests in [astro-ph:0805.1994]. Including this paper, a few deserve separate discussions from a statistical point of view, which shall be posted.

[astro-ph:0805.1290] R. Barnard, L. Shaw Greening, U. Kolb
A multi-coloured survey of NGC 253 with XMM-Newton: testing the methods used for creating luminosity functions from low-count data
[astro-ph:0805.1469] Philip J. Marshall et al.
Automated detection of galaxy-scale gravitational lenses in high resolution imaging data
[astro-ph:0805.1470] E. P. Kontar, E. Dickson, J. Kasparova
Low-energy cutoffs in electron spectra of solar flares: statistical survey (It is not statistically rigorous but the topic can be connected to dip tests or gap tests in statistics)
[astro-ph:0805.1936] J. Yee & B. Gaudi
Characterizing Long-Period Transiting Planets Observed by Kepler (discusses uncertainty in light curves and Fisher matrix)
[astro-ph:0805.1994] the QUaD collaboration: C. Pryke et al.
Second and third season QUaD CMB temperature and polarization power spectra (What are jackknife tests? A brief scan of the paper does not match my understanding of jackknifing. It looks closer to cross validation. Another slog topic shall come: bootstrap, cross validation, jackknife, and resampling.)
[astro-ph:0805.2121] N. Cole et al.
Maximum Likelihood Fitting of Tidal Streams With Application to the Sagittarius Dwarf Tidal Tails
[astro-ph:0805.2155] J Yoo & M Zaldarriaga
Improved estimation of cluster mass profiles from the cosmic microwave background
[astro-ph:0805.2207] A.Vikhlinin et al.
Chandra Cluster Cosmology Project II: Samples and X-ray Data Reduction (it mentions calibration uncertainty and background, can it be a reference to stacking, coadding, source detection, etc?)
[astro-ph:0805.2325] J.M. Loh
A valid and fast spatial bootstrap for correlation functions
[astro-ph:0805.2326] T. Wickramasinghe, M. Struble, J. Nieusma
Observed Bimodality of the Einstein Crossing Times of Galactic Microlensing Events

[ArXiv] 3rd week, Apr. 2008

hlee — Mon, 21 Apr 2008 01:05:55 +0000

The dichotomy of outliers: detecting outliers to be discarded or to be investigated; statistics that are robust enough not to be influenced by outliers or sensitive enough to flag an anomaly in the data distribution. Although not related, one paper about outliers made me dwell on what outliers are. This week’s topics are diverse.

[astro-ph:0804.1809] H. Khiabanian, I.P. Dell’Antonio
A Multi-Resolution Weak Lensing Mass Reconstruction Method (Maximum likelihood approach; my naive eyes sensed a certain degree of relationship to the GREAT08 CHALLENGE)
[astro-ph:0804.1909] A. Leccardi and S. Molendi
Radial temperature profiles for a large sample of galaxy clusters observed with XMM-Newton
[astro-ph:0804.1964] C. Young & P. Gallagher
Multiscale Edge Detection in the Corona
[astro-ph:0804.2387] C. Destri, H. J. de Vega, N. G. Sanchez
The CMB Quadrupole depression produced by early fast-roll inflation: MCMC analysis of WMAP and SDSS data
[astro-ph:0804.2437] P. Bielewicz, A. Riazuelo
The study of topology of the universe using multipole vectors
[astro-ph:0804.2494] S. Bhattacharya, A. Kosowsky
Systematic Errors in Sunyaev-Zeldovich Surveys of Galaxy Cluster Velocities
[astro-ph:0804.2631] M. J. Mortonson, W. Hu
Reionization constraints from five-year WMAP data
[astro-ph:0804.2645] R. Stompor et al.
Maximum Likelihood algorithm for parametric component separation in CMB experiments (separate section for calibration errors)
[astro-ph:0804.2671] Peeples, Pogge, and Stanek
Outliers from the Mass–Metallicity Relation I: A Sample of Metal-Rich Dwarf Galaxies from SDSS
[astro-ph:0804.2716] H. Moradi, P.S. Cally
Time-Distance Modelling In A Simulated Sunspot Atmosphere (discusses systematic uncertainty)
[astro-ph:0804.2761] S. Iguchi, T. Okuda
The FFX Correlator
[astro-ph:0804.2742] M Bazarghan
Automated Classification of ELODIE Stellar Spectral Library Using Probabilistic Artificial Neural Networks
[astro-ph:0804.2827] S.H. Suyu et al.
Dissecting the Gravitational Lens B1608+656: Lens Potential Reconstruction (Bayesian)

Signal Processing and Bootstrap

hlee — Wed, 30 Jan 2008 06:33:25 +0000

Astronomers have developed their ways of processing signals almost independently of, but sometimes in collaboration with, engineers, although the fundamental goal of signal processing is the same: extracting information. Doubtless, these two parallel roads of astronomers and engineers have pointed in opposite directions: one toward the sky and the other toward the earth. Nevertheless, without an intensive argument, we could say that statistics has to some extent served as the medium of signal processing for both scientists and engineers. This particular issue of IEEE Signal Processing Magazine may shed light for astronomers interested in signal processing and statistics outside the astronomical community.

IEEE Signal Processing Magazine Jul. 2007 Vol 24 Issue 4: Bootstrap methods in signal processing

This link will show the table of contents and provide links to articles; however, access to the papers requires an IEEE Xplore subscription (via libraries or individual IEEE memberships). Here, I’d like to attempt to introduce some articles and tutorials.

Special topic on bootstrap:
The guest editors (A.M. Zoubir & D.R. Iskander)^[1] open the issue with their editorial, Bootstrap Methods in Signal Processing, by providing the rationale: the Gaussian noise assumption is occasionally invalid, and the consequence is complex modeling. A practical approach has been Monte Carlo simulations, but the cost of repeating experiments is problematic. The suggested alternative is the bootstrap, which provides tools for designing detectors for various signals subject to noise or interference from unknown distributions. It is said that the bootstrap is a computer-intensive tool for answering inferential questions, and this issue serves as a set of tutorials that introduce this computationally intensive statistical method to the signal processing community.

The first tutorial is written by the two guest editors: Bootstrap Methods and Applications, which begins with a list of bootstrap methods and emphasizes their resilience. It discusses the number of bootstrap samples needed to balance simulation (Monte Carlo) error against statistical error, and the sampling methods for dependent data, with real examples. The flowchart in Fig. 9 summarizes guidelines for how to use the bootstrap methods.

The title of the second tutorial is Jackknifing Multitaper Spectrum Estimates (D.J. Thomson), which introduces the jackknife, multitaper estimates of spectra, and the application of the former to the latter with real data sets. The author gives the reason for his preference for the jackknife over the bootstrap and discusses the underlying assumptions of resampling methods.
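For readers who have not met the jackknife, here is a minimal generic sketch. It is not Thomson’s multitaper procedure: the statistic is just a sample mean, the data are invented, and the function names are hypothetical. The delete-one jackknife recomputes the statistic with each observation left out in turn, while the bootstrap recomputes it on samples drawn with replacement.

```python
import math
import random
import statistics


def jackknife_se(data, statistic):
    """Delete-one jackknife standard error of statistic(data)."""
    n = len(data)
    leave_one_out = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    center = statistics.fmean(leave_one_out)
    spread = sum((v - center) ** 2 for v in leave_one_out)
    return math.sqrt((n - 1) / n * spread)


def bootstrap_se(data, statistic, n_boot=2000, seed=0):
    """Bootstrap standard error from n_boot resamples with replacement."""
    rng = random.Random(seed)
    replicates = [statistic(rng.choices(data, k=len(data)))
                  for _ in range(n_boot)]
    return statistics.stdev(replicates)


rng = random.Random(7)
sample = [rng.gauss(10.0, 2.0) for _ in range(100)]
textbook = statistics.stdev(sample) / math.sqrt(len(sample))

print(f"textbook  s/sqrt(n): {textbook:.4f}")
print(f"jackknife          : {jackknife_se(sample, statistics.fmean):.4f}")
print(f"bootstrap          : {bootstrap_se(sample, statistics.fmean):.4f}")
```

For the sample mean the jackknife reproduces the textbook standard error exactly, which makes it a convenient sanity check before applying either method to a statistic without a closed-form error.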

Instead of listing all articles from the special issue, a few astrostatistically notable articles are chosen:

Bootstrap-Inspired Techniques in Computational Intelligence (R. Polikar) explains the bootstrap for estimating errors, the algorithms of bagging, boosting, and AdaBoost, and other bootstrap-inspired techniques in ensemble systems, with a discussion of missing data.
Bootstrap for Empirical Multifractal Analysis (H. Wendt, P. Abry & S. Jaffard) explains block bootstrap methods for dependent data, bootstrap confidence limits, and bootstrap hypothesis testing, in addition to multifractal analysis. Because I am not familiar with wavelet leaders, I quote the article’s conclusion instead of paraphrasing it:

First, besides being mathematically well-grounded with respect to multifractal analysis, wavelet leaders exhibit significantly enhanced statistical performance compared to wavelet coefficients. … Second, bootstrap procedures provide practitioners with satisfactory confidence limits and hypothesis test p-values for multifractal parameters. Third, the computationally cheap percentile method achieves already excellent performance for both confidence limits and tests.
Wild Bootstrap Test (J. Franke & S. Halim) discusses the residual-based nonparametric tests and the wild bootstrap for regression models, applicable to signal/image analysis. Their test checks the differences between two irregular signals/images.
Nonparametric Estimates of Biological Transducer Functions (D.H.Foster & K.Zychaluk) I like the part where they discuss the generalized linear model (GLM), which is useful for extending the techniques of model fitting/model estimation in astronomy beyond Gaussian errors and least squares. They also mention that the bootstrap is simpler for getting confidence intervals.
Bootstrap Particle Filtering (J.V.Candy) It is a very pleasant read on Bayesian signal processing and particle filters. It gives an overview of MCMC and state space models, and explains resampling as a remedy for the shortcomings of importance sampling in signal processing.
Compressive Sensing (R.G. Baraniuk) A lecture note that presents a new method to capture and represent compressible signals at a rate significantly below the Nyquist rate. This method employs nonadaptive linear projections that preserve the structure of the signal.

I hope this brief summary helps you select a few interesting articles.

[1] They wrote a book, the bootstrap and its application in signal processing.

[Quote] Bootstrap and MCMC

hlee — Tue, 01 Jan 2008 00:48:59 +0000

The Bootstrap and Modern Statistics Brad Efron (2000), JASA Vol. 95 (452), p. 1293-1296.

If the bootstrap is an automatic processor for frequentist inference, then MCMC is its Bayesian counterpart.

Sometime in my second year of studying statistics, I said that the bootstrap and MCMC are equivalent but reflect different streams in statistics. The response to this comment was ‘that’s nonsense.’ Although I have forgotten the details of the circumstances, I was hurt and didn’t try to prove my point. Years later, the occasion immediately resurfaced when I saw this sentence.

[ArXiv] 1st week, Nov. 2007

hlee — Fri, 02 Nov 2007 21:59:08 +0000

To be exact, the title of this posting should contain 5th week, Oct, which seems to be the week of EGRET. In addition to astro-ph papers, although they are not directly related to astrostatistics, I include a few statistics papers which may be profitable for astronomical data analysis.

[astro-ph:0710.4966]
Uncertainties of the antiproton flux from Dark Matter annihilation in comparison to the EGRET excess of diffuse gamma rays by Iris Gebauer
[astro-ph:0710.5106]
The dark connection between the Canis Major dwarf, the Monoceros ring, the gas flaring, the rotation curve and the EGRET excess of diffuse Galactic Gamma Rays by W. de Boer et al.
[astro-ph:0710.5119]
Determination of the Dark Matter profile from the EGRET excess of diffuse Galactic gamma radiation by Markus Weber
[astro-ph:0710.5171]
Systematic Bias in Cosmic Shear: Beyond the Fisher Matrix by A.Amara and A. Refregier
[astro-ph:0710.5560]
Principal Component Analysis of the Time- and Position-Dependent Point Spread Function of the Advanced Camera for Surveys by M.J. Jee et al.
[astro-ph:0710.5637]
A method of open cluster membership determination by G. Javakhishvili et al.
[stat.CO:0710.5670]
An Elegant Method for Generating Multivariate Poisson Data by I. Yahav and G.Shmueli
[astro-ph:0710.5788]
Variations in Stellar Clustering with Environment: Dispersed Star Formation and the Origin of Faint Fuzzies by B. G. Elmegreen
[math.ST:0710.5749]
On the Laplace transform of some quadratic forms and the exact distribution of the sample variance from a gamma or uniform parent distribution by T.Royen
[math.ST:0710.5797]
The Distribution of Maxima of Approximately Gaussian Random Fields by Y. Nardi, D.Siegmund and B.Yakir
[astro-ph:0711.0177]
Maximum Likelihood Method for Cross Correlations with Astrophysical Sources by R.Jansson and G. R. Farrar
[stat.ME:0711.0198]
A Geometric Approach to Confidence Sets for Ratios: Fieller’s Theorem, Generalizations, and Bootstrap by U. von Luxburg and V. H. Franz

Astrostatistics: Goodness-of-Fit and All That!

hlee — Wed, 15 Aug 2007 02:17:00 +0000

During the International X-ray Summer School, as a project presentation, I tried to explain the inadequate practice of using χ^2 statistics in astronomy. If your best fit is biased (any misidentification of a model easily causes such bias), do not use χ^2 statistics to get the 1σ error for a 68% chance of capturing the true parameter.

Later, I decided to do further investigation on that subject and this paper came along: Astrostatistics: Goodness-of-Fit and All That! by Babu and Feigelson.

First, the authors point out that the χ^2 method 1) is inappropriate when errors are non-Gaussian, 2) does not provide clear decision procedures between models with different numbers of parameters or between acceptable models, and 3) can make it difficult to obtain confidence intervals on parameters when complex correlations between the parameters are present. As a remedy, they introduce distribution-free tests, such as the Kolmogorov-Smirnov (K-S) test, the Cramer-von Mises (C-vM) test, and the Anderson-Darling (A-D) test. Among these, the K-S test is well known to astronomers, but it has been overlooked that the results from these tests become unreliable when the data come from a multivariate distribution. Furthermore, K-S tests fail when the same data set is used both for parameter estimation and for computing the empirical distribution function.

The authors propose resampling schemes to overcome the above shortcomings, presenting both parametric and nonparametric bootstrap methods, and then move on to model comparison, particularly when models are not nested. The best-fit model can be chosen among the candidate models based on their distances (e.g. the Kullback-Leibler distance) to the unknown hypothetical true model.
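To illustrate the idea of ranking non-nested models by their Kullback-Leibler distance to the truth, here is a rough sketch of my own, not the procedure of Babu and Feigelson. The distance from the true density g to a candidate f is E_g[log g] − E_g[log f], and the first term is the same for every candidate, so candidates can be ranked by an estimate of E_g[log f]. The code uses the mean maximized log-likelihood as that estimate, on invented data with two hypothetical candidates (normal and lognormal) that have the same number of parameters.

```python
import math
import random
import statistics


def normal_loglik(values, mu, sigma):
    """Log-likelihood of values under a normal(mu, sigma) model."""
    return sum(-0.5 * math.log(2.0 * math.pi * sigma ** 2)
               - (v - mu) ** 2 / (2.0 * sigma ** 2) for v in values)


def fit_and_score(data):
    """Mean maximized log-likelihood per point for two non-nested models."""
    n = len(data)
    # Model A: normal, fitted by maximum likelihood.
    mu_a, sd_a = statistics.fmean(data), statistics.pstdev(data)
    score_a = normal_loglik(data, mu_a, sd_a) / n
    # Model B: lognormal, i.e. normal in log(x) plus the Jacobian -log(x).
    logs = [math.log(v) for v in data]
    mu_b, sd_b = statistics.fmean(logs), statistics.pstdev(logs)
    score_b = (normal_loglik(logs, mu_b, sd_b) - sum(logs)) / n
    return {"normal": score_a, "lognormal": score_b}


rng = random.Random(3)
data = [rng.lognormvariate(0.0, 0.75) for _ in range(500)]   # invented sample
scores = fit_and_score(data)
best = max(scores, key=scores.get)   # larger score = smaller estimated distance

for name, value in scores.items():
    print(f"{name:>9}: mean log-likelihood = {value:.3f}")
print("preferred model:", best)
```

The plug-in score is optimistic because the same data are used for fitting and scoring, which is the kind of bias that penalized criteria or a bootstrap correction are meant to address.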