missing data

The two communities think about missing data quite differently. I tend to think that missing data carry as much information as observed data. Astronomers… I’m not sure how they think, but my impression so far is that when a value is missing in one attribute/variable of an object/observation/informant, all the other attributes of that object become useless, because the object is excluded from the scientific data analysis or model evaluation process. For example, it is hard to find any discussion of imputation in astronomical publications, or any statistical justification of how missing data are treated with respect to inference strategies. Instead, astronomers talk about incompleteness in different variables. To make this vague argument concrete, consider a catalog of multiple magnitudes. To draw a color-magnitude diagram, one needs both color and magnitude. If one attribute is missing, that star will not appear in the color-magnitude diagram, and any inference made from that diagram will not include that star. Nonetheless, one will still try to understand how the proportion of stars observed varies with color and magnitude.

I guess this cultural difference originates from the quality of the data. As for the typical size of the data sets that statisticians handle, a child could count the data points. Astronomical data sets are so large that only rounded numbers of stars in a catalog are quoted, and dropping some cases with missing data won’t affect the final results.

Introducing how statisticians handle missing data may benefit astronomers who handle small catalogs, small because of observational challenges in the survey. Such data with missing values can be put through statistically rigorous analysis instead of ad hoc procedures for obtaining complete cases, which risk throwing away many data points.

In statistics, using the information in missing data adds information in the direction that the inference method tries to recover. Even if the error bars are larger, it’s better to have them than to have nothing. My question is: what are the statistical proposals for astronomers to handle missing data? Although I would like to find such a list, I instead give a few somewhat nontechnical papers that explain the missing data types in statistics listed below, and a few statistics books/articles that statisticians often cite.

  • Data mining and the impact of missing data by M.L. Brown and J.F.Kros, Industrial Management and Data Systems (2003) Vol. 103, No. 8, pp.611-621
  • Missing Data: Our View of the State of the Art by J.L.Schafer and J.W.Graham, Psychological Methods (2002) Vol.7, No. 2, pp. 147-177
  • Missing Data, Imputation, and the Bootstrap by B. Efron, JASA (1994) Vol. 89, No. 426, p. 463-, and D.B. Rubin’s comment
  • The multiple imputation FAQ page (web) by J. Schafer
  • Statistical Analysis with Missing Data by R.J.A. Little and D.B.Rubin (2002) 2nd ed. New York: Wiley.
  • The Curse of the Missing Data (web) by Yong Kim
  • A Review of Methods for Missing Data by T.D.Pigott, Edu. Res. Eval. (2001) 7(4),pp.353-383 (survey of missing data analysis strategies and illustration with “asthma data”)

Pigott discusses missing data methods for a general audience in plain terms under the following categories: complete cases, available cases, single-value imputation, and the more recent model-based methods, namely maximum likelihood for multivariate normal data and multiple imputation. Readers craving more information should see Schafer and Graham, or the books by Schafer (1997) and Little and Rubin (2002).

Most introductory articles begin with common assumptions such as missing at random (MAR) or missing completely at random (MCAR), but these do not seem to apply to typical astronomical data sets (I don’t know exactly why yet, and I cannot provide counterexamples to prove it, but that’s what I have observed and been told). Currently, I would like to find ways to link statistical thinking about missing data and modeling to astronomical data with missing values, by discovering what their missingness properties have in common. I hope you can help me and others in such efforts. For your information, the following are short definitions of these assumptions and related terms:

  • data missing at random (MAR): missing for reasons related only to completely observed variables in the data set
  • data missing completely at random (MCAR): the complete cases are a random sample of the originally identified set of cases
  • non-ignorable missing data: the reasons for the missing observations depend on the values of the missing variables themselves
  • outliers treated as missing data
  • the assumption of an ignorable response mechanism.
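
To make the definitions above concrete, here is a minimal simulation sketch, not taken from any of the cited papers. The two-column catalog, the coefficients, the missingness probabilities, and the function names are all assumptions chosen for illustration. It compares the complete-case mean of a variable y with a regression-adjusted mean that uses the fully observed variable x, under three missingness mechanisms.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical catalog: x is always observed, y may be missing.
x = rng.normal(0.0, 1.0, n)
y = 0.8 * x + rng.normal(0.0, 0.6, n)  # population mean of y is 0

# MCAR: every row loses y with the same probability.
mcar = rng.random(n) < 0.3
# MAR: the probability of losing y depends only on the observed x.
mar = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x))
# Non-ignorable: y is lost whenever y itself falls below a threshold.
nonignorable = y < -0.5


def complete_case_mean(y, missing):
    """Mean of y over the rows where y is observed."""
    return y[~missing].mean()


def regression_adjusted_mean(x, y, missing):
    """Fit y ~ x on the observed rows, then average the fitted line over all x."""
    slope, intercept = np.polyfit(x[~missing], y[~missing], 1)
    return (intercept + slope * x).mean()


mechanisms = {"MCAR": mcar, "MAR": mar, "non-ignorable": nonignorable}
cc_means = {k: complete_case_mean(y, m) for k, m in mechanisms.items()}
adj_means = {k: regression_adjusted_mean(x, y, m) for k, m in mechanisms.items()}

for k in mechanisms:
    print(f"{k:>14}: complete-case {cc_means[k]:+.3f}, "
          f"regression-adjusted {adj_means[k]:+.3f}")
```

In this toy setup the complete-case mean should be close to the true value only under MCAR. Under MAR, using the observed x repairs the estimate. When missingness depends on y itself, neither estimate recovers the true mean, which is why that case is called non-ignorable.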

Statistical research on missing data is traditionally conducted in a setting where complete data are available: the goal is to characterize a missing data analysis method by comparing the results from the complete data with the results obtained after dropping observations on the variables of interest. Simulations make it possible to emulate the different kinds of missingness. A practical astronomer may question such comparisons and the simulation of missing data. In real applications this step is not necessary, but for the statistical/theoretical validation and approval of new missing data analysis methods, comparing results from complete data with results from data with missing values is unavoidable.

Contrary to my belief that statistical analysis with missing data applies universally, it seems that, so far, only regression-type strategies can cope with missing data, despite the diverse categories of missing data. In multivariate data analysis in astronomy, the relationship between response variables and predictors is often unclear. More frequently, there are no responses at all, and the joint distribution of the given variables is what matters. Without knowing the data-generating distribution/model, analyzing arbitrarily built models with missing data, both for imputation and for estimation, seems bound to be biased. This gap between the data types is my motivation for introducing statistical missing data analysis to astronomers, although the statistical strategies for handling missing data may look very limited to them. I believe, however, that some “new” concepts from missing data analysis can be salvaged, such as the assumptions for analyzing data with an underlying multivariate normal distribution, a model favored by astronomers, many of whom apply principal component analysis (PCA) nowadays.

A more rigorous understanding of the conditions for the multivariate normal distribution and for missing data would lead astronomers to cast their data analysis as regression analysis, since numerous survey projects and newly emerging catalogs pose questions about relationships among observed variables or estimated parameters. The broad field of regression analysis accommodates missing data in various ways. Likewise, vast astronomical surveys and catalogs need to move toward adopting proper data analysis tools that include missing data, because the scientific objective of surveys is to find relationships among variables empirically rather than from laws of physics, and the missing data are not ignorable. I think that the tactics of missing data analysis will allow steps forward in astronomical data analysis and its statistical inference.

Statisticians and other scientists who use statistics may name the strategies of missing data analysis slightly differently; my way of summarizing the strategies described in the texts above is as follows:

  • complete case analysis (caveat: relatively few cases may be left for the analysis and MCAR is assumed),
  • available case analysis (pairwise deletion, or deleting selected variables; caveat: each pair of variables is correlated over a different set of cases),
  • single-value imputation (typically the mean value is imputed, which causes biased results and underestimated variance; not recommended),
  • maximum likelihood, and
  • multiple imputation (the last two are based on two assumptions: multivariate normal and ignorable missing data mechanism)
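
The first two strategies in the list above can be sketched as follows. This is only an illustration under assumed settings: the four-column synthetic catalog, the 20% MCAR loss rate, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 4

# Hypothetical catalog: 4 correlated columns (true pairwise correlation 0.5).
cov = 0.5 * np.ones((p, p)) + 0.5 * np.eye(p)
data = rng.multivariate_normal(np.zeros(p), cov, size=n)
# Each entry is lost independently with probability 0.2 (an MCAR mechanism).
data[rng.random((n, p)) < 0.2] = np.nan

observed = ~np.isnan(data)

# Complete-case analysis: keep only the rows with no missing entry.
complete = data[observed.all(axis=1)]
corr_complete = np.corrcoef(complete, rowvar=False)

# Available-case analysis: each correlation uses every row in which
# both columns of that pair are observed.
corr_pairwise = np.eye(p)
n_pairwise = np.zeros((p, p), dtype=int)
for i in range(p):
    for j in range(p):
        both = observed[:, i] & observed[:, j]
        n_pairwise[i, j] = both.sum()
        if i != j:
            corr_pairwise[i, j] = np.corrcoef(data[both, i], data[both, j])[0, 1]

print("rows used by complete-case analysis:", len(complete))
print("rows used per pair by available-case analysis:")
print(n_pairwise)
```

Available-case analysis keeps more rows per pair, but because each entry of the correlation matrix comes from a different subset of rows, the resulting matrix is not guaranteed to be positive semidefinite.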

The following are imputation strategies:

  • mean substitution,
  • case substitution (scientific knowledge authorizes substitution),
  • hot deck imputation (values drawn from the most similar case in the same data set, with the difficulty of defining what is “similar”),
  • cold deck imputation (values imputed from external sources),
  • regression imputation (prediction from independent variables; mean imputation is a special case) and
  • multiple imputation
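
As a rough illustration of why single-value imputation distorts results, the sketch below applies mean substitution and regression imputation to a synthetic two-variable data set. The data-generating model, the 40% MCAR rate, and the variable names are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Hypothetical data: x fully observed, y missing completely at random.
x = rng.normal(0.0, 1.0, n)
y_full = 0.8 * x + rng.normal(0.0, 0.6, n)  # var(y) = 1, corr(x, y) = 0.8
missing = rng.random(n) < 0.4
y_obs = np.where(missing, np.nan, y_full)

# Mean substitution: every missing y gets the mean of the observed y.
y_mean_imp = np.where(missing, np.nanmean(y_obs), y_obs)

# Regression imputation: every missing y gets its prediction from x,
# using a line fitted to the observed rows.
slope, intercept = np.polyfit(x[~missing], y_obs[~missing], 1)
y_reg_imp = np.where(missing, intercept + slope * x, y_obs)

versions = {"full data": y_full, "mean": y_mean_imp, "regression": y_reg_imp}
variances = {k: v.var(ddof=1) for k, v in versions.items()}
correlations = {k: np.corrcoef(x, v)[0, 1] for k, v in versions.items()}

for k in versions:
    print(f"{k:>10}: var(y) = {variances[k]:.3f}, corr(x, y) = {correlations[k]:.3f}")
```

Both imputations shrink the variance of y because the imputed values carry no scatter. Mean substitution also weakens the correlation with x, while regression imputation places the imputed points exactly on the fitted line and so strengthens it. Multiple imputation addresses this by drawing several plausible values instead of one.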

Some might prefer the following listing (adapted from Gelman and Hill’s regression analysis book):

  • simple missing data approaches that retain all the data
    1. mean imputation
    2. last value carried forward
    3. using information from related observations
    4. indicator variables for missingness of categorical predictors
    5. indicator variables for missingness of continuous predictors
    6. imputation based on logical values
  • random imputation of a single variable
  • imputation of several missing variables
  • model based imputation
  • combining inferences from multiple imputation
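
The last item, combining inferences from multiple imputation, is usually done with what are known as Rubin's rules: average the estimates from the imputed data sets, and add the between-imputation variance to the average within-imputation variance. A minimal sketch follows; the function name `pool_rubin` and the input numbers are made up for illustration and do not come from any real analysis.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Combine results from m imputed data sets with Rubin's rules.

    estimates: one point estimate per imputed data set
    variances: the squared standard error of each of those estimates
    Returns the pooled estimate and its total variance.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()            # average of the m estimates
    within = variances.mean()            # average within-imputation variance
    between = estimates.var(ddof=1)      # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Made-up results from m = 3 imputed data sets (illustration only).
estimates = [1.0, 1.2, 0.8]
variances = [0.04, 0.05, 0.03]
pooled, total = pool_rubin(estimates, variances)
print(f"pooled estimate {pooled:.3f}, standard error {np.sqrt(total):.3f}")
```

The between-imputation term is what carries the extra uncertainty due to the missing values; a single imputation has no such term and therefore understates the error bars.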

Statistical missing data analysis acknowledges its assumptions explicitly, in contrast to subjective data processing aimed at a complete data set. I often see discrepancies between plots in astronomical journals and the linked catalogs: missing data, including outliers, reside in the catalogs, but after a subjective data cleaning step they do not appear in the plots. Statistics, on the other hand, spells out the assumptions and conditions of missing data explicitly. However, I don’t know what is proper or correct from a scientific viewpoint. No such explicit account exists in astronomy, and judgments about the assumptions on missing data and how to process them are left to astronomers. Moreover, astronomers have advantages, such as knowledge of physics, for imputing data more suitably and subtly.

Schafer and Graham state that, with or without missing data, the goal of a statistical procedure should be to make valid and efficient inferences about a population of interest, not to estimate, predict, or recover missing observations, nor to obtain the same results that we would have seen with complete data.

The following quote from the above web link (Y. Kim) says more.

Dealing with missing data is a fact of life, and though the source of many headaches, developments in missing data algorithms for both prediction and parameter estimation purposes are providing some relief. Still, they are no substitute for critical planning. When it comes to missing data, prevention is the best medicine.

Missing entries in astronomical catalogs are unpreventable; therefore, statistically improved strategies are needed more than ever, because as the volume of surveys and catalogs increases, the amount of missing data grows proportionally. Or the current practice of using complete data (getting rid of all observations with at least one missing entry) could be the only way to go. There is more room to discuss strategies case by case, which will come in a future post. This one is already too long.

2 Comments
  1. vlk:

    Very useful summary, Hyunsook, lots of food for thought. A couple of comments from an astronomical perspective. We usually have data missing due to thresholding of some observable (e.g., intensity — faint sources will be missed, see Eddington Bias), which seems to me to fall under none of the categories you mentioned. We also have MCAR in the time domain when an observation drops off because the object sets for the day, or because the telescope goes into Earth eclipse, etc.

    btw, wavdetect uses iterative mean imputation to determine the background under a source. Other wavelet-based astro detection algorithms (such as pwdetect) use zero imputation. (i.e., find outliers, excise them from the data, and replace with either mean of the surrounding, or with zero, and iterate until convergence.)

    10-27-2008, 12:46 pm
  2. Alex:

    In the astrostats group, we usually use a data augmentation approach for problems of incompleteness. That is, we alternate between drawing from the distribution of the missing data given the last set of parameters drawn and drawing from the distributions of the parameters given the complete (observed + missing) data. Vinay is correct in that it does not fit into the previously mentioned categories very well; data augmentation is really a Bayesian procedure, whereas your lists are more classically focused (and less focused on simulation methods). Data augmentation is essentially multiple imputation, but we typically do not use that term.

    For a nice example of the data augmentation approach to handling missing data, I would take a look at Nondas’s poster on the Log(N)-Log(S) estimation problem. It’s a nice, relatively direct application of the procedure to a problem involving incompleteness.

    10-27-2008, 6:47 pm