The AstroStat Slog » survey
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

Astroart Survey
http://hea-www.harvard.edu/AstroStat/slog/2008/astroart-survey/
Sun, 02 Nov 2008 12:42:01 +0000, by vlk

Astronomy is known for its pretty pictures, but as Joe the Astronomer would say, those pretty pictures don't make themselves. A lot of thought goes into maximizing scientific content while conveying just the right information, all discernible at a single glance. So the hardworking folks at Chandra want your help in figuring out what works and how well, and they have set up a survey at http://astroart.cfa.harvard.edu/. Take the survey; it is both interesting and challenging!

missing data
http://hea-www.harvard.edu/AstroStat/slog/2008/missing-data/
Mon, 27 Oct 2008 13:24:22 +0000, by hlee

The notion of missing data differs considerably between the two communities. I tend to think that missing data carry nearly as much information as observed data. As for astronomers, I am not sure how they think, but my impression so far is that when a value is missing in one attribute/variable of an object/observation/informant, all other attributes of that object become useless, because the object is excluded from the scientific data analysis or model evaluation process. For example, it is hard to find any discussion of imputation in astronomical publications, or any statistical justification of missing data with respect to inference strategies. Instead, astronomers talk about incompleteness within individual variables. To make this vague argument concrete, consider a catalog of multiple magnitudes. To draw a color magnitude diagram, one needs both color and magnitude. If one attribute is missing, that star will not appear in the color magnitude diagram, and any inference method based on that diagram will not include that star. Nonetheless, one will still try to understand what proportions of stars are observed at different colors and magnitudes.
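To see concretely how a single missing magnitude removes a star from a color magnitude diagram, here is a minimal Python sketch; the five-row toy catalog and the column names B_mag and V_mag are invented for illustration.

# A minimal sketch of complete-case behavior in a color-magnitude diagram.
# The toy catalog and column names (B_mag, V_mag) are hypothetical.
import numpy as np
import pandas as pd

catalog = pd.DataFrame({
    "B_mag": [15.2, 14.8, np.nan, 16.1, 15.7],
    "V_mag": [14.6, np.nan, 13.9, 15.5, 15.0],
})
catalog["B_V"] = catalog["B_mag"] - catalog["V_mag"]  # color; NaN if either band is missing

# Only stars with both magnitudes survive into the diagram; the rest vanish
# from any inference drawn from it, exactly the complete-case habit above.
cmd = catalog.dropna(subset=["B_V", "V_mag"])
print(f"{len(cmd)} of {len(catalog)} stars remain in the CMD")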

I guess this cultural difference originates in the quality and size of the data. A typical data set handled by statisticians is small enough that a child could count the points. For astronomical data, only rounded numbers of stars in a catalog are discussed, and dropping some incomplete records appears not to affect the final results.

Introducing how statisticians handle missing data may benefit astronomers who handle small catalogs because of observational challenges in the survey. Such data with missing values can then be fed into statistically rigorous analyses, instead of ad hoc procedures that keep only complete cases at the risk of throwing away many data points.

In statistics, utilizing the information in missing data strengthens the inference one is trying to make. Even if the resulting error bars are larger, having error bars is better than having none. My question is: what statistical proposals exist for astronomers to handle missing data? I could not find such a list; instead, I give a few somewhat nontechnical papers that explain the missing data types defined in statistics, along with a few books/articles that statisticians often cite.

  • Data Mining and the Impact of Missing Data by M.L. Brown and J.F. Kros, Industrial Management and Data Systems (2003), Vol. 103, No. 8, pp. 611-621
  • Missing Data: Our View of the State of the Art by J.L. Schafer and J.W. Graham, Psychological Methods (2002), Vol. 7, No. 2, pp. 147-177
  • Missing Data, Imputation, and the Bootstrap by B. Efron, JASA (1994), Vol. 89, No. 426, pp. 463-475, with D.B. Rubin's comment
  • The Multiple Imputation FAQ Page (web) by J. Schafer
  • Statistical Analysis with Missing Data by R.J.A. Little and D.B. Rubin, 2nd ed. (2002), New York: Wiley
  • The Curse of the Missing Data (web) by Yong Kim
  • A Review of Methods for Missing Data by T.D. Pigott, Educational Research and Evaluation (2001), Vol. 7, No. 4, pp. 353-383 (a survey of missing data analysis strategies, illustrated with asthma data)

Pigott discusses missing data methods for a general audience in plain terms, under the following categories: complete-case analysis, available-case analysis, single-value imputation, and the more recent model-based methods, namely maximum likelihood for multivariate normal data and multiple imputation. Readers craving more information should see Schafer and Graham, or the books by Schafer (1997) and Little and Rubin (2002).

Most introductory articles begin with common assumptions like missing at random (MAR) or missing completely at random (MCAR), but these seem not to apply to typical astronomical data sets (I don't know exactly why yet, and I cannot provide counterexamples to prove it, but that is what I have observed and been told). Currently, I would like to find ways to link statistical thinking about missing data modeling with astronomical data by discovering commonality in their missingness properties. I hope you can help me and others in such efforts. For your information, here are short definitions of these assumptions:

  • data missing at random (MAR): missing for reasons related to completely observed variables in the data set
  • data missing completely at random (MCAR): the complete cases are a random sample of the originally identified set of cases
  • non-ignorable missing data: the reasons for the missing observations depend on the values of those variables
  • outliers treated as missing data
  • the assumption of an ignorable response mechanism

Traditionally, statistical research on missing data is conducted under circumstances where complete data are available: a new missing data analysis method is characterized by comparing the inference results it produces after observations on the variables of interest have been deliberately dropped with the results obtained from the complete data. Simulations make it possible to emulate the different kinds of missingness described above. A practical astronomer may question the point of such comparisons and of simulating missing data. In real applications this step is indeed unnecessary, but for the sake of statistical/theoretical validation and approval of new missing data analysis methods, the comparison between results from complete data and from missing data is unavoidable.
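Here is a minimal sketch, under invented distributions and missingness rates, of the validation loop just described: generate complete data, impose MCAR and MAR missingness, and compare the complete-case estimate against the truth. The same estimator is fine under MCAR but biased under MAR, where missingness depends on a fully observed variable.

# A minimal simulation of the complete-vs-missing comparison; all
# distributions and rates are illustrative assumptions, not real catalog values.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
mag = rng.normal(15.0, 1.0, n)                          # "magnitude", fully observed
color = 0.5 * (mag - 15.0) + rng.normal(0.0, 0.3, n)    # "color", to be masked

# MCAR: each color is missing with the same probability, regardless of anything.
mcar_mask = rng.random(n) < 0.3

# MAR: missingness depends only on the fully observed magnitude
# (fainter objects are more likely to lack a color measurement).
p_miss = 1.0 / (1.0 + np.exp(-(mag - 15.5) * 2.0))
mar_mask = rng.random(n) < p_miss

print(f"true mean color      : {color.mean():+.3f}")
print(f"complete-case (MCAR) : {color[~mcar_mask].mean():+.3f}  (unbiased)")
print(f"complete-case (MAR)  : {color[~mar_mask].mean():+.3f}  (biased)")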

Against my belief that statistical analysis with missing data applies universally, it seems that, so far, only regression-type strategies can cope with missing data, despite the diverse categories of missingness. In many multivariate data analyses in astronomy, the relationship between response variables and predictors is not clear; more often still, there is no response variable at all, and the joint distribution of the given variables is what matters. Without knowing the data generating distribution/model, analyzing arbitrarily built models with missing data, whether for imputation or for estimation, seems biased. This gap between data types is my motivation for introducing statistical missing data analysis to astronomers, even though the statistical strategies for handling missing data may look very limited. I believe, however, that some "new" concepts in missing data analysis can be salvaged, such as the assumptions for analyzing data with an underlying multivariate normal distribution, a distribution favored by astronomers, many of whom apply principal component analysis (PCA) nowadays. Understanding the conditions for multivariate normality and missing data more rigorously would let astronomers project their data analysis onto the regression analysis space, since numerous survey projects, along with newly emerging catalogs, pose questions about relationships among observed variables or estimated parameters. The broad field of regression analysis embraces missing data in various ways; likewise, vast astronomical surveys and catalogs need to move forward by adopting proper data analysis tools that include missing data, because the scientific objective of a survey is to find relationships among variables empirically rather than from the laws of physics, and missing data are not ignorable. I think that tactics from missing data analysis will enable steps forward in astronomical data analysis and its statistical inference.

Statisticians, or other scientists utilizing statistics, might name the strategies of missing data analysis slightly differently; my way of putting the strategies described in the texts above is as follows (the first two are contrasted in the sketch after this list):

  • complete case analysis (caveat: relatively few cases may be left for the analysis, and MCAR is assumed),
  • available case analysis (pairwise deletion, deleting selected variables; caveat: correlations estimated from different subsets of cases may be mutually inconsistent),
  • single-value imputation (typically the mean value is imputed, causing biased results and underestimated variance; not recommended),
  • maximum likelihood, and
  • multiple imputation (the last two rest on two assumptions: multivariate normality and an ignorable missing data mechanism)
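As promised above, a minimal sketch contrasting the first two strategies; the three-variable data set and the missingness rates are invented, and pandas' pairwise-deletion behavior in Series.corr stands in for available-case analysis.

# Complete-case vs available-case (pairwise deletion) correlation estimates.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)
z = -0.5 * x + rng.normal(scale=0.8, size=n)
df = pd.DataFrame({"x": x, "y": y, "z": z})

# Knock out independent chunks of y and z, so fewer rows are fully observed.
df.loc[rng.random(n) < 0.4, "y"] = np.nan
df.loc[rng.random(n) < 0.4, "z"] = np.nan

complete = df.dropna()
print(f"complete cases left: {len(complete)} of {n}")
print("complete-case corr(x, y) :", round(complete["x"].corr(complete["y"]), 3))
# Series.corr silently drops NA pairs, i.e., it performs available-case
# analysis: each correlation uses whatever pairs happen to be observed.
print("available-case corr(x, y):", round(df["x"].corr(df["y"]), 3))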

and the following are imputation strategies (a regression imputation sketch follows this list):

  • mean substitution,
  • case substitution (scientific knowledge authorizes the substitution),
  • hot deck imputation (values drawn from the most similar observed case, though defining what is "similar" is difficult),
  • cold deck imputation (values drawn from an external source, such as a previous survey),
  • regression imputation (prediction from the independent variables; mean imputation is a special case), and
  • multiple imputation
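As referenced above, a minimal sketch of mean vs. regression imputation on invented data, showing the variance underestimation that makes plain mean substitution inadvisable.

# Mean imputation vs regression imputation; the "catalog" is synthetic.
import numpy as np

rng = np.random.default_rng(1)
n = 2_000
u = rng.normal(size=n)                        # fully observed predictor
v = 1.5 * u + rng.normal(scale=0.5, size=n)   # partially observed variable
miss = rng.random(n) < 0.3                    # MCAR missingness in v

v_obs = v.copy()
v_obs[miss] = np.nan

# Mean imputation: fill every hole with the observed mean.
v_mean = np.where(miss, np.nanmean(v_obs), v_obs)

# Regression imputation: regress v on u over observed cases, predict the holes.
slope, intercept = np.polyfit(u[~miss], v[~miss], 1)
v_reg = np.where(miss, slope * u + intercept, v_obs)

# Mean imputation shrinks the variance badly; regression imputation less so.
print(f"true var(v)             : {v.var():.3f}")
print(f"mean-imputed var        : {v_mean.var():.3f}  (underestimated)")
print(f"regression-imputed var  : {v_reg.var():.3f}")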

Some might prefer the following listing (adapted from Gelman and Hill's regression analysis book; a sketch of random imputation of a single variable follows the list):

  • simple missing data approaches that retain all the data
    1. mean imputation
    2. last value carried forward
    3. using information from related observations
    4. indicator variables for missingness of categorical predictors
    5. indicator variables for missingness of continuous predictors
    6. imputation based on logical rules
  • random imputation of a single variable
  • imputation of several missing variables
  • model based imputation
  • combining inferences from multiple imputation
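As referenced above, a minimal sketch of random imputation of a single variable, with an invented flux column: each hole is filled with a random draw from the observed values, which, unlike mean imputation, preserves the spread of the distribution.

# Random imputation of a single variable; the flux data are simulated.
import numpy as np

rng = np.random.default_rng(7)
flux = rng.lognormal(mean=1.0, sigma=0.5, size=1_000)
missing = rng.random(flux.size) < 0.2
flux_obs = np.where(missing, np.nan, flux)

observed_pool = flux_obs[~np.isnan(flux_obs)]
flux_imputed = flux_obs.copy()
# Draw replacements at random from the observed values.
flux_imputed[np.isnan(flux_imputed)] = rng.choice(observed_pool, size=missing.sum())

print(f"observed std: {observed_pool.std():.3f}, imputed-data std: {flux_imputed.std():.3f}")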

Compared to subjective data processing toward a complete data set, statistical missing data analysis forces its assumptions to be acknowledged explicitly. I often see discrepancies between plots in astronomical journals and the linked catalogs, where missing data (including outliers) reside but, through subjective data cleaning steps, never appear in the plots. Statistics, on the other hand, states its assumptions and conditions on missing data exhaustively. Still, I do not know what is proper or correct from the scientific viewpoint; no such explication exists, and judgments about the assumptions on missing data, and about how to process them, are left to astronomers. Moreover, astronomers have advantages, such as knowledge of the physics, for imputing data more suitably and subtly.

Schafer and Graham write that, with or without missing data, the goal of a statistical procedure should be to make valid and efficient inferences about a population of interest, not to estimate, predict, or recover missing observations, nor to obtain the same results that we would have seen with complete data.

The following quote from the web link above (Y. Kim) says more.

Dealing with missing data is a fact of life, and though the source of many headaches, developments in missing data algorithms for both prediction and parameter estimation purposes are providing some relief. Still, they are no substitute for critical planning. When it comes to missing data, prevention is the best medicine.

Missing entries in astronomical catalogs are unpreventable; therefore, statistically sound strategies are needed more than ever, since as the volume of surveys and catalogs increases, the number of missing entries grows in proportion. Otherwise, the current practice of using only complete data (discarding every observation with at least one missing entry) remains the only way to go. There is more room to discuss strategies case by case, which will come in future posts; this one is already too long.

survey and design of experiments
http://hea-www.harvard.edu/AstroStat/slog/2008/survey-and-design-of-experiments/
Wed, 01 Oct 2008 20:16:24 +0000, by hlee

People with experience would say very different and wiser things than what I am about to discuss. This post merely joins two small cross sections, one from each branch of two trees, astronomy and statistics.

When it comes to surveys, the first thing that comes to my mind is the census packet. Although I have only seen it once (an easy way to disguise my age, but it is true), the questionnaire layout was done so carefully and extensively that it left a strong impression on me. Such a survey is designed prior to collecting data, so that after collection the data can be analyzed with statistical methodology suited to the design of the survey. Strategies for quantifying responses are also included (yes/no as 0/1, responses on a 0 to 10 scale, bracketed salaries, age groups, handling of missing data, and so on) to allow elaborate statistical analysis while avoiding subjective data transformations and arbitrary outlier eliminations.

In contrast, a survey in astronomy means designing a mesh, not questionnaires, and that mesh cannot be transcribed into statistical models. The mesh has multiple layers (telescope, detector, and source detection algorithm) and eventually produces a catalog; collecting whatever passes through the mesh is an astronomical survey. Designing the statistical methodology that would draw interpretable conclusions is not part of it. Analyzing the resulting catalog does not necessarily involve sophisticated statistics; often it amounts to chi-square fittings and the casting away of unpleasant/uninteresting data points.

As with other conflicts in jargon (the simplest example is H0, which I used to know as the Hubble constant but now recognize first as the notation for a null hypothesis), survey is one of the overloaded terms. As with measurement error, some clarification of the term survey is expected from knowledgeable astrostatisticians, to draw more statisticians into the grand survey projects soon to come. Luckily, the first opportunity will arrive soon, at the Special Session: Meaning from Surveys and Population Studies: BYOQ during the 213th AAS meeting, at Long Beach, California, on Jan. 5th, 2009.

Classification and Clustering
http://hea-www.harvard.edu/AstroStat/slog/2008/classification-and-clusterin/
Thu, 18 Sep 2008 23:48:43 +0000, by hlee

Another conclusion deduced from reading preprints listed on arxiv/astro-ph is that astronomers tend to confuse classification with clustering and to mix up the methodologies. They tend to think that any algorithm from classification or clustering analysis serves their purpose, since both kinds of algorithms, no matter what, look like black boxes. By a black box I mean something like a neural network, which is one of the classification algorithms.

Simply put, classification is a regression problem and clustering is a mixture problem with an unknown number of components. Defining a classifier, i.e., a regression model, is the objective of classification, and determining the number of clusters is the objective of clustering. In classification, predefined classes exist, such as galaxy types and star types, and one wishes to know which predictor variables, and what function of them, allow quasars to be separated from stars using only a handful of variables from photometric data, without individual spectroscopic observations. In clustering analysis, there is no predefined class, but some plots suggest multiple populations, and one wishes to determine the number of clusters mathematically, rather than subjectively concluding that the plot shows two clusters after some subjective data cleaning. A good example: as photons from gamma-ray bursts accumulated, extracting features like T_{90} and T_{50} enabled scatter plots of many GRBs, which eventually led people to believe there are multiple populations of GRBs. Clustering algorithms back this hypothesis in a more objective manner, as opposed to the subjective manner of scatter plots with non-statistical outlier elimination.
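To make the GRB example concrete, here is a minimal sketch of choosing the number of clusters objectively: fit Gaussian mixtures with varying numbers of components and pick the one minimizing BIC. It assumes scikit-learn is available, and the two-population log-duration data are simulated, not real burst measurements.

# Choosing the number of clusters via Gaussian mixtures + BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Toy log10(T90) values: a "short" and a "long" population.
log_t90 = np.concatenate([
    rng.normal(-0.5, 0.4, 300),   # short bursts
    rng.normal(1.4, 0.45, 700),   # long bursts
]).reshape(-1, 1)

bics = {}
for k in range(1, 5):
    gm = GaussianMixture(n_components=k, random_state=0).fit(log_t90)
    bics[k] = gm.bic(log_t90)

best_k = min(bics, key=bics.get)
print("BIC by k:", {k: round(v, 1) for k, v in bics.items()})
print("preferred number of clusters:", best_k)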

However, there are challenges in making a clean cut between classification and clustering, both in statistics and in astronomy. In statistics, missing data is the phrase used to describe this challenge, and fortunately there is a field called semi-supervised learning that tackles it. (Supervised learning is equivalent to classification, and unsupervised learning to clustering.) Semi-supervised learning algorithms apply to data in which a portion of the objects have known class labels and the rest are missing; astronomical catalogs with unidentified objects are good candidates for semi-supervised learning algorithms.
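A minimal sketch of the semi-supervised setting, assuming scikit-learn is available: only a few objects carry labels (scikit-learn marks unlabeled points with -1), and the algorithm propagates labels to the rest. The two-class "color-color" features are simulated stand-ins for photometric measurements.

# Semi-supervised label propagation on a partially labeled toy catalog.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(5)
# Two toy classes in a 2-D "color-color" space.
X = np.vstack([rng.normal(0.0, 0.5, (200, 2)), rng.normal(2.0, 0.5, (200, 2))])
y_true = np.array([0] * 200 + [1] * 200)

# Pretend only 5% of the objects were spectroscopically identified.
y = np.full_like(y_true, -1)
labeled = rng.choice(len(y), size=20, replace=False)
y[labeled] = y_true[labeled]

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
acc = (model.transduction_ == y_true).mean()
print(f"labels recovered for unlabeled objects, overall accuracy: {acc:.2f}")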

On the astronomy side, the fact that classes are not well defined, or are subjective, is the main cause of this confusion between classification and clustering, and also the origin of the challenge. For example, will astronomers A and B produce the same results when classifying galaxies according to Hubble's tuning fork?[1] We are not testing individual cognitive skills. Is there a consensus on where to make the cut between F9 stars and G0 stars? What makes a star F9.5 instead of G0? In the presence of error bars, how can one be sure that a star is F9 and not G0? I do not see any decision-theoretic explanation in survey papers when stellar spectral classes are presented. Classification is generally for data with categorical responses, but astronomers tend to turn what used to be categorical into something continuous, while still applying the same old classification algorithms designed for categorical responses.

From a clustering analysis perspective, the challenge is caused by outliers, peculiar objects that do not belong to the majority. The number of such peculiar objects may be large enough to make up a new, unprecedented class; or it may be so small that a strong belief prevails that these data points should be discarded as observational mistakes. How much can we trim data whose contamination is unavoidable and uncontrollable (remember, we cannot control astronomical data as we can earthly kinds)? And what primarily determines the number of clusters: physics, statistics, astronomers' experience in processing and cleaning data, ...?

Once the ambiguity between classification and clustering and the complexity of the data sets are resolved, another challenge is still waiting: which black box? For most classification algorithms, Pattern Recognition and Machine Learning by C. Bishop offers a broad spectrum of black boxes. Yet the book does not include the various clustering algorithms that statisticians have developed, nor outlier detection. To become more rigorous in selecting a black box for clustering analysis and outlier detection, one is recommended to look into the statistics literature on those topics as well.

It seems to me that astronomers tend to be in haste, owing to the pressure to publish results immediately after a data release, and so overlook methodologies suitable for their survey data. Apparently there is no time to consult machine learning specialists to verify the approaches they adopt. My personal prayer is that this haste does not settle in as a trend in astronomical surveys and large-scale data analysis.

  1. Check out the project, GALAXY ZOO
[Book] pattern recognition and machine learning
http://hea-www.harvard.edu/AstroStat/slog/2008/pml/
Tue, 16 Sep 2008 19:20:43 +0000, by hlee

A nice book by Christopher Bishop.
While reading abstracts and papers from astro-ph, I have seen many applications of algorithms from pattern recognition and machine learning (PRML). Their frequency will only increase as large scale survey projects multiply, so recommending a good textbook or reference in the field seems timely.

Survey and population studies generally produce large data sets. Any discussion of individual objects from such a survey is an indication that those objects are outliers with respect to the rest of the catalog created by the survey; these outliers deserve strong spotlights, in contrast to the notion that outliers are useless. Beyond studies of outliers, survey and population studies generally involve machine learning and pattern recognition, or supervised and unsupervised learning, or classification and clustering, or statistical learning. Whatever jargon you choose, the book overviews the most popular machine learning methods extensively, with examples, nice illustrations, and concise math. Once you understand the characteristics of your catalog, such as dimensions, sample size, independent and dependent variables, missing values, sampling (volume limited, magnitude limited, incompleteness), measurement errors, scatter plots, and so on, the book can offer proper approaches, matched to your data analysis objective, for the second step of summarizing the large data set as a whole in a statistical sense.

Click here to access the book website for various resources, including a few book chapters, retailer links, examples, and solutions. There is also a review you can check.

A lesson from reading arxiv/astro-ph over the past year is that astronomers must become interdisciplinary, particularly those working on surveys and creating catalogs. From the information retrieval viewpoint, some rudimentary education in pattern recognition and machine learning is a must, just as I personally think basic statistics and probability theory should be offered to young astronomers (like the astrostatistics summer school at Penn State). While attending graduate school, I saw majors of all kinds taking statistics classes, except students from astronomy or physics. To test this observation, I took computational physics to learn how astronomers and physicists handle data with uncertainty. Although it was one of my favorite classes, the course was quite far from statistics (game theory was the most statistically relevant subject). Hence, I suspect that not many astronomy departments offer practical statistics or machine learning courses, and therefore recommending good, modern textbooks on (statistical) data analysis can benefit self-teaching astronomers. I hope my reasoning is on the right track.

working together to tackle hard problems in astronomy
http://hea-www.harvard.edu/AstroStat/slog/2008/working-together-to-tackle-hard-problems-in-astronomy/
Fri, 01 Feb 2008 17:45:04 +0000, by hlee

This is an edited copy of a colloquium announcement email from Tufts University, MA. A must-go for those who live in Medford and Somerville, where Tufts University is located, and in their vicinity.

Subject : Special Joint CS and Physics Colloquium
Title : How Astronomers, Computer Scientists and Statisticians are working together to tackle hard problems in astronomy
Speaker: Pavlos Protopapas
Date : Thursday February 7
Time : 3:15 pm
Place : Nelson Auditorium, Anderson Hall (Click for the map, 200 College Ave, Medford, MA, I think)
Abstract:
New astronomical surveys such as Pan-STARRS and LSST are under development and will collect petabytes of data. These surveys will image large areas of sky repeatedly to great depth, and will detect vast numbers of moving, variably bright, and transient objects. The data product of these surveys is a series of observations taken over time, or light-curves.

The IIC has established an inter-disciplinary Center for Time Series with an immediate focus on astronomy. I will present three research topics currently being pursued at the IIC that require expertise from astronomy, computer science and statistics. These are: identifying novel astronomical phenomena in large light-curve datasets, searching for rare phenomena such as extra-solar planets, and efficiently searching for significant events such as occultations of stars by small objects in the outer reaches of our solar system.

Pavlos Protopapas is a senior scientist at the IIC and the Harvard-Smithsonian Center for Astrophysics. His research interests span the outer solar system, extra-solar planets, and gravitational lensing. He specializes in analyzing large collections of astronomical data, with a toolbox drawn from data mining, computer science, and statistics.

[ArXiv] Geneva-Copenhagen Survey, July 13, 2007
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-geneva-copenhagen-survey/
Sun, 05 Aug 2007 05:25:45 +0000, by hlee

From arxiv/astro-ph:0707.1891v1
The Geneva-Copenhagen Survey of the Solar neighborhood II. New uvby calibrations and rediscussion of stellar ages, the G dwarf problem, age-metallicity diagram, and heating mechanisms of the disk by Holmberg, Nordstrom, and Andersen

Researchers, including scientists from CHASC, working on color magnitude diagrams to infer ages, metallicities, temperatures, and other physical quantities of stars and stellar clusters may find this paper useful.

Methodologies for temperature calibration (a fairly accurate estimate from V-K and a new calibration from b-y), metallicity calibration, absolute magnitude/distance calibration, interstellar reddening, and stellar ages are presented, along with reviews of stellar models and their parameters, astrophysical calibration errors, the metallicity distribution function, the age-metallicity diagram, the age-velocity relation, and the thin disk vs. thick disk question. It appears that the previous methodologies for F and G stars need to be revised.

Despite my inability to fully understand the theory of star formation history and the uncertainties of the calibrations (they all look like regression problems to me), this paper fully manifests the complexity of the stellar models and their calibration process. From a statistical perspective, that complexity comes from many predictors and only a few response variables, all with uncertainties (which, moreover, are heteroskedastic). Furthermore, the relationship between predictors and response variables is only sparsely known, which makes fitting the model to a star or a stellar cluster, or inferring physical information from them, difficult. The mapping should be considered a highly structured black box, and it requires careful investigation.
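Since the calibrations read like regression problems with heteroskedastic errors, here is a minimal weighted least squares sketch; the linear "calibration relation" and the error model are invented for illustration, not taken from the paper.

# Weighted least squares with per-point (heteroskedastic) uncertainties.
import numpy as np

rng = np.random.default_rng(11)
n = 500
x = rng.uniform(0.2, 1.5, n)                # e.g., a color index
sigma = 0.02 + 0.1 * x                      # errors grow with x: heteroskedastic
y = 3.0 + 2.0 * x + rng.normal(0.0, sigma)  # e.g., a calibrated quantity

# np.polyfit accepts weights w = 1/sigma for Gaussian uncertainties.
coeffs_wls = np.polyfit(x, y, 1, w=1.0 / sigma)
coeffs_ols = np.polyfit(x, y, 1)

print("WLS slope, intercept:", np.round(coeffs_wls, 3))
print("OLS slope, intercept:", np.round(coeffs_ols, 3))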

I'd rather end this write-up of a very technical preprint by citing a sentence:

The question of interest is therefore how well these relations and their intrinsic scatter can be determined from the observations

[hlee: Instead of determining these relations, modeling them seems to better reflect the flexibilities and uncertainties.]

Photometric Redshifts
http://hea-www.harvard.edu/AstroStat/slog/2007/photometric-redshifts/
Wed, 25 Jul 2007 06:28:40 +0000, by hlee

Since I began subscribing to arxiv/astro-ph abstracts, one of the most frequent topics from an astrostatistical point of view has been photometric redshifts. The topic has become popular as catalogs of remote photometric observations multiply in volume and multi-band sky survey projects lead toward virtual observatories (VO; to be discussed in a later posting). Simply searching for photometric redshifts in Google Scholar and arxiv.org turns up more than 2000 articles since 2000.

Quantifying redshift is one of the key astronomical measurements: it identifies the type of an object as well as providing its distance. Typically, measuring redshifts requires spectral data, which are quite expensive in many respects compared to photometric data. Let me explain briefly what spectral data and photometric data are, to aid the understanding of non-astronomers.

Collecting photometric data starts with taking pictures through different filters. Through blue, yellow, or red optical filters, or infrared, ultraviolet, or X-ray filters, objects look different (have different light intensities), and various astronomical objects can be identified by investigating pictures from many filter combinations. Collecting spectral data, on the other hand, starts with dispersing light through a specially designed prism. Because of this dispersion, it takes longer to collect light from an object, and fewer objects are recorded on a picture plate than with photometric data. The nice feature of these expensive spectral data is that they provide the physical conditions of the object directly: first, the distance, from the relative shifts of spectral lines; second, the abundance (the metallic composition of the object), temperature, and type of the object, also from spectral lines. Therefore, using photometric data to infer quantities normally available only from spectral data is a very attractive topic in astronomy.

However, there are many challenges. The massive volume of data and sampling biases*, like the Malmquist bias (wiki) and the Lutz-Kelker bias, hinder traditional regression techniques, so numerous statistical and machine learning methods have been introduced to make the most of these photometric data for inferring distances economically and quickly (a small regression sketch follows the footnote).

* For a reference regarding these biases and astronomical distances, please check Distance Estimation in Cosmology by Hendry, M. A. and Simmons, J. F. L., Vistas in Astronomy, Vol. 39, No. 3, pp. 297-314.
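As referenced above, a minimal sketch of photometric redshift estimation as a regression problem, assuming scikit-learn is available; the colors and "true" redshifts below are simulated, and a real analysis would train on a spectroscopic sample and confront the selection biases just named.

# Photometric redshift estimation as regression, on simulated colors.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(13)
n = 3_000
z = rng.uniform(0.0, 2.0, n)   # toy redshifts
# Toy colors: smooth functions of z plus photometric noise.
colors = np.column_stack([
    0.8 * z + rng.normal(0, 0.1, n),
    np.sin(z) + rng.normal(0, 0.1, n),
    0.3 * z**2 + rng.normal(0, 0.1, n),
])

X_train, X_test, z_train, z_test = train_test_split(colors, z, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, z_train)
z_pred = model.predict(X_test)
print(f"rms photo-z error: {np.sqrt(np.mean((z_pred - z_test) ** 2)):.3f}")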

[ArXiv] Bayesian Star Formation Study, July 13, 2007
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-bayesian-star-formation-study/
Mon, 16 Jul 2007 19:31:13 +0000, by hlee

From arxiv/astro-ph:0707.2064v1
Star Formation via the Little Guy: A Bayesian Study of Ultracool Dwarf Imaging Surveys for Companions by P. R. Allen

I will skip all the technical details on ultracool dwarfs and binary stars, and the reviews of star formation studies (such as the initial mass function, IMF) and astronomical survey studies, all of which Allen explains fairly in arxiv/astro-ph:0707.2064v1. I want to emphasize instead that, based on simple Bayes' rule and careful set-ups for the likelihoods and priors according to the data (ultracool dwarfs), quite informative conclusions were drawn:

  1. the peak at q ~ 1 is significant,
  2. there is a lack of companions at separations greater than 15-20 A.U. (the unit is the distance between the Sun and the Earth),
  3. binaries are rarer among later spectral types,
  4. undetected low mass ratio systems are inconsistent with the current data, and
  5. 30% of spectroscopic binaries are from ultracool binaries.

Before asking for observational efforts toward improvements, 75% is given as the upper limit on the ultracool binary population.
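To illustrate the Bayesian logic (this is a toy sketch, not Allen's actual analysis), here is a grid posterior for a binary fraction f from k detected companions among n targets, using a binomial likelihood, a flat prior, and a 95% credible upper limit; the counts k and n are invented.

# Toy grid posterior and credible upper limit for a binary fraction.
import numpy as np
from scipy.stats import binom

k, n = 6, 50   # hypothetical detections and targets
f_grid = np.linspace(0.0, 1.0, 1001)
posterior = binom.pmf(k, n, f_grid)        # flat prior => posterior ∝ likelihood
posterior /= np.trapz(posterior, f_grid)   # normalize on the grid

cdf = np.cumsum(posterior) * (f_grid[1] - f_grid[0])
upper95 = f_grid[np.searchsorted(cdf, 0.95)]
print(f"95% upper limit on the binary fraction: {upper95:.2f}")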

[ArXiv] Spectroscopic Survey, June 29, 2007
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-spectroscopic-survey-june-29-2007/
Mon, 02 Jul 2007 22:07:39 +0000, by hlee

From arXiv/astro-ph:0706.4484

Spectroscopic Surveys: Present by Yip, C. overviews recent spectroscopic sky surveys and spectral analysis techniques on the way toward Virtual Observatories (VO). Besides spectroscopic redshift measurements increasing like Moore's law, the surveys tend to go deeper and to aim for completeness. Elliptical galaxy formation has been studied the most, ellipticals being more abundant than spirals in these samples, and the galactic bimodality in color-color or color-magnitude diagrams is attributed to gas-rich mergers, with blue mergers forming the red sequence. Principal component analysis has incorporated ratios of emission line strengths for classifying Type II AGN and star-forming galaxies. Lyα identifies high-z quasars, and other spectral patterns across z reveal the history of the early universe and the characteristics of quasars. The recent discovery of 10 satellites of the Milky Way is also mentioned.

Spectral analyses take two approaches. One is the model-based approach, which fits theoretical templates; it is known for its flaws, but the extraction of physical parameters is straightforward. The other is the empirical approach, useful for making discoveries but difficult to interpret. Neither has a substantial advantage over the other. When it comes to fitting, chi-square minimization has been dominant, but new methodologies are under development. For spectral classification problems, principal component analysis (the Karhunen-Loeve transformation), artificial neural networks, and other machine learning techniques have been applied, as in the sketch below.
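A minimal sketch of PCA (the Karhunen-Loeve transformation) applied to spectra, assuming scikit-learn is available; the "spectra" are synthetic continua plus one emission line of variable strength, and the component coefficients serve as classification features.

# PCA on synthetic spectra; PC coefficients separate two toy populations.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(17)
wavelengths = np.linspace(4000.0, 7000.0, 300)

def toy_spectrum(line_strength):
    """Continuum plus one emission line of variable strength, plus noise."""
    continuum = 1.0 + 1e-4 * (wavelengths - 4000.0)
    line = line_strength * np.exp(-0.5 * ((wavelengths - 5007.0) / 15.0) ** 2)
    return continuum + line + rng.normal(0.0, 0.05, wavelengths.size)

# Two populations: weak-line vs strong-line objects (labels of convenience).
spectra = np.array([toy_spectrum(s) for s in
                    np.concatenate([rng.uniform(0.1, 0.5, 100),
                                    rng.uniform(1.5, 3.0, 100)])])

pca = PCA(n_components=3).fit(spectra)
coeffs = pca.transform(spectra)
print("explained variance ratios:", np.round(pca.explained_variance_ratio_, 3))
# The first component's coefficient already separates the two populations:
print("mean PC1, weak lines  :", coeffs[:100, 0].mean().round(2))
print("mean PC1, strong lines:", coeffs[100:, 0].mean().round(2))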

In the end, the author lists the statistical and astrophysical challenges posed by today's massive spectroscopic data: 1. modeling galaxies; 2. parameterizing star formation history; 3. modeling quasars; 4. multi-catalog based calibration (separating systematic and statistical errors); 5. estimating parameters. Progress on these would benefit the VO, whose objective is the unification of data access.
