# Topics in Astrostatistics

## Statistics 310, Harvard University / Statistics 281, University of California, Irvine

### Fall/Winter/Spring 2009-2010

##### www.courses.fas.harvard.edu/~stat310/

Instructor: Prof. Meng Xiao Li
Schedule: Tuesdays 11:30 AM
Location: CfA 60 Garden P-226 (Tea Room)

Presentations
Alanna Connors
(with Aneta and Vinay)
08 Sep 2009
Introduction to Astronomy for Statisticians
[.pdf]
Movies:
Full Sun rotating (Hinode/XRT) [.mov]
flaring loops (Hinode/XRT) [.mov]
transit of Mercury (Hinode/XRT) [.mov]
Gamma-ray sky (Fermi) [.m4v]
Black Hole at center of Milky Way (ESO/VLT) [.m4v]
Discovery of Kuiper Belt object (APOD) [.gif]

Brandon Kelly (CfA)
06 Oct 2009
Hierarchical modeling of astronomical images and uncertainty in truncated data sets
Abstract: I will discuss two astronomical problems that I am working on, and the statistical issues surrounding them. The first involves the analysis of brightness images of astronomical objects with the goal of recovering an 'image' of the physical properties of these objects. Here, the primary goal is to infer from the brightness images how the physical properties of the objects are correlated and spatially distributed. The analysis is complicated by a two-level error structure, having both additive and multiplicative errors, making some of the model parameters nearly degenerate. The second problem involves density estimation of a truncated data set, where the truncation arises due to a data selection efficiency that varies with an astronomical object's brightness (e.g., fainter things are more difficult to detect). When the selection efficiency is known, the analysis is straightforward. However, when the selection efficiency has some uncertainty in it, the likelihood function or posterior distribution can be unstable. Currently, there do not appear to be methods for accounting for uncertainty in the data selection efficiency.
Presentation [.pdf]
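
As a minimal sketch of the second problem, and not necessarily the formulation used in the talk, a standard likelihood for a truncated sample can be written as follows. The symbols are assumptions introduced here for illustration: $f(x \mid \theta)$ is the density of brightness $x$, and $S(x)$ is the selection efficiency, i.e., the probability that an object of brightness $x$ enters the sample.

```latex
\[
L(\theta) \;=\; \prod_{i=1}^{n}
  \frac{S(x_i)\, f(x_i \mid \theta)}
       {\int S(x)\, f(x \mid \theta)\, dx}
\]
```

The normalizing integral depends on $S$ over the whole brightness range, which suggests why uncertainty in the selection efficiency can propagate into the likelihood in an unstable way.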

Nathan Stein, Paul Baines
20 Oct 2009
Markov Chain Monte Carlo Methods for Fitting Computer Models for Stellar Evolution (in three parts)
Nathan Stein (Dept. of Statistics, Harvard U)
Abstract: Bayesian analysis of the evolution of star clusters presents several computational challenges. Because physics-based models of stellar evolution are implemented as computer models and are not available in closed form, none of the conditional posterior distributions are traditional named distributions. Moreover, the posterior distributions of interest are high dimensional, strongly correlated, and often multimodal. Markov chain Monte Carlo algorithms can generate samples from these posterior distributions, but creating reasonably efficient sampling algorithms requires advanced techniques.
slides [.pdf]
Paul Baines (Dept. of Statistics, Harvard U)
Abstract: The analysis of photometric data for stellar clusters provides an example of both the statistical and computational challenges present in many Astronomy applications. Typically, properties of stellar clusters are estimated using Color-Magnitude Diagrams, whereby the observed data are often simply compared to what one would expect under a theoretical mapping from the model parameters to the observed data. This mapping is determined by a set of isochrone tables, listing the expected photometric measurements for a given set of input parameters.
To address many of the substantive questions in a coherent statistical manner, we present a flexible hierarchical Bayesian model for the analysis of stellar populations. The computation for the model is done via Markov Chain Monte Carlo (MCMC), the standard tool of choice for Bayesian computation. Both the complex dependence structure and the peculiar nature of the isochrone mapping, however, present a formidable challenge to standard statistical computation methods. In the spirit of the Ancillary Sufficient Interweaving Scheme (ASIS) of Yu & Meng (presented in a separate talk by Yu), we show how competing parameterizations can be constructed and combined to help overcome the weaknesses of individual schemes, and drastically improve the efficiency and reliability of the computation.
slides [.pdf]

David Stenning / Jin Xu / Alex Blocker
3 Nov 2009
David Stenning (UCI)
Automatic Classification of Sunspot Groups Using SOHO/MDI Magnetogram and White-Light Images
Abstract: Sunspot groups are classified into four types: alpha, beta, beta-gamma, and beta-gamma-delta. Currently, most sunspot group classification is done manually by experts. This is a lengthy, labor-intensive, and somewhat subjective process, creating the need for an automatic and accurate procedure. We intend to use SOHO/MDI magnetogram and white-light images to detect and classify sunspots into the appropriate group. The first step in this process involves the extraction of white-light data that correspond to the magnetogram images we have available. I will discuss the progress I have made so far and address questions regarding the automation of the extraction routine.
presentation slides [.pdf]

Jin Xu (UCI)
Solar DEMs
Abstract: The wavelength distribution of light emitted from different regions of the sun contains clues as to how the composition and temperature of the sun vary across its surface. Decoding this information requires sophisticated statistical techniques and detailed quantum physical calculations. The data consist of images of the sun that record its intensity in each of a number of wavelength bands. This "talk" will consist of a conversation about how best to formulate the model to leverage both the data and quantum physics to best understand composition and temperature images of the sun.
presentation slides [.pdf]

Alex Blocker (HU)
Event Detection in Time Series Databases with Robust Wavelet Model
presentation slides [.pdf]

Paul Baines, Yaming Yu
17 Nov 2009
Markov Chain Monte Carlo Methods for Fitting Computer Models for Stellar Evolution (part two)
Paul Baines (Dept. of Statistics, Harvard U)
contd.

Yaming Yu (Dept. of Statistics, UCI)
Abstract: The importance of a good parameterization for efficient MCMC implementation has been repeatedly emphasized in the literature. For a broad class of multi-level models, there exist two well-known competing parameterizations: the centered parameterization and the non-centered parameterization. We describe a surprisingly general and powerful strategy for boosting MCMC efficiency by simply interweaving (but not alternating) the two parameterizations. A Poisson time series model for detecting changes in source intensity of photon counts is used to illustrate the effectiveness of this strategy.
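
The interweaving idea can be sketched on a toy Gaussian model. This is an illustration under stated assumptions, not the Poisson time series model of the talk: the model, the function name `asis_toy_sampler`, and all variable names are hypothetical. Assume `y_obs | y_mis ~ N(y_mis, 1)`, `y_mis | theta ~ N(theta, V)`, and a flat prior on `theta`. The centered scheme conditions on `y_mis`; the non-centered scheme conditions on `y_tilde = y_mis - theta`. Each iteration updates `theta` under the centered scheme, re-expresses the state in the non-centered scheme, and updates `theta` again.

```python
import random
import statistics


def asis_toy_sampler(y_obs, V, n_draws, seed=0):
    """Interweave centered and non-centered updates for a toy Gaussian model.

    Assumed model (illustrative only):
        y_obs | y_mis ~ N(y_mis, 1)
        y_mis | theta ~ N(theta, V)
        flat prior on theta
    Under this model the marginal posterior is theta | y_obs ~ N(y_obs, 1 + V).
    """
    rng = random.Random(seed)
    theta = 0.0
    draws = []
    for _ in range(n_draws):
        # Centered step 1: draw y_mis given theta and y_obs.
        cond_mean = (V * y_obs + theta) / (V + 1.0)
        cond_sd = (V / (V + 1.0)) ** 0.5
        y_mis = rng.gauss(cond_mean, cond_sd)

        # Centered step 2: draw theta given y_mis (flat prior).
        theta = rng.gauss(y_mis, V ** 0.5)

        # Re-express the current state in the non-centered scheme.
        y_tilde = y_mis - theta

        # Non-centered step: since y_obs = y_tilde + theta + noise,
        # theta | y_tilde, y_obs ~ N(y_obs - y_tilde, 1).
        theta = rng.gauss(y_obs - y_tilde, 1.0)

        # y_mis is redrawn at the start of the next iteration,
        # so it does not need to be mapped back here.
        draws.append(theta)
    return draws


if __name__ == "__main__":
    draws = asis_toy_sampler(y_obs=1.5, V=4.0, n_draws=20000)
    # Sample mean and variance should be close to y_obs and 1 + V.
    print(statistics.fmean(draws), statistics.pvariance(draws))
```

In this particular toy model the second update of `theta` depends only on `y_tilde` and fresh noise, so successive draws do not depend on the previous value of `theta`. That behaviour is specific to the toy setup and should not be read as a general property of interweaving.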

Victoria Liublinska (HU), Jin Xu (UCI), Jing Liu (UCI)
1 Dec 2009
Accounting for Missing Lines in Atomic Emissivity Databases Using DEM Analysis with High-resolution X-ray Spectra (VL)
Abstract: Access to a substantial amount of data in the high-energy range gives us an opportunity to extend our knowledge of stellar coronal composition and temperature structure by analyzing the entire spectrum as a whole. Moreover, data from detectors with high spectral resolution will provide additional constraints on atomic data measurements being conducted in laboratories on the ground. In particular, the best atomic emissivity databases created by physicists still have missing, misplaced, or poorly estimated lines, and the goal of our analysis is to provide ways of identifying lines that were omitted and to improve our estimates of stellar Differential Emission Measure and plasma abundance by incorporating the information about them.
Presentation [.pdf]

In addition Jin Xu will discuss Solar DEM reconstruction from photometric images and Jing Liu will present an update on X-ray Image Analysis of Quasar Jets.

Meng Xiao-li (Harvard)
Tulun Ergin (CfA)
26 Jan 2010
[XLM] A Statistician's View of Upcoming Grand Challenges in Astronomy
A re-imputation of an imputed talk given at the January Meeting of the American Astronomical Society in Washington DC
Abstract: There is a broad spectrum of astro-statistical challenges in this age of huge, complex, and computer-intensive models, data, instruments, and questions. These challenges bridge astronomy at many wavelengths, basic physics, machine learning, and statistics. At one end of our spectrum, we think of 'compressing' the data with non-parametric methods. This raises the question of creating 'pseudo-replicas' of the data for uncertainty estimates. What would be involved in, e.g., bootstrap and related methods? Somewhere in the middle are these non-parametric methods for encapsulating the uncertainty information. At the far end, we find more model-based approaches, with the physics model embedded in the likelihood and analysis. The other distinctive problem is really the 'black-box' problem, where one has a complicated (e.g., fundamental physics-based) computer code, or 'black box', and one needs to know how changing the parameters at input -- due to uncertainties of any kind -- will map to changing the output. All of these connect to challenges in complexity of data and computation speed. Dr. Meng will highlight ways to 'cut corners' with advanced computational techniques, such as Parallel Tempering and Equal Energy methods. There are also cautionary tales of running automated analysis with real data -- where "30 sigma" outliers due to data artifacts can be more common than the astrophysical event of interest.
AAS Presentation [.ppt]

Extended Sources in TeV and GeV Energies
Presentation: [pdf]

Alex Blocker
David Stenning
Jin Xu
09 Feb 2010
[DS] -- sunspot classification
[pdf]

[JX] -- solar DEM
[pdf]

[AB] Doing Right By Massive Data: How To Bring Probability Modeling To The Analysis Of Huge Datasets Without Taking Over The Datacenter
Abstract: The analysis of extremely large-scale complex datasets is becoming an increasingly important task in the analysis of scientific data. This trend is especially prevalent in astronomy, as large-scale surveys such as SDSS, Pan-STARRS, and the LSST deliver (or promise to deliver) unprecedented amounts of data. While both the statistics and machine-learning communities have offered approaches to these problems, neither has produced a satisfactory approach. Statistical solutions are typically rigorous and well-motivated but do not scale well to massive datasets, whereas machine learning solutions typically lack statistical rigor and fail to account for the nuances of the scientific problem at hand. I will discuss an approach for combining much of the power of probability modeling with the scalability of more ad-hoc machine learning approaches in the context of an event detection problem for massive collections of time series. I will also provide comments on the assessment of uncertainty in this context and some general remarks on "using all of your tools, but in the right order," as a much pithier writer once said.
Presentation [.pdf]

Aneta Siemiginowska
23 Feb 2010
Abstract: Models of young radio sources predict that a significant fraction of their energy should be radiated in X-rays and gamma-rays. Recent Chandra and Fermi/LAT observations can be used to constrain the theoretical models and to determine the energetics of young sources and their contribution to the background radiation. In my talk I review the current data and describe challenges in the statistical analysis of the Fermi data. The main goal of future studies is to develop a full statistical model to evaluate the gamma-ray flux of young radio sources, verify theoretical models predicting high-energy emission, and determine the distribution of the young radio source population in gamma-rays.
Presentation [.pdf]

[Jin Xu] Solar DEM
[.pdf]

Don Richard (Penn State)
23 Mar 2010
Maximum Likelihood Estimation and the Bayesian Information Criterion
Abstract: The talk will introduce the method of maximum likelihood in the context of problems in astrophysics and present several applications. We examine in detail the problem of fitting competing statistical models to the luminosity functions of globular clusters. We shall see that the Bayesian Information Criterion leads to the conclusion that the Gaussian model is to be preferred over the t-distribution model for the globular cluster luminosity function (GCLF) in the Milky Way. Finally, we will discuss some open research problems and opportunities in the area.
[pdf]
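
For reference, the Bayesian Information Criterion mentioned in the abstract above is usually defined as follows. The notation is an assumption introduced here: $k$ is the number of free parameters, $n$ the sample size, and $\hat{L}$ the maximized likelihood.

```latex
\[
\mathrm{BIC} \;=\; k \ln n \;-\; 2 \ln \hat{L}
\]
```

The model with the lower BIC is preferred. As an illustration, a Gaussian model has $k = 2$ (mean and variance), while a t-distribution model has $k = 3$ if its degrees of freedom are also estimated. Whether the talk treats the degrees of freedom as free is not stated in the abstract.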

Statistical Inference with Monotone Incomplete Multivariate Normal Data
Abstract: We consider problems in statistical inference with two-step, monotone incomplete data drawn from a multivariate normal population. We derive stochastic representations for the exact distributions of the maximum likelihood estimators of the population mean vector and covariance matrix and deduce a wide collection of results for inference on the mean vector, including: lower bounds on the level of confidence associated with ellipsoidal confidence regions for the mean, confidence regions for linear combinations of the components of the mean, and unbiasedness results for several testing problems on the mean vector and covariance matrix. With regard to problems of shrinkage estimation for the mean, we extend to the case of monotone incomplete samples a wide class of classical results on the reduced risk of estimators of James-Stein type. In testing for multivariate normality of monotone incomplete data, we construct Mardia-type statistics for testing kurtosis and skewness, and derive their asymptotic distributions. If time permits, we will provide an application to a well-known cholesterol data set featured in the Minitab Handbook.
[pdf]

Alex Blocker
06 Apr 2010
Doing Right By Massive Data: Using Probability Modeling To Advance The Analysis Of Huge Astronomical Datasets
Abstract: The analysis of extremely large, complex datasets is becoming an increasingly important task in the analysis of scientific data. This trend is especially prevalent in astronomy, as large-scale surveys such as SDSS, EROS, Pan-STARRS, and the LSST deliver (or promise to deliver) terabytes of data per night. While both the statistics and machine-learning communities have offered approaches to these problems, neither has produced a completely satisfactory approach. Working in the context of event detection for the MACHO LMC data, I will present an approach that combines much of the power of Bayesian probability modeling with the efficiency and scalability typically associated with more ad-hoc machine learning approaches. This provides both rigorous assessments of uncertainty and improved statistical efficiency on a dataset containing approximately 20 million sources and 40 million individual time series. I will also discuss how this framework could be extended to related problems.
rehearsal for NESS: [pdf]

Xie Xianchao
20 Apr 2010
Dust Temperature and Spectral Index Correlation?
Abstract: Recent advances in infrared and sub-millimeter technologies have allowed observations to be made of the dust emission in a variety of environments. Applying spectral energy distribution fitting to the flux measurements, several independent research groups have concluded that there exists an inverse correlation between dust temperature and dust emissivity spectral index. However, it is also suspected that the empirical correlation might have been caused by the noise in the measurements, as illustrated in Shetty et al. (2009). This talk discusses how Bayesian models can possibly be employed to address such an issue. MCMC methods, especially Gibbs samplers, are used to conduct the posterior inference. A specific challenge in designing a fast-converging Gibbs chain is also noted.
[pdf]
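
For context, spectral energy distribution fits of dust emission commonly use a modified blackbody. The form below is that common choice, given as an assumption for illustration; the exact model used in the talk is not stated in the abstract. Here $S_\nu$ is the flux density at frequency $\nu$, $T$ the dust temperature, and $\beta$ the emissivity spectral index.

```latex
\[
S_\nu \;\propto\; \nu^{\beta}\, B_\nu(T),
\qquad
B_\nu(T) \;=\; \frac{2 h \nu^{3}}{c^{2}}\,
  \frac{1}{\exp\!\left(h\nu / k_{\mathrm{B}} T\right) - 1}
\]
```

Because $T$ and $\beta$ both shape the same spectrum, noise in the measured fluxes can shift their fitted values in opposite directions, which is the kind of spurious inverse correlation the abstract raises.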

Xu Jin
25 May 2010
Solar DEM reconstruction
[.pdf]

Victoria Liublinska
15 Jun 2010
Reconstructing stellar DEM and metallicity using high-resolution X-ray Spectra
[.pdf]

Jing
22 Jun 2010
Deconvolution of Quasar X-ray Images Using the EM Method

Fall/Winter 2004-2005
Siemiginowska, A. / Connors, A. / Kashyap, V. / Zezas, A. / Devor, J. / Drake, J. / Kolaczyk, E. / Izem, R. / Kang, H. / Yu, Y. / van Dyk, D.
Fall/Winter 2005-2006
van Dyk, D. / Ratner, M. / Jin, J. / Park, T. / CCW / Zezas, A. / Hong, J. / Siemiginowska, A. & Kashyap, V. / Meng, X.-L.
Fall/Winter 2006-2007
Lee, H. / Connors, A. / Protopapas, P. / McDowell, J. / Izem, R. / Blondin, S. / Lee, H. / Zezas, A. & Lee, H. / Liu, J.C. / van Dyk, D. / Rice, J.
Fall/Winter 2007-2008
Connors, A. & Protopapas, P. / Steiner, J. / Baines, P. / Zezas, A. / Aldcroft, T.
Fall/Winter 2008-2009
H. Lee / A. Connors, B. Kelly, & P. Protopapas / P. Baines / A. Blocker / J. Hong / H. Chernoff / Z. Li / L. Zhu (Feb) / A. Connors (Pt.1) / A. Connors (Pt.2) / L. Zhu (Mar) / E. Kolaczyk / V. Liublinska / N. Stein
Fall/Winter 2009-2010
A.Connors / B.Kelly / N.Stein, P.Baines / D.Stenning / J. Xu / A.Blocker / P.Baines, Y.Yu / V.Liublinska, J.Xu, J.Liu / X.L. Meng, et al. / A. Blocker, et al. / A. Siemiginowska / D. Richard / A. Blocker / X. Xie / X. Jin / V. Liublinska / L. Jing

CHASC