Last Updated: 2010aug31
mini-Workshop on Computational AstroStatistics
Challenges and Methods for Massive Astronomical Data
Aug 24-25, 2010
The California-Boston-Smithsonian AstroStatistics Collaboration
hosted a mini-workshop on Computational Astro-statistics
at the CfA.
With the advent of new missions like the Solar Dynamics Observatory
(SDO), the Panoramic Survey Telescope and Rapid Response System
(Pan-STARRS), and the Large Synoptic Survey Telescope (LSST),
astronomical data collection is fast outpacing our capacity to analyze
the data.  Astrostatistical effort has generally
focused on principled analysis of individual observations, on one
or a few sources at a time.  But the new era of data-intensive
observational astronomy forces us to consider combining multiple
datasets and inferring parameters that are common to entire populations.
Many astronomers really want to use every data point and even
non-detections, but this becomes problematic for many statistical
techniques.
The goal of the Workshop was to explore new problems in astronomical
data analysis that arise from data complexity.  Our focus was on
problems that have generally been considered intractable due to
insufficient computational power or inefficient algorithms, but are
now becoming tractable.  Examples of such problems include: accounting
for uncertainties in instrument calibration; classification,
regression, and density estimation for massive data sets that may
be truncated and contaminated with measurement errors and outliers;
and designing statistical emulators to efficiently approximate the
output from complex astrophysical computer models and simulations,
thus making statistical inference on them tractable.  We aimed to
present some issues based on existing X-ray data from observatories
such as Chandra and XMM-Newton to the statisticians and clarify
difficulties with the currently used methodologies, e.g. MCMC
methods.  The Workshop consisted of review talks on current
statistical methods by statisticians, descriptions of data analysis
issues by astronomers, and open discussions between astronomers and
statisticians.  We hope to define a path for development of new
algorithms that target specific issues, designed to help with
applications to SDO, Pan-STARRS, LSST, and other survey data.
The schedule was structured to encourage questions and discussion, both
during the talks themselves and during the loosely structured
Discussion sessions at the end of each day.
Alanna's notes: [.txt] (internal only)
Aug 24, 2010
9:30am - Noon : Session 1A : video [.rm 277MB]
 Moderator: Andreas Zezas (Crete)
-  Aneta Siemiginowska (SAO) : Welcome and Introduction
	
- [.pdf]
-  Kirk Borne (George Mason) : LSST: Informatics and Statistics Research Challenges
	
- Abstract
 The proposed Large Synoptic Survey Telescope (LSST) would
	generate the equivalent of one entire Sloan Digital Sky Survey's 
	data output each night for 10 years.  The scientific discovery 
	potential from these data is enormous, as are the research challenges
	that they impose.  I will review briefly the plans for LSST
	and for the new LSST Informatics and Statistical Sciences
	research collaboration team.  The primary emphasis will be
	on the research questions that are related to the large, complex 
	data collection to be produced by the survey.  These research 
	questions will be framed within the context of a new emerging
	astronomy subdiscipline, Astroinformatics.
 [.ppt]
-  Keith Arnaud (GSFC) : LISA: A Big Problem on a Small Data Set
	
- Abstract
 The Laser Interferometer Space Antenna (LISA) is a planned NASA/ESA mission 
	to measure gravitational waves. Although the basic LISA data set 
	comprises only three time series, their analysis is a significant 
	problem in computational astrostatistics because the signals from tens 
	of thousands of sources are superimposed. I will describe the problem 
	and show some of the approaches adopted.
 [.ppt]
-  Brandon Kelly (SAO) : Constraining astronomical populations with truncated data sets
	
- Abstract
 Understanding astronomical populations and their evolution
	is often one of the primary goals of large surveys. However,
	this is not always a straightforward task in that the
	quantities of interest, such as mass, are not measurable,
	but rather they are derived from measurable quantities such
	as luminosity with uncertainty. Moreover, the situation is
	complicated by data truncation caused by brightness limits
	of telescopes. This makes it difficult to perform statistical
	inference on the astronomical populations, especially if
	one wants to accurately account for the uncertainty in the
	derived parameters. In this talk I will discuss a Bayesian
	approach to this problem, based on hierarchical modeling,
	as well as recent applications of this approach to astronomical
	surveys. I will conclude by discussing some of the computational
	problems facing this approach, outlining areas where further
	work is needed.
 [.ppt]
Noon - 1:30pm : Lunch break
1:30pm - 4pm : Session 1B : video [.rm 244MB]
 Moderator: Paul Baines (UC Davis)
-  Peter Freeman (CMU) : Nonlinear Data Reparametrization with Diffusion Map
	
- Abstract
 Data that inhabit complex structures in high-dimensional
	spaces, such as galaxy spectra, often possess a simpler
	underlying geometry.  Diffusion map is a nonlinear
	eigen-technique that captures that geometry by propagating
	local neighborhood information through a Markov process.
	It thus allows one to find a natural coordinate system for
	data whose original parametrization is not amenable to
	available statistical techniques.  In this talk I will
	review the basics of diffusion map, show how it has been
	applied to various datasets by members of our group, and
	outline the challenges we face in scaling up diffusion-map-based
	algorithms in the era of LSST.
 [.pdf]
-  Joey Richards (UC Berkeley) : Real-time Classification for The Palomar Transient Factory
	
- Abstract
 I will be talking about the challenges of classifying astronomical
	time-series data, such as the photometric light curves collected by
	PTF.  Recently, we created a method for supernova light curve
	classification, and I will show results using data from the DES
	Supernova Photometric Classification Challenge.  The next challenge
	is to extend these methods to deal with highly multi-class problems
	and to scale them up for real-time classification in preparation for
	the LSST.
 [.pdf]
-  Daryl Geller (Stony Brook) : Spherical wavelets for CMB temperature and polarization data analysis
	
- Abstract
 Spherical wavelets are a tool for spherical data analysis.
	Their main advantage over spherical harmonics is their
	localization, both in space and frequency.   (Of course,
	spherical harmonics are completely localized in frequency,
	but they are spread all over the sphere.)    This property
	of spherical wavelets has been exploited in CMB analysis,
	in particular in avoiding foregrounds/masked regions, and
	also in searching for features/asymmetries, specifically
	the "cold spot".
 We discuss four different kinds of spherical wavelets, all
	of "needlet" type; they all possess the crucial properties
	of localization.  In addition, under mild conditions, the
	needlet coefficients of a random field  (defined by taking
	inner products of the random field with the needlets) turn
	out to be asymptotically uncorrelated, making it possible
	to exploit the law of large numbers for power spectrum-type
	estimations.
 For the study of CMB temperature, we discuss standard
	needlets and introduce Mexican needlets; for the study of
	CMB polarization, we introduce spin needlets; and for the
	study of cross-spectra between the temperature and polarization
	fields (one of the main objectives of the Planck mission),
	we introduce mixed needlets.
 My contributions have been in collaboration with Domenico
	Marinucci, Frode Hansen and Azita Mayeli.
 [.pdf]
4pm - 4:20pm : Coffee break
4:20pm - 5:30pm : Open Discussion : video [.rm 114MB]
Moderator: Vinay Kashyap (SAO)
	
[.pdf]
-  Nick Wright (SAO) : Statistical Challenges in the Chandra Cygnus OB2 Survey
	
-  [.pdf]
-  Raffaele D'Abrusco (SAO) : IVOA IG-KDD
	
-  [.pdf]
-  Shantanu Desai (Illinois) : Dark Energy Data Management System : Overview and Challenges
	
-  [.pdf]
Aug 25, 2010
9:30am - Noon : Session 2A : video [.rm 288MB]
 Moderator: Jeremy Drake (SAO)
-  Alisdair Davey & Paola Testa (SAO) : Challenges in Data Distribution and Analysis with the Solar Dynamics Observatory
	
	
- Abstract
 We discuss the challenges of storing, accessing, and analyzing
	the large volume of data (~2TB/day) from the Solar Dynamics
	Observatory (SDO).  New tools are required to use SDO data
	effectively, and in particular meta-data are created to
	allow scientists to identify and retrieve data sets that
	address their particular science questions. We present the
	results of the efforts in this regard by the SDO Feature
	Finding team, to build a comprehensive computer vision
	pipeline for SDO. This pipeline will provide complete
	metadata on many of the features and events detectable on
	the Sun without human intervention and make them available
	to the entire solar community. We also talk about the
	challenges of providing access to the data to solar scientists
	from around the planet.
 [.pptx]
 [movie1, movie2, movie3, movie4]
-  Ashish Mahabal (Caltech) : Where statistical methods can help with Transients classification from surveys
	
- Abstract
 Recent advances in observing and computing  technology have
	led to a large explosion in astronomical data in terms of
	sheer volume, so much so that there is no way humans can
	look at all the data. As a result of synoptic sky surveys
	like Palomar-Quest and the Catalina Real-time Transient Survey,
	digital movies instead of individual snapshots are available
	(although often with large gaps between successive frames).
	Detecting transients in these streams is the starting point
	of many interesting projects (new classes, sub-populations
	of objects, and in general better understanding of the
	nature of different types of astronomical objects).
	Characterizing and classifying the transients is not easy
	though, partly owing to the sparsity of the data as well
	as the presence of upper limits (varying error-bars, missing
	data, censored data etc.), and mainly because it has to be
	done based on just a small number of initial observations.
	New developments may even be required to make substantial
	progress. Sometimes some context information helps. This
	is often from other wavelengths and with rather different
	characteristics. Combining the heterogeneous data forms
	another challenge. I will present details on these and a
	few other issues as well as the current status. Other
	existing and forthcoming surveys (e.g. LSST, ASKAP-VAST,
	Gaia) will automatically benefit from advances in this area.
 [.pdf]
-  Pavlos Protopapas (SAO) : Discovery of celestial objects using machine learning techniques
	
- Abstract
 In the modern era of astronomy, data are expanding at an
	exponential rate.  Our traditional methods do not
	work at these massive data rates, and machine learning has
	been called to the rescue. In this talk I will present a
	few examples of machine learning methods that have been used at the
	Time Series Center to discover new celestial objects.
	We are discovering new variable stars, new quasars, and new
	objects at the very edge of the solar system. These discoveries
	are helping us shape our understanding of the universe we
	live in and could only be possible with advanced machine
	learning methods.
 [.pdf]
Noon - 1:30pm : Lunch break
1:30pm - 4pm : Session 2B : video [.rm 224MB]
 Moderator: Brandon Kelly (CfA)
-  Alexander Gray (Georgia Tech) : Beyond RAM: Fast Statistical Analysis in Databases
	
- Abstract
 In recent years we have developed the fastest current
	algorithms for various critical computations in astrostatistics,
	including n-point correlation functions, all-nearest-neighbors,
	kernel density estimation, and nonparametric Bayes
	classification.  The codes, however, were developed assuming
	the data can fit in memory.  I'll discuss how we have begun
	to enable such fast algorithms in the setting where the
	data fit on disk, but not necessarily RAM, by employing a
	novel disk-based tree structure and algorithmic approach.
	I will show experimental runtimes for an implementation
	within Microsoft SQL Server.
-  Alex Blocker (Harvard) : Semi-parametric Robust Event Detection for Massive Time-Series Datasets
	
- Abstract
 The detection and analysis of events within massive collections of
	time-series has become an extremely important task for time-domain
	astronomy. In particular, many scientific investigations (e.g. the
	analysis of microlensing and other transients) begin with the
	detection of isolated events in irregularly-sampled series with both
	non-linear trends and non-Gaussian noise. I will discuss a
	semi-parametric, robust, parallel method for identifying variability
	and isolated events at multiple scales in the presence of the above
	complications. This approach harnesses the power of Bayesian modeling
	while maintaining much of the speed and scalability of more ad-hoc
	machine learning approaches. I will also contrast this work with event
	detection methods from other fields, highlighting the unique
	challenges posed by astronomical surveys. Finally, I will present
	initial results from the application of this method to 87.2 million
	EROS sources, where we have obtained a greater than 100-fold reduction
	in candidates for certain types of phenomena.
 [.pdf]
-  Lukasz Wyrzykowski (Cambridge) : Transient classification with Gaia
	
- Abstract
 Gaia is the successor to the Hipparcos mission, and its main goal is to
	derive positions, distances and motion information for about a billion
	stars and create a 6D image of the Galaxy. In its 5-year lifetime
	(from 2012) it will repeatedly scan the entire sky, also allowing
	near-real-time detection of new objects or anomalous behaviour
	of stars.
 In my talk I will present the preparations undertaken for the
	detection and classification of transient events in Gaia. I will
	describe proposed detection methods and show first results of the
	classification of simulated data using SOMs, ANNs and Bayesian
	classifiers.
 [.pdf]
4pm - 4:20pm : Coffee break
4:20pm - 5:15pm : Open Discussion : video [.rm 119MB]
Moderator: David van Dyk (UC Irvine)
-  Alanna Connors (Eureka Sci) : Workshop Wrap-up
	
- [.rtf]
Participants:
-  Alanna Connors (Eureka Sci) 
-  Alberto Conti (STScI) 
-  Alex Blocker (Harvard) 
-  Alexander Gray (Georgia Tech) 
-  Alisdair Davey (SAO) 
-  Allison Strom (SAO) 
-  Andreas Zezas (Crete) 
-  Aneta Siemiginowska (SAO) 
-  Angelica de Oliveira Costa (CfA) 
-  Ashish Mahabal (Caltech) 
-  Brandon Kelly (CfA) 
-  Brendan Allen (CfA) 
-  Cecilia Garraffo (CfA) 
-  Chris Stubbs (CfA) 
-  Daryl Geller (Stony Brook) 
-  David Kipping (CfA) 
-  David Stenning (UC Irvine) 
-  David van Dyk (UC Irvine) 
-  Dharam Vir Lal (CfA) 
-  Eric Kolaczyk (BU) 
-  Francesco Massaro (SAO) 
-  Gautham Narayan (CfA) 
-  Ignazio Pillitteri (SAO) 
-  Irwin Shapiro (CfA) 
-  Jan Forbrich (CfA) 
-  Jennifer Posson-Brown (SAO) 
-  Jeremy Drake (SAO) 
-  Jin Xu (UC Irvine) 
-  Joey Richards (UC Berkeley) 
-  Kaisey Mandel (CfA) 
-  Karim Pichara (PUC de Chile) 
-  Keith Arnaud (GSFC) 
-  Kirk Borne (George Mason) 
-  Li Ji (CfA) 
-  Lukasz Wyrzykowski (Cambridge) 
-  Margarita Karovska (SAO) 
-  Nathan Stein (Harvard) 
-  Nick Wright (SAO) 
-  Paola Testa (SAO) 
-  Paul Baines (UC Davis) 
-  Paul Green (SAO) 
-  Pavlos Protopapas (CfA) 
-  Pete Ratzlaff (SAO) 
-  Peter Freeman (Carnegie Mellon) 
-  Raffaele D'Abrusco (CfA) 
-  Saku Vrtilek (SAO) 
-  Shandong Min (UC Irvine) 
-  Shantanu Desai (Illinois) 
-  Susana Eyheramendy (PUC de Chile) 
-  Terry Gaetz (SAO) 
-  Thomas Granger (CfA) 
-  Tom Aldcroft (SAO) 
-  Trae Winter (SAO) 
-  Vinay Kashyap (SAO) 
-  Xiao-Li Meng (Harvard) 
-  Yaming Yu (UC Irvine) 
  
Vinay Kashyap (vkashyap @ cfa . harvard . edu)
Aneta Siemiginowska (asiemiginowska @ cfa . harvard . edu)
David van Dyk (dvd @ ics . uci . edu)
This workshop was supported by CHASC/C-BAS,
NSF grants DMS 09-07185 (HU) and DMS 09-07522 (UCI), and the
Chandra X-ray Center.