The AstroStat Slog » CHASC Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders
[AAS-HEAD 2011] Time Series in High Energy Astrophysics Fri, 09 Sep 2011 17:05:33 +0000 vlk We organized a Special Session on Time Series in High Energy Astrophysics: Techniques Applicable to Multi-Dimensional Analysis on Sep 7, 2011, at the AAS-HEAD conference in Newport, RI. The talks presented at the session are archived at

A tremendous amount of information is contained within the temporal variations of various measurable quantities, such as the energy distributions of the incident photons, the overall intensity of the source, and the spatial coherence of the variations. While the detection and interpretation of periodic variations is well studied, the same cannot be said for non-periodic behavior in a multi-dimensional domain. Methods to deal with such problems are still primitive, and any attempts at sophisticated analyses are carried out on a case-by-case basis. Some of the issues we seek to focus on are:
* Stochastic variability
* Chaotic and quasi-periodic variability
* Irregular data gaps/unevenly sampled data
* Multi-dimensional analysis
* Transient classification
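
For the unevenly sampled case in particular, the Lomb-Scargle periodogram is a standard starting point. Below is a minimal NumPy sketch of the classic Lomb-Scargle power (illustrative only, not a method from any of the session talks), applied to an irregularly sampled noisy sinusoid with a made-up frequency:

```python
import numpy as np

def lomb_scargle(t, y, freqs):
    """Classic Lomb-Scargle periodogram for unevenly sampled data."""
    y = y - y.mean()
    power = np.empty_like(freqs)
    for i, f in enumerate(freqs):
        w = 2.0 * np.pi * f
        # the time offset tau makes the periodogram invariant to time shifts
        tau = np.arctan2(np.sum(np.sin(2 * w * t)),
                         np.sum(np.cos(2 * w * t))) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[i] = 0.5 * ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s))
    return power

# irregularly sampled sinusoid at 0.2 cycles per unit time, plus noise
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 100, 300))
y = np.sin(2 * np.pi * 0.2 * t) + 0.5 * rng.standard_normal(t.size)
freqs = np.linspace(0.01, 0.5, 500)
best = freqs[np.argmax(lomb_scargle(t, y, freqs))]
```

The periodogram recovers the injected frequency without requiring an evenly spaced grid, which is exactly what interpolation-based FFT approaches cannot guarantee.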

Our goal is to present some basic questions whose progress requires sophisticated temporal analysis. We plan to bring together astronomers and statisticians who are working in many different subfields, so that an exchange of ideas can motivate the development of sophisticated algorithms generally applicable to astronomical time series data. We will review the problems and issues with current methodology from an algorithmic and statistical perspective, and then look for improvements or for new methods and techniques.

mini-Workshop on Computational AstroStatistics [announcement] Mon, 21 Jun 2010 16:25:31 +0000 chasc mini-Workshop on Computational Astro-statistics: Challenges and Methods for Massive Astronomical Data
Aug 24-25, 2010
Phillips Auditorium, CfA,
60 Garden St., Cambridge, MA 02138


The California-Boston-Smithsonian Astrostatistics Collaboration plans to host a mini-workshop on Computational Astro-statistics. With the advent of new missions like the Solar Dynamics Observatory (SDO), the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS), and the Large Synoptic Survey Telescope (LSST), astronomical data collection is fast outpacing our capacity to analyze it. Astrostatistical effort has generally focused on principled analysis of individual observations, on one or a few sources at a time. But the new era of data-intensive observational astronomy forces us to consider combining multiple datasets and inferring parameters that are common to entire populations. Many astronomers want to use every data point, and even non-detections, but this becomes problematic for many statistical techniques.

The goal of the Workshop is to explore new problems in astronomical data analysis that arise from data complexity. Our focus is on problems that have generally been considered intractable due to insufficient computational power or inefficient algorithms, but are now becoming tractable. Examples of such problems include: accounting for uncertainties in instrument calibration; classification, regression, and density estimation for massive data sets that may be truncated and contaminated with measurement errors and outliers; and designing statistical emulators to efficiently approximate the output of complex astrophysical computer models and simulations, thus making statistical inference on them tractable. We aim to present some issues to the statisticians and clarify difficulties with the currently used methodologies, e.g., MCMC methods. The Workshop will consist of review talks on current statistical methods by statisticians, descriptions of data analysis issues by astronomers, and open discussions between astronomers and statisticians. We hope to define a path for the development of new algorithms that target specific issues, designed to help with applications to SDO, Pan-STARRS, LSST, and other survey data.
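
As a toy illustration of the emulator idea (not any specific method planned for the workshop), the sketch below fits a Gaussian-process mean predictor to a handful of runs of a stand-in "expensive" model; `expensive_model`, the kernel length scale, and all numbers are invented for illustration:

```python
import numpy as np

def expensive_model(x):
    # stand-in for a costly astrophysical simulation
    return np.sin(3 * x) + 0.5 * x

def rbf(a, b, length=0.5):
    """Squared-exponential kernel between two 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

# train on a handful of expensive runs
x_train = np.linspace(0, 2, 8)
y_train = expensive_model(x_train)

K = rbf(x_train, x_train) + 1e-8 * np.eye(x_train.size)  # jitter for stability
alpha = np.linalg.solve(K, y_train)

def emulate(x_new):
    """Gaussian-process mean prediction: a cheap surrogate for the model."""
    return rbf(x_new, x_train) @ alpha

x_test = np.array([0.7, 1.3])
```

After eight model evaluations, `emulate` approximates the expensive function anywhere in the training range at negligible cost, which is the property that makes statistical inference over simulator parameters feasible.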

We hope you will be able to attend the workshop and present a brief talk on the scope of the data analysis problems you confront in your project. The workshop will have presentations in the morning sessions, followed by a discussion session in the afternoon of each day.

[MADS] multiscale modeling Thu, 11 Dec 2008 19:46:05 +0000 hlee A few scientists in our group work on estimating the intensities of gamma-ray observations from sky surveys. This work is distinct from typical image processing, which mostly concerns point estimation of the intensity at each pixel location and the size of an overall white-noise-type error. Often you will notice in image processing the assumptions of orthogonality between errors and sources, and of white noise; these assumptions are typical features of image processing utilities and modules. CHASC scientists, on the other hand, address broader statistical inference problems in estimating the intensity map, such as intensity uncertainties at each point and the scientifically informative display of the intensity map with uncertainty, according to the Poisson count model and constraints from physics and the instrument. This is where the field of multiscale modeling comes in.
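
As a hypothetical illustration of the scale structure underlying such models (not the group's actual Bayesian method), the sketch below builds the dyadic aggregation pyramid of a 1-D Poisson count vector; this is the skeleton on which multiscale Poisson models typically place their priors, one level per spatial scale:

```python
import numpy as np

def multiscale_pyramid(counts):
    """Dyadic aggregation of a 1-D count vector: each coarser level
    sums adjacent pairs, so level k holds counts at scale 2**k."""
    levels = [np.asarray(counts)]
    while levels[-1].size > 1:
        c = levels[-1]
        levels.append(c[0::2] + c[1::2])  # sum neighbors pairwise
    return levels

# toy Poisson counts in 8 pixels
counts = np.array([3, 1, 4, 1, 5, 9, 2, 6])
pyramid = multiscale_pyramid(counts)
```

A multiscale Poisson model then parameterizes how each parent count splits between its two children, which lets uncertainty be quantified coherently across all scales rather than pixel by pixel.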

As the post title [MADS] indicates, no abstract in ADS carries the keyword multiscale modeling. It seems that only the jargon is missing from ADS, since multiscale modeling is certainly practiced in astronomy; one example is our group's work. Those CHASC scientists take Bayesian modeling approaches, which, to my knowledge, makes them unique in the astronomical community. I had expected constructing an intensity map through statistical inference (estimation), or “multiscale modeling,” to have become popular among astronomers in recent years, but nothing came up in my abstract keyword search.

Wikipedia also gives a very brief description of multiscale modeling and emphasizes that it is a fairly new interdisciplinary topic: wiki:multiscale_modeling. Tom Loredo kindly pointed me to some relevant references in ADS after my post [MADS] HMM. He mentioned that his search term was Markov Random Fields, which appear in stochastic geometry and spatial statistics in addition to many applications in computer science. Beyond these publications, he also gave me a nice comment on analyzing astronomical data, which I would rather postpone for another discussion.

The reason I was not able to find these papers is that they are not published in the four major astronomical publications plus Solar Physics. The reason for this limited search is that I was overwhelmed by the number of unrestricted search results, including arxiv. (I wonder if there is a way to do exclusive searches in ADS by excluding arxiv:comp, arxiv:phys, arxiv:math, etc.) Thank you, Tom, for providing these references.

Please check out the CHASC website for more study results related to “multiscale modeling” from our group.

[Added] Nice tutorials related to Markov Random Fields (MRFs), recommended by an expert in the field and a friend (all are PDFs).

  1. Markov Random Fields and Stochastic Image Models (ICIP 1995 Invited Tutorial)
  2. Digital Image Processing II Reading List
  3. MAP, EM, MRFs, and All That: A User’s Guide to the Tools of Model Based Image Processing (incomplete)
survey and design of experiments Wed, 01 Oct 2008 20:16:24 +0000 hlee People with experience would say something very different, and wiser, than what I am about to discuss. This post merely sets side by side two small cross sections, one from each of two trees: astronomy and statistics.

When it comes to surveys, the first thing that comes to my mind is the census packet. I have only seen it once (an easy way to disguise my age, but it is true), yet its questionnaire layout was so carefully and extensively done that it left a strong impression on me. Such a survey is designed prior to collecting data, so that after collection the data can be analyzed with statistical methodology suited to the design of the survey. Strategies for quantifying responses are also included (yes/no as 0/1, responses on a 0-to-10 scale, bracketed salaries, age groups, and so on, plus handling of missing data) to allow elaborate statistical analysis while avoiding subjective data transformations and arbitrary outlier elimination.

In contrast, a survey in astronomy means designing a mesh, not questionnaires, and that mesh cannot be transcribed into statistical models. It has multiple layers, such as the telescope, the detector, and the source detection algorithm, and eventually produces a catalog. Designing statistical methodology that draws interpretable conclusions is not part of it. Collecting whatever passes through that mesh is an astronomical survey. Analyzing the catalog does not necessarily involve sophisticated statistics; often it relies on chi-square fitting and the casting away of unpleasant or uninteresting data points.

As with other conflicts in jargon (the simplest example is H_o: I used to know it as the Hubble constant, but now I recognize it first as notation for a null hypothesis), survey has been one of them. As with measurement error, some clarification of the term survey by knowledgeable astrostatisticians is needed to draw more statisticians into the grand survey projects soon to come. Luckily, the first opportunity will arrive soon at the Special Session: Meaning from Surveys and Population Studies: BYOQ during the 213th AAS meeting, at Long Beach, California, on Jan. 5th, 2009.

The LRT is worthless for … Fri, 25 Apr 2008 05:48:06 +0000 hlee One of the speakers in the Google talk series exemplified model-based clustering and mentioned the likelihood ratio test (LRT) for choosing the number of clusters. Since I have seen examples of improperly practiced LRTs in astronomical journals, such as testing two clusters vs. three or a higher number of components, I could not resist pointing out that the LRT appeared to be improperly used in his illustration. In reply, he explained that the citation regarding the LRT differed from his plot, and that the test had been carried out to compare one component vs. two, which closely observes the regularity conditions. I was relieved not to have found another example of the ill-used LRT.

There are various tests applicable depending on the needs and the conditions of the data and source models, but it seems no popular astronomical lexicon carries these tests on demand, except the LRT. (I have seen the score test once since I began posting [ArXiv]s on the slog, and a few nonparametric rank-based tests over the years.) I am sure knowledgeable astronomers will soon point out that I have jumped to a conclusion too quickly and will bring up counterexamples. Until then, be advised that your LRTs, χ^2 tests, and F-tests ask for your statistical attention prior to their application to any statistical inference. These tests are not magic crystals producing the answers you are looking for. To encourage such care and attention, here is a paper with a thought-provoking title that I found some years ago.

The LRT is worthless for testing a mixture when the set of parameters is large
J.M. Azaïs, E. Gassiat, C. Mercadier (click here: I found it on the internet, but the link seems to come and go and is sometimes unavailable.)

Here, quotes replace theorems and their proofs[1] :

  • We prove in this paper that the LRT is worthless for testing a distribution against a two-component mixture when the set of parameters is large.
  • One knows that the traditional Chi-square theory of Wilks[16[2]] does not apply to derive the asymptotics of the LRT due to a lack of identifiability of the alternative under the null hypothesis.
  • …for unbounded sets of parameters, the LRT statistic tends to infinity in probability, as Hartigan[7[3]] first noted for normal mixtures.
  • …the LRT cannot distinguish the null hypothesis (single gaussian) from any contiguous alternative (gaussian mixtures). In other words, the LRT is worthless[4].

For astronomers, large sets of parameters are usually of no concern, thanks to theoretical constraints from physics. Experience and theory keep the parameter set small. Sometimes, however, the distinction between small and large sets can be vague.

The characteristics of the LRT are well established under the assumption of a compact (or bounded) parameter set, but trouble arises when the limit goes to the boundary. As cited a few times before on the slog, for more rigorously presented ideas about the LRT from the astronomy side, readers are recommended to read Protassov et al. (2002), Statistics, Handle with Care: Detecting Multiple Model Components with the Likelihood Ratio Test, ApJ, 571, p. 545.
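
To see concretely what is being tested, one can simulate the LRT under the null. The sketch below is purely illustrative (not from the paper above): it fits a single Gaussian and a two-component Gaussian mixture (via a bare-bones EM, with all starting values made up) to data drawn from one Gaussian. Repeating this over many simulated data sets would show that the statistic's null distribution does not follow the textbook chi-square calibration:

```python
import numpy as np

def loglik_single(x):
    """Log-likelihood of the single-Gaussian MLE fit."""
    mu, sd = x.mean(), x.std()
    return np.sum(-0.5 * np.log(2 * np.pi * sd**2) - (x - mu)**2 / (2 * sd**2))

def loglik_mixture(x, n_iter=200):
    """Two-component Gaussian mixture fit by a minimal EM sketch."""
    p, mu1, mu2 = 0.5, x.mean() - x.std(), x.mean() + x.std()
    s1 = s2 = x.std()
    for _ in range(n_iter):
        d1 = p * np.exp(-(x - mu1)**2 / (2 * s1**2)) / s1
        d2 = (1 - p) * np.exp(-(x - mu2)**2 / (2 * s2**2)) / s2
        r = d1 / (d1 + d2)                         # E-step: responsibilities
        p = r.mean()                               # M-step updates
        mu1 = np.sum(r * x) / np.sum(r)
        mu2 = np.sum((1 - r) * x) / np.sum(1 - r)
        s1 = np.sqrt(np.sum(r * (x - mu1)**2) / np.sum(r)) + 1e-6
        s2 = np.sqrt(np.sum((1 - r) * (x - mu2)**2) / np.sum(1 - r)) + 1e-6
    dens = (p * np.exp(-(x - mu1)**2 / (2 * s1**2)) / s1
            + (1 - p) * np.exp(-(x - mu2)**2 / (2 * s2**2)) / s2)
    return np.sum(np.log(dens / np.sqrt(2 * np.pi)))

rng = np.random.default_rng(1)
x = rng.standard_normal(500)          # data truly from ONE Gaussian
lrt = 2 * (loglik_mixture(x) - loglik_single(x))
```

The statistic is nonnegative because the mixture family contains the single Gaussian as a special case; the trouble is that under the null the mixture parameters are not identifiable, so no chi-square reference distribution applies.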

  1. Readers might want to look up the mathematical statements and proofs in the paper.
  2. S.S. Wilks, The large-sample distribution of the likelihood ratio for testing composite hypotheses, Ann. Math. Stat., 9:60-62, 1938
  3. J.A. Hartigan, A failure of likelihood asymptotics for normal mixtures, in Proc. Berkeley Conf., Vol. II, pp. 807-810
  4. Comment on Theorem 1. They prove the worthlessness of the LRT under more general settings; see Theorem 2.
[ArXiv] Astronomy Job Market in US Fri, 21 Dec 2007 17:47:59 +0000 hlee It’s a report about the astronomy job market in the US.

[astro-ph:0712.2820] The Production Rate and Employment of Ph.D. Astronomers T.S. Metcalfe

Related Comments:

  1. Many more jobs than I expected. Still, it cannot compete with the number of jobs in Statistics.
  2. About three jobs before landing a stable one in astronomy. I do not know the figure for statistics.
  3. Astronomy Ph.D. students receive more care, in the sense that the job market is managed to guarantee a position for every student. In statistics, even without such care you can find something (not necessarily a research position).

Unrelated Comment on Correlation:
It may be a cultural difference, or maybe not. When I learned correlation years ago from a textbook, the procedure was: 1. compute the correlation, and 2. do a t-test. In astronomical papers it is: 1. do a regression, and 2. plot the simple linear regression line with error bands and the data points. The computing procedure is the same, but the way the results are illustrated seems different.
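
In fact the two procedures are the same test: the t-statistic for the correlation coefficient equals the t-statistic for the regression slope. A small self-contained check, with made-up data:

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

# route 1 (textbook): correlation coefficient and its t-test
r = sxy / math.sqrt(sxx * syy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# route 2 (astronomy papers): regression slope and its t-test
slope = sxy / sxx
resid = sum((yi - my - slope * (xi - mx)) ** 2 for xi, yi in zip(x, y))
se_slope = math.sqrt(resid / (n - 2) / sxx)
t_slope = slope / se_slope
```

Algebraically, `t_corr` and `t_slope` are identical, so the cultural difference really is only in how the result is displayed, not in what is tested.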

I wonder what it would be like if we narrowed the job market to astrostatisticians.

Arrogant? Fri, 07 Sep 2007 00:47:14 +0000 hlee I once talked about the relationship between astronomers and statisticians in the slog posting Data Doctors. To astronomers, statisticians are assistants, merely helping astronomical data analysis with statistically limited eyes. Compared to other fields, statistical improvements and modifications have occurred less frequently in the astronomical community through collaborations with statisticians.

Not everyone agrees with this. I myself have changed this biased opinion to some degree. However, I still think astronomers appear full of pride in the eyes of non-astronomers.

Having an education in both fields, and having been away from astronomy for some years while learning statistics, I have tried to talk with astronomers whenever I had the chance. For years I could not get rid of the impression that astronomers are arrogant, in the sense that they believe they can do statistics without the help of statisticians. When I was looking for a job as a statistician in astronomy, most astronomers said they only needed technicians who are good at coding algorithms, including statistical ones. The implication behind this seemed to be that only astronomers do statistics and statistical data analysis with astronomical data.

However, as mentioned above, my opinion has been changed by some degree after coming to CfA and working with CHASC. The reasons for this change are:

  • Assistants can serve arrogant chiefs (this one is for a laugh).
  • When you work with astronomers, you do not feel they are arrogant at all. Astronomers are very passionate about what they are doing and have very pure minds. The passion and objectivity of astronomers differ from those of statisticians, which put up a little barrier when I talked to astronomers, particularly while I was job hunting.
  • Most importantly, astronomers know their data better. The data are in general very expensive, and their peculiarities, partially discussed in Quote of the Week, Aug. 31, 2007, hinder quick collaborations between the two communities. (Instead of spending years explaining instruments and physics to statisticians, it can be quicker for astronomers to do the statistics themselves.)

Just as we separate probability, theoretical statistics, biostatistics, spatial statistics, bioinformatics, applied statistics, data mining, machine learning, and so on, I hope astrostatistics can hold a position equivalent to the other subfields within the statistics community. I hope more astronomers will explain astronomy to statisticians with patience. Eventually, the impression of proud astronomers will die away, and many statistical methods for improved estimation and inference will be born.

[Addendum] A few senior statisticians, in a casual fashion, expressed that my interest in astrostatistics is reckless.

[ArXiv] NGC 6397 Deep ACS Imaging, Aug. 29, 2007 Wed, 05 Sep 2007 06:26:20 +0000 hlee From arxiv/astro-ph:0708.4030v1
Deep ACS Imaging in the Globular Cluster NGC 6397: The Cluster Color Magnitude Diagram and Luminosity Function by H.B. Richer

This paper presents an observational study of the globular cluster NGC 6397, enhanced and more informative than previous observations in the sense that 1) a truncation in the white dwarf cooling sequence occurs at magnitude 28, 2) the cluster main sequence seems to terminate approximately at the hydrogen-burning limit predicted by two independent stellar evolution models, and 3) luminosity functions (LFs) and mass functions (MFs) are well defined. Nothing statistical here, but the ideas of defining color magnitude diagrams (CMDs) and LFs described in the paper will assist in developing suitable statistics for CMD and LF fitting problems, in addition to the improved measurements (ACS imaging) of stars in NGC 6397.

Instead of adding details of the data properties and calibration process, including the instrument characteristics, I would like to add a few things for statisticians. First, ACS stands for Advanced Camera for Surveys, and information about it can be found at this link. Second, NGC is an abbreviation of New General Catalogue, one of astronomers’ cataloging systems (click for its wiki). Third, CMDs and LFs are the results of data processing described in the paper, but they can be considered scatter plots and kernel density plots (histograms) to be analyzed for inferring physical parameters. This data processing, or calibration, requires multi-level transformations, which cause error propagation. Finally, the chi-square method is used to fit the LFs and MFs. Among numerous fitting methods, in astronomy only the chi-square is ubiquitous (link to a discussion on the chi-square). Could we develop more robust statistics for fitting astronomical (empirical) functions?
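
As a hypothetical illustration of one alternative, the sketch below fits the slope of a toy power-law LF to Poisson-distributed bin counts in two ways: the usual chi-square with sqrt(N) errors, and the Cash (Poisson likelihood) statistic, which remains valid in low-count bins where sqrt(N) errors break down. All numbers and the model form are invented:

```python
import numpy as np

rng = np.random.default_rng(2)
mags = np.arange(20.0, 28.0, 1.0)            # magnitude bin centers
true_counts = 10 ** (0.3 * (mags - 20.0))    # toy power-law LF, slope 0.3
counts = rng.poisson(true_counts)            # observed Poisson counts

def model(a):
    return 10 ** (a * (mags - 20.0))

slopes = np.linspace(0.1, 0.5, 401)
# chi-square with sqrt(N) errors: dubious when counts are small
chisq = [np.sum((counts - model(a)) ** 2 / np.maximum(counts, 1))
         for a in slopes]
# Cash (Poisson log-likelihood) statistic: valid at low counts
cash = [2 * np.sum(model(a) - counts * np.log(model(a)))
        for a in slopes]

a_chisq = slopes[np.argmin(chisq)]
a_cash = slopes[np.argmin(cash)]
```

Both estimators recover a slope near the truth here, but only the Cash statistic keeps its justification when bins contain zero or a handful of stars, which is exactly the regime at the faint or truncated ends of empirical LFs.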

Quote of the Week, Aug 31, 2007 Sat, 01 Sep 2007 03:47:06 +0000 aconnors David van Dyk (representing statistics culture):
My assertion is that I find replicated results more convincing than extreme p-values. And the controversial part: Astronomers should aim for replication rather than worry about 5-sigma.
Once again, from the middle of a recent (Aug 30-31, 2007) argument within CHASC on why physicists and astronomers view “3 sigma” results with suspicion and expect (roughly) > 5 sigma, while statisticians and biologists typically assume 95% is OK:

David van Dyk (representing statistics culture):

Can’t you look at it again? Collect more data?

Vinay Kashyap (representing astronomy and physics culture):

…I can confidently answer this question: no, alas, we usually cannot look at it again!!

Ah. Hmm. To rephrase [the question]: if you have a “7.5 sigma” feature, with a day-long [imaging Markov Chain Monte Carlo] run you can only show that it is “>3sigma”, but is it possible, even with that day-long run, to tell that the feature is really at 7.5sigma — is that the question? Well that would be nice, but I don’t understand how observing again will help?

David van Dyk :

No one believes any realistic test is properly calibrated that far into the tail. Using 5-sigma is really just a high bar, but the precise calibration will never be done. (This is a reason not to sweat the computation TOO much.)

Most other scientific areas set the bar lower (2 or 3 sigma) BUT don’t really believe the results unless they are replicated.

My assertion is that I find replicated results more convincing than extreme p-values. And the controversial part: Astronomers should aim for replication rather than worry about 5-sigma.

[ArXiv] Numerical CMD analysis, Aug. 28th, 2007 Fri, 31 Aug 2007 01:36:38 +0000 hlee From arxiv/astro-ph:0708.3758v1
Numerical Color-Magnitude Diagram Analysis of SDSS Data and Application to the New Milky Way Satellites by J. T. A. de Jong et al.

The authors applied MATCH (Dolphin 2002[1]; note that the year is corrected) to M13, M15, M92, NGC2419, NGC6229, and Pal14 (well known globular clusters), and to BooI, BooII, CVnI, CVnII, Com, Her, LeoIV, LeoT, Segu1, UMaI, UMaII, and Wil1 (newly discovered Milky Way satellites) from the Sloan Digital Sky Survey (SDSS), to fit the color magnitude diagrams (CMDs) of these stellar clusters and determine the properties of the satellites.

A traditional CMD fitting begins with building synthetic CMDs: the completeness of SDSS Data Release 5, the Hess diagram (a bivariate histogram of a CMD), and features in MATCH for CMD synthesis were taken into account. The synthetic CMDs of the well known globular clusters were combined with the SDSS observations and compared to previous results to validate the modified MATCH on SDSS data sets. Afterwards, the method was applied to the newly discovered Milky Way satellites, and a discussion of the findings was presented.
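
Since a Hess diagram is, as noted, just a bivariate histogram over the CMD plane, it can be sketched in a few lines of NumPy; the "population" below is entirely made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
# toy CMD: color (e.g., g-r) and magnitude for a fake stellar population
color = rng.normal(0.5, 0.15, 5000)
mag = rng.normal(21.0, 1.5, 5000)

# a Hess diagram is a 2-D histogram of star density in the CMD plane
hess, color_edges, mag_edges = np.histogram2d(
    color, mag, bins=[40, 60], range=[[0.0, 1.0], [16.0, 26.0]]
)
```

Fitting then amounts to comparing this observed density array, bin by bin, against synthetic Hess diagrams generated from model populations, which is what makes the statistical formulation of CMD fitting a binned-count comparison problem.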

The paper provides plots that enhance the understanding of the age, metallicity, and other physical parameter distributions of stellar clusters after they are fit with synthetic CMDs. The paper also describes steps and tricks (to a statistician, the process of simulating stars looks very technical, without a mathematical/probabilistic justification) for acquiring proper synthetic CMDs that match observations. The paper adopted the Padova database of stellar evolutionary tracks and isochrones (there are other databases besides Padova).

Lastly, I would like to add a sentence from their paper, which supports my view that a priori knowledge is necessary when choosing an isochrone database.

In the case of M15, this is due to the blue horizontal branch (BHB) stars that are not properly reproduced by the theoretical isochrones, causing the code to fit them as a younger turn-off.

  1. Numerical methods of star formation history measurement and applications to seven dwarf spheroidals, Dolphin (2002), MNRAS, 332, p. 91