The AstroStat Slog » Python

Everybody needs crampons

vlk — Fri, 30 Apr 2010 16:12:36 +0000

Sherpa is a fitting environment in which Chandra data (and really, X-ray data from any observatory) can be analyzed. It has just undergone a major update and now runs on python. Or allows python to run. Something like that. It is a very powerful tool, but I can never remember how to use it, and I have an amazing knack for not finding what I need in the documentation. So here is a little cheat sheet (which I will keep updating ~~as and when~~ if I learn more):

2010-apr-30: Aneta has set up a blogspot site to deal with simple Sherpa techniques and tactics: http://pysherpa.blogspot.com/

On Help:

In general, to get help, use: ahelp "something" (note the quotes)
Even more useful, type: ? wildcard to get a list of all commands that include the wildcard
You can also do a form of autocomplete: type TAB after writing half a command to get a list of all possible completions.

Data I/O:

To read in your PHA file, use: load_pha()
Often for Chandra spectra, the background is included in that same file. In any case, to read it in separately, use: load_bkg()
- Q: should it be loaded into the same dataID as the source?
- A: Yes.
- A: When the background counts are present in the same file, they can be read in separately and assigned to the background via set_bkg('src',get_data('bkg')), so counts from a different file can be assigned as background to the current spectrum.
To read in the corresponding ARF, use: load_arf()
- Q: load_bkg_arf() for the background — should it be done before or after load_bkg(), or does it matter?
- A: does not matter
To read in the corresponding RMF, use: load_rmf()
- Q: load_bkg_rmf() for the background, and same question as above
- A: same answer as above; does not matter.
To see the structure of the data, type: print(get_data()) and print(get_bkg())
To select a subset of channels to analyze, use: notice_id()
To subtract background from source data, use: subtract()
To not subtract, to undo the subtraction, etc., use: unsubtract()
To plot useful stuff, use: plot_data(), plot_bkg(), plot_arf(), plot_model(), plot_fit(), etc.
(Q: how in god’s name does one avoid plotting those damned error bars? I know error bars are necessary, but when I have a spectrum with 8192 bins, I don’t want it washed out with silly 1-sigma Poisson bars. And while we are asking questions, how do I change the units on the y-axis to counts/bin? A: rumors say that plot_data(1,yerr=0) should do the trick, but it appears to be still in the development version.)

Fitting:

To fit a model to the data, command it to: fit()
To get error bars on the fit parameters, use: projection() (or covar(), but why deliberately use a function that is guaranteed to underestimate your error bars?)
Defining models appears to be much easier now. You can use syntax like: set_source(ModelName.ModelID+AnotherModel.ModelID2) (where you can distinguish between different instances of the same type of model using the ModelID — e.g., set_source(xsphabs.abs1*powlaw1d.psrc+powlaw1d.pbkg))
To see what the model parameter values are, type: print(get_model())
To change statistic, use: set_stat() (options are various chisq types, cstat, and cash)
To change the optimization method, use: set_method() (options are levmar, moncar, neldermead, simann, simplex)
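The commands above can be strung together into one short session. What follows is only a sketch, assuming a standalone Sherpa installation that exposes these commands through the sherpa.astro.ui module. The function name fit_power_law, its arguments, and the default energy range are hypothetical, and conf() is used for the error bars as an assumption about current releases (the cheat sheet itself names projection()).

```python
def fit_power_law(pha_file, arf_file=None, rmf_file=None,
                  lo=0.5, hi=7.0, subtract_bkg=False):
    """Load a spectrum, fit a power law, and estimate parameter errors.

    All file names are supplied by the caller; nothing here is specific
    to a particular observation.
    """
    # Imported here so that the sketch can be read (and the function
    # defined) even on a machine without Sherpa installed.
    from sherpa.astro import ui

    ui.load_pha(pha_file)            # source spectrum (and background, if bundled)
    if arf_file is not None:
        ui.load_arf(arf_file)        # effective area
    if rmf_file is not None:
        ui.load_rmf(rmf_file)        # redistribution matrix

    ui.notice(lo, hi)                # restrict the fit to lo..hi (keV by default)
    if subtract_bkg:
        ui.subtract()                # requires a background to be attached

    ui.set_source(ui.powlaw1d.psrc)  # one model instance, named psrc
    ui.set_stat("chi2gehrels")       # one of the chisq flavours
    ui.set_method("levmar")
    ui.fit()
    ui.conf()                        # confidence intervals on thawed parameters

    return ui.get_fit_results(), ui.get_conf_results()
```

A call would look like fit_power_law("source.pha", subtract_bkg=True), where source.pha stands in for your own file.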

Timestamps:
v1:2007-dec-18
v2:2008-feb-20
v3:2010-apr-30

some python modules

hlee — Fri, 13 Nov 2009 21:46:54 +0000

I was told to stay away from python and I’ve obeyed the order sincerely. However, I collected the following stuff several months back at the instant of hearing about import inference and I hate to see it getting obsolete. At that time, collecting these modules and getting through them could help me complete the first step toward the quest Learning Python (the first posting of this slog).

There are quite a few websites dedicated to python, as you already know. Some of them talk only to astronomers. A tiny fraction of those websites are for statisticians, but I haven’t met any statistician preferring only python. We take the gist of various languages. So, I’ll leave a general website aggregation, such as AstroPy (I think this website is extremely useful for astronomers), to enrich your bookmarks under the “python” tab regardless of your profession. Instead, I’ll discuss some python libraries and modules that can be useful for those exercising astrostatistics and can make their work easier. I must say that I intentionally omitted a few modules because I was not sure about their public availability and copyright sensitivity. If you have modules that can be introduced publicly, let me know. I’ll be happy to add them. If my description is improper and you want a module taken off, also let me know.

Over the past few years, python has become the most common and versatile scripting language for both communities, and therefore, I believe, it will accelerate many collaborations. Much of my time is spent finding out how to read, maneuver, and handle raw data and images. Most of the astronomers’ tactics are quite unfamiliar to me, and sometimes they do not make sense to me (see my read.table() and data analysis system and its documentation). Somehow, one scripting language, thanks to its open and free intention toward all communities, promises to narrow the gap for prosperous and efficient collaborations: Python.

The first posting on this slog was about Python. I thought that kicking off with a computer language relatively new and open to many communities could motivate me and others toward more interdisciplinary and diverse work. After a few years, unfortunately, I haven’t achieved that goal. Yet, I still think that the libraries and modules introduced below will be useful for your transition from other programming languages, or for writing your own pro bono wrappers for better communication with others.

I’ll take numpy, scipy, and RPy for granted. For plotting, matplotlib seems most common.

Reading astronomical data (click links to download libraries, modules, and tutorials)

First, start with Using Python for Interactive Data Analysis (in pdf), a quite useful manual, particularly for IDL users. It compares the pros and cons of Python and IDL.
IDLsave: simply, without IDL, a .save file becomes readable. This is a brilliant small module.
PyRAF (I was really frustrated with IRAF and spent many sleepless nights. Apart from data reduction, I don’t remember much statistics in IRAF except simple statistics for Gaussian populations. I guess PyRAF does a better job). And there’s PyFITS for handling FITS format data.
APLpy (the Astronomical Plotting Library in Python) is a Python module aimed at producing publication-quality plots of astronomical imaging data in FITS format (this introduction is copied from the APLpy site).

Statistics, Mathematics, or data science
Thanks to RPy, introducing smaller modules may not seem worthwhile, but quite a few modules and libraries for statistics are available that do not rely on R.

MDP (Modular toolkit for Data Processing)
Multivariate data analysis methods like PCA, ICA, FA, etc. have become very popular in the astronomical community.
pywavelets (Not only the FT but various other transformation methodologies are often used, and the wavelet transformation ranks at the top).
PyIMSL (see my post, PyIMSL)
PyMC I introduced this module in a century ago. It may lack versatility or robustness due to its parametric distribution objects, but I liked the tutorial very much, from which one can expand and devise one’s own working MCMC algorithm.
PyBUGS (I introduced this python wrapper in BUGS but the link to PyBUGS is not working anymore. I hope it revives.)
SAGE (Software for Algebra and Geometry Experimentation) is a free open-source mathematics software system licensed under the GPL (Link to the online tutorial).
python_statlib descriptive statistics for the python programming language.
PYSTAT Nice website but the product is not available yet. Be aware! It is not PhyStat!!!

Module for AstroStatistics
import inference (Unfortunately, the links to examples and tutorial are not available currently)

Without clear objectives, it is not easy to pick up a new language. If you are used to working with one from the alphabet soup, you will most likely adhere to your choice. Changing alphabets or switching language names only happens when your instructor specifically asks you to use their preferred language or when analysis {modules, libraries, tools} are only available within that preferred language. Somehow, thanks to its object oriented style, python makes transition and communication easier than other languages. Furthermore, scripting languages are more intuitive and easier to interpret.

Where is ciao X ?

hlee — Thu, 30 Jul 2009 06:57:00 +0000

X={ primer, tutorial, cookbook, Introduction, guidebook, 101, for dummies, …}

I’ve heard many times about the lack of documentation for this extensive data analysis system, ciao. I saw people still using ciao 3.4 although the new version 4 has been available for many months. Although ciao is not the only tool for Chandra data analysis, it was specifically designed for it. Therefore, I expected it to be used frequently and to be popular. But the reality is against my expectation. Whatever (fierce) discussion I’ve heard has been irrelevant to me because ciao is not intended for statistical analysis. Then, all of a sudden, after many months, a realization hit me: ciao is different from other data analysis systems and software. This difference has been a hampering factor in introducing ciao outside the Chandra scientist community and in gaining popularity. This difference was the reason I often got lost trying to find suitable documentation.

http://cxc.harvard.edu/ciao/ is the website to refer to when you start using ciao, and manuals are listed here: manuals and memos. The aforementioned difference is that I’m used to seeing an Introduction, Primer, Tutorial, or Guide for Beginners on the front page or the manual websites, but not on the ciao websites. From such introductory documentation, I can stretch out to other specific topics, modules, tool boxes, packages, libraries, plug-ins, add-ons, applications, etc. Tutorials provide the momentum for learning and utilizing data analysis systems. However, the layout of the ciao manual websites seems not intended for beginners. It was hard to find the basics when some specific task with ciao and its tools got stuck. The websites might be useful as references only for Chandra scientists who have been using ciao for a long time, but not beyond. They could be handy for experts instructing novices side by side, so that they can give better hands-on instruction.

I’ll contrast with other popular data analysis systems and software.

When I began to use R, I started with the R manual page containing this pdf file, Introduction to R. Based on this introductory documentation, I could learn specific task oriented packages easily and could build more of my own data analysis tools.
When I began to use Matlab, I was told to get the Matlab primer. Although the current edition is commercial, free copies of old editions are available via search engines or course websites. Other tutorials are available as well. After a crash course in the basics of Matlab, it was not difficult to get the right tool boxes for topic specific data analysis and to script for particular needs.
When I began to use SAS (Statistical Analysis System), people in the business said to get the little SAS book, which gives the basics of this gigantic system, from which I was able to expand its usage for particular statistical projects.
Recently, I began to learn Python to use the many astronomical and statistical data analysis modules developed by various scientists. Python has its tutorials, which I can turn to for the basics in order to fully utilize those task specific modules and my own scripting.
Commercial software often comes with its own beginners’ guide and demos that a user can follow easily. By acquiring the basics from these tutorials, expanding applications can be well directed. On the other hand, non-commercial software may lack extensive, centralized tutorials, unlike python and R. Nonetheless, acquiring tutorials for teaching is easy, and these unlicensed materials are very handy whenever problems are confronted in various task specific projects.
I used to have IDL tutorials on which I relied a lot to use some astronomy user libraries and CHIANTI (atomic database). I guess the resources of tutorials have changed dramatically since then.

Even though I’ve navigated the ciao website and its voluminous threads so many times, I have only now come to realize that there’s no beginner’s guide that could be called a ciao cookbook, ciao tutorial, ciao primer, ciao for dummies, or introduction to ciao at a visible location.

This is a cultural difference. My personal thought is that this tradition prevents non-Chandra scientists from using data in the Chandra archive. The good news is that there have been ciao workshops, and materials from the workshops are still available. I believe compiling these materials in the fashion of other beginners’ guides that introduce a data analysis system can be a good starting point for writing up a front-page worthy tutorial. The existence of this introductory material could encourage more people to use and to explore Chandra X-ray data. I hope these tutorials from other software and data analysis systems (primer, cookbook, introduction, tutorial, or ciao for dummies) can be good guidelines for fully composing a ciao primer.

a century ago

hlee — Thu, 07 May 2009 19:22:37 +0000

Almost 100 years ago, A.S. Eddington stated in his book Stellar Movements (1914) that

…in calculating the mean error of a series of observations it is preferable to use the simple mean residual irrespective of sign rather than the mean square residual

Such an eminent astronomer already preferred least absolute deviation over chi-square, if I match simple mean residual and mean square residual to the relevant methodologies, in order.
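To make that matching explicit, one way to write the two quantities and the fitting criteria they lead to is sketched below. The symbols (residuals r_i = O_i − E_i, per-point errors σ_i, model parameters θ) are notation assumed here, not taken from Eddington’s book.

```latex
% Two scale estimates built from the residuals r_i = O_i - E_i
\hat{\sigma}_2 = \sqrt{\frac{1}{N}\sum_{i=1}^{N} r_i^{2}}
\quad \text{(root mean square residual)},
\qquad
\hat{\sigma}_1 = \frac{1}{N}\sum_{i=1}^{N} \lvert r_i \rvert
\quad \text{(simple mean residual)}

% The corresponding best-fit criteria
\hat{\theta}_{\chi^2} = \arg\min_{\theta} \sum_{i=1}^{N}
  \left( \frac{O_i - E_i(\theta)}{\sigma_i} \right)^{2},
\qquad
\hat{\theta}_{\mathrm{LAD}} = \arg\min_{\theta} \sum_{i=1}^{N}
  \lvert O_i - E_i(\theta) \rvert
```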

I guess there is a reason everything is done based on the chi-square, although a small fraction of the astronomy community is aware that chi-square minimization is not the only utility function for finding best fits. The assumption that the residuals “(Observed – Expected)/sigma”, as written in chi-square methods, are (asymptotically) normal, i.e. Gaussian, is freely adopted for astronomical data (by the astronomers who analyze these data), mainly because of their high volume. The worst case is that even though procedures for checking Gaussianity are available in the statistical literature, applying those procedures to astronomical data is either difficult or ignored. Anyway, if one is sure that the data/parameters of interest are sampled from a normal distribution, Eddington’s statement had better be reversed because of sufficiency. We also know the asymptotic efficiency of the sample standard deviation when the probability density function satisfies more general regularity conditions than the Gaussian density.

As a statistician, it is easy to say, “assume that data are iid standard normal, wlog.” And then, develop a statistic, investigate its characteristics, and compare it with other statistics. If this statistic does not show promising results in the comparison and strictly suffers from the normality assumption, then statisticians will attempt to make the statistic robust by checking and relaxing assumptions and math. On the other hand, I’m not sure how comfortable astronomers feel with this Gaussianity assumption in their data, most of which are married to statistics or statistical tools based on the normal assumption. How often have efforts been made to devise a statistic and to try different statistics?

Without imposing the Gaussianity assumption, I think that Eddington’s comment is extremely insightful. Commonly cited statistical methods in astronomy, like chi-square methods, are built on the Gaussianity assumption, under which the sample standard deviation is used for σ, the scale parameter of the normal distribution that is mapped to 68% coverage, and multiples of the sample standard deviation correspond to well known percentiles as given in Numerical Recipes. In the end, I think statistical analysis in the astronomy literature suffers from a dilemma: “which came first, the chicken or the egg?” On the other hand, I feel a setback because such an insightful comment from one of the most renowned astrophysicists didn’t gain much weight after many decades. My understanding that Eddington’s suggestion was ignored comes from reading only a fraction of the publications in major astronomical journals; therefore, I might be wrong. Probably, astronomers use LAD and do robust inference more often than I think.

Unless one is sure about the Gaussianity of the data (covering the sampling distribution, residuals between observed and expected, and some transformations), for inference problems the sample standard deviation may not be appropriate for getting error bars with matching coverage. Estimating posterior distributions is a well received approach among some astronomers, and there are good tutorials and textbooks about Bayesian data analysis for astronomers. For those familiar with the basics of statistics, pyMC and its tutorial (or another link from python.org) will be very useful for proper statistical inference. If Bayesian computation sounds too cumbersome, for simplicity, follow Eddington’s advice. Instead of the sample standard deviation, use the mean absolute deviation (simple mean residual, in Eddington’s words) to quantify uncertainty. Perhaps one will want to compare best fits and error bars from both strategies.
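A minimal sketch of that comparison, using only the Python standard library, is given here. The function names and the toy numbers are made up for illustration. The factor sqrt(pi/2) is the usual rescaling that makes the mean deviation estimate σ when the data really are Gaussian, since E|X − μ| = σ·sqrt(2/π) for a normal distribution.

```python
import math
import statistics


def mean_absolute_deviation(values):
    """Mean of |x - mean(x)|, i.e. Eddington's "simple mean residual"."""
    center = statistics.fmean(values)
    return sum(abs(x - center) for x in values) / len(values)


def sigma_from_mean_deviation(values):
    """Mean deviation rescaled to estimate sigma under a Gaussian assumption."""
    return mean_absolute_deviation(values) * math.sqrt(math.pi / 2)


# Toy measurements (invented numbers), once without and once with an outlier.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
with_outlier = clean + [14.0]

for label, sample in (("clean", clean), ("with outlier", with_outlier)):
    print(label,
          "stdev:", round(statistics.stdev(sample), 3),
          "rescaled mean deviation:", round(sigma_from_mean_deviation(sample), 3))
```

On these toy numbers the single outlier inflates the sample standard deviation by a larger factor than it inflates the mean deviation, which is the practical appeal of the latter when Gaussianity is in doubt.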

——————————————————
This story was inspired by Studies in the History of Probability and Statistics. XXXII: Laplace, Fisher, and the discovery of the concept of sufficiency by Stigler (1973) Biometrika v. 60(3), p.439. The quote from Eddington was adapted from this article. Another quote from this article I’d like to share:

Fisher went on to observe that this property of σ₂^[1] is quite dependent on the assumption that the population is normal, and showed that indeed σ₁^[2] is preferable to σ₂, at least in large samples, for estimating the scale parameter of the double exponential distribution, providing both estimators are appropriately rescaled

By assuming that each observation is normally (Gaussian) distributed with mean (mu) and variance (sigma^2), and that the object was to estimate sigma, Fisher proved that the sample standard deviation (or mean square residual) is more efficient than the mean deviation from the sample mean (or simple mean residual). Laplace proved it as well. The catch is that the assumptions come first, not the sample standard deviation, when estimating the error (or sigma) of an unknown distribution.
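The efficiency statements can be checked with a small simulation. The sketch below is an illustration under my own choices (sample size, number of trials, seed, and function names are all assumptions), not a reproduction of Fisher’s or Stigler’s calculation. Each estimator is computed on many simulated samples, and its scatter is divided by its mean so that both are compared on a common scale.

```python
import math
import random
import statistics


def scale_estimates(sample):
    """Return (sample standard deviation, mean deviation from the sample mean)."""
    n = len(sample)
    center = sum(sample) / n
    std_dev = math.sqrt(sum((x - center) ** 2 for x in sample) / (n - 1))
    mean_dev = sum(abs(x - center) for x in sample) / n
    return std_dev, mean_dev


def relative_scatter(draw, n=50, trials=5000, seed=1):
    """Coefficient of variation of each estimator over repeated samples.

    Dividing the scatter by the mean removes the different scaling of the
    two estimators, so a smaller value means a more efficient estimator.
    """
    rng = random.Random(seed)
    std_devs, mean_devs = [], []
    for _ in range(trials):
        sd, md = scale_estimates([draw(rng) for _ in range(n)])
        std_devs.append(sd)
        mean_devs.append(md)

    def cv(values):
        return statistics.stdev(values) / statistics.fmean(values)

    return cv(std_devs), cv(mean_devs)


def draw_normal(rng):
    return rng.gauss(0.0, 1.0)


def draw_double_exponential(rng):
    # The difference of two unit exponentials follows a Laplace distribution.
    return rng.expovariate(1.0) - rng.expovariate(1.0)


normal_sd, normal_md = relative_scatter(draw_normal)
laplace_sd, laplace_md = relative_scatter(draw_double_exponential)
print("normal:             stdev", round(normal_sd, 4), " mean deviation", round(normal_md, 4))
print("double exponential: stdev", round(laplace_sd, 4), " mean deviation", round(laplace_md, 4))
```

With these settings the standard deviation should show the smaller relative scatter for normal data and the mean deviation the smaller one for double exponential data, in line with the quote.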

[1] sample standard deviation
[2] mean deviation from the sample mean

BUGS

hlee — Tue, 16 Sep 2008 20:34:23 +0000

Astronomers tend to think in a Bayesian way, but their Bayesian implementations are very limited. OpenBUGS, WinBUGS, GeoBUGS (BUGS for geostatistics; for example, modeling spatial distribution), R2WinBUGS (R BUGS wrapper) or PyBUGS (Python BUGS wrapper) could boost their Bayesian eagerness. Oh, by the way, BUGS stands for Bayesian inference Using Gibbs Sampling.

Disclaimer: I have never done serious Bayesian computations, so the information I provide here tends to be very shallow. Both statisticians and astronomers oriented toward Bayesian ideals are very welcome to add more advanced information.

Bayesian statistics is very much preferred in astronomy, at least here at the Harvard-Smithsonian Center for Astrophysics. Yet, I do not understand why astronomy data analysis packages do not include libraries, modules, or toolboxes for MCMC similar to WinBUGS or OpenBUGS (porting scripts from Numerical Recipes or IMSL, or using Python, does not count here since these are also used by engineers and scientists of other disciplines; my view is also restricted by my limited experience in using astronomical data analysis packages like ciao, XSPEC, IDL, IRAF, and AIPS). Most Bayesian analysis in astronomy has to be done from scratch, which drives off simple minded people like me (I prefer analytic forms and estimators to posterior chains). I hope easily implementable Bayesian data analysis modules come along soon in current astronomical data analysis systems, for any astronomers who have only had a lecture about Bayes’ theorem and Gibbs sampling. Perhaps BUGS can be a role model for developing such modules.

As listed, one does not need R to use BUGS. WinBUGS is both stand-alone and callable from R. PyBUGS can be handy since python is popular among astronomers. I heard that MATLAB (and its open source counterpart, OCTAVE) has its own tools to handle Bayesian data analysis relatively easily. There are many small MCMC modules that solve particular problems in astronomy, but none of them is reported to be robust enough to be applied to other types of data sets. Not many offer the freedom of choosing models and priors.
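For anyone who has only had that one lecture on Gibbs sampling, here is roughly what “from scratch” means in the simplest case. This is a toy sketch in plain Python, not how BUGS or PyBUGS work internally: a normal model with unknown mean and variance, a flat prior on the mean and a 1/σ² prior on the variance (both priors, the function name gibbs_normal, and the simulated data are my assumptions for illustration). The sampler alternates between the two conditional distributions.

```python
import random
import statistics


def gibbs_normal(data, n_iter=5000, burn_in=500, seed=0):
    """Gibbs sampler for (mu, sigma^2) of a normal model.

    Assumed priors: p(mu) flat, p(sigma^2) proportional to 1 / sigma^2.
    Returns the list of (mu, sigma^2) draws kept after burn-in.
    """
    rng = random.Random(seed)
    n = len(data)
    ybar = statistics.fmean(data)
    mu, sigma2 = ybar, statistics.variance(data)   # starting values
    draws = []
    for i in range(n_iter):
        # mu | sigma^2, data  ~  Normal(ybar, sigma^2 / n)
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        # 1 / sigma^2 | mu, data  ~  Gamma(shape = n / 2, rate = S / 2),
        # where S is the sum of squared deviations from the current mu.
        s = sum((y - mu) ** 2 for y in data)
        precision = rng.gammavariate(n / 2, 2.0 / s)  # second argument is a scale
        sigma2 = 1.0 / precision
        if i >= burn_in:
            draws.append((mu, sigma2))
    return draws


# Simulated data with a known answer: mean 10, standard deviation 2.
data_rng = random.Random(42)
data = [data_rng.gauss(10.0, 2.0) for _ in range(200)]

draws = gibbs_normal(data)
posterior_mu = statistics.fmean([m for m, _ in draws])
posterior_sigma2 = statistics.fmean([v for _, v in draws])
print("posterior mean of mu:", round(posterior_mu, 3))
print("posterior mean of sigma^2:", round(posterior_sigma2, 3))
```

Even this tiny case needs the conditional distributions worked out by hand, which is exactly the burden a BUGS-like module takes off the user.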

Hopefully, knowledgeable Bayesians will contribute to developing modules for Bayesian data analysis in astronomy. I don’t like to see contour plots, obtained from brute-force and blind χ² fitting, claimed to be bivariate probability density profiles. I’d like to see modules developed in astronomical data analysis packages, with various Bayesian libraries, in the way that BUGS is developed. Here are some web links about BUGS:
The BUGS Project
WinBUGS
OpenBUGS
Calling WinBUGS 1.4 from other programs

PyIMSL

hlee — Thu, 28 Aug 2008 00:08:14 +0000

PyIMSL is a collection of Python wrappers to the math and statistical algorithms in the IMSL C Numerical Library^[1]. I recall the days of digging in IMSL (International Mathematics and Statistics Library) user manuals and learning Fortran and C to use this vast library (Splus was too slow at that time). Knowing that Python is much favored among astronomers (click here to see the slog posts about Python) and that Numerical Recipes has its limits (I didn’t check the latest version published last year, though), IMSL is probably useful for mathematical and statistical analysis for astronomers.

To know more, check these websites.
Press release about PyIMSL
Visual Numerics’ IMSL Numerical Libraries and
IMSL Libraries Technical Documentation (one can check what mathematical and statistical analysis tools are available from that documentation)

[1] cited from http://en.wikipedia.org/wiki/IMSL

R-[{Perl,Python}] Interface

hlee — Tue, 13 May 2008 19:47:49 +0000

The brackets could be filled with other languages but two are introduced today: Perl (perl.org) and Python (python.org). These two are widely used among astronomers and can be empowered by R (r-project.org).

R/SPlus – Perl Interface
R/SPlus – Python Interface

Posts on R and Python from the slog
Learning R
Learning Python

Learning Python

hlee — Mon, 22 Jan 2007 09:08:36 +0000

Both in astronomy and statistics, python is recognized as a versatile programming language. I asked Alanna about python tutorials. The following is her answer, which looks very useful for those who wish to learn python.

————————————————————————
1/ Python basics:
My favorite Intro to Python website is the Tutorial by Guido van Rossum (python founder):
http://docs.python.org/tut/

I also find other references at this site to be useful: http://docs.python.org

For more complicated questions I often search: http://www.python.org/

2/ Scientific python: numarray, numpy. These modules allow one to use APL/IDL-like syntax with matrices (i.e. implicit loops over indices when doing many common operations). They also have some handy scientific functions. Eventually “numarray” will be replaced by “numpy”, but it hasn’t happened yet (because of “pyfits” for fits files). It should happen this year (November??).

Numarray home page: http://www.stsci.edu/resources/software_hardware/numarray

Numpy home pages: http://sourceforge.net/projects/numpy/ for code; and http://numpy.scipy.org/ for an overview.
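As a small illustration of those “implicit loops over indices”, here is a sketch assuming numpy is installed. The flux and error values are invented for the example.

```python
import numpy as np

# Invented example values: fluxes and their 1-sigma errors.
flux = np.array([1.0, 4.0, 9.0, 16.0])
err = np.array([0.5, 0.5, 1.0, 2.0])

# Element-wise division: no explicit loop over the indices is written.
snr = flux / err

# Boolean masks select elements, much like WHERE in IDL.
bright = flux[snr > 5]

print(snr)      # signal-to-noise ratio of each element
print(bright)   # fluxes whose signal-to-noise exceeds 5
```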

3/ For fits files, the astronomical programmers at Space Telescope Science Institute also wrote “pyfits”: http://www.stsci.edu/resources/software_hardware/pyfits