The AstroStat Slog » ascii

accessing data, easier than before but…

hlee — Tue, 20 Jan 2009 17:59:56 +0000

Someone emailed me asking for the globular cluster data sets I used in a proceedings paper, which was about how to determine multi-modality (multiple populations) based on well known and new information criteria without binning the luminosity functions. I spent quite some time trying to understand the data sets with suspicious numbers of globular cluster populations. On the other hand, obtaining the globular cluster data sets was easy because of available data archives such as VizieR. For most data sets presented in charts or tables, I acquire the data from VizieR. In order to understand the science behind those data sets, I check ADS. Well, actually it happens the other way around: I check the scientific background first to assess whether there is room for statistics, then search for available data sets.

However, if you are interested in massive multivariate data, or if you want a subsample from a gigantic survey project that cannot be documented in full the way those individual small catalogs are, you might like to learn a little about Structured Query Language (SQL). With nice examples and explanations, some terabytes of data are available from SDSS. Instead of images in FITS format, one can get ascii/table data sets (variables for millions of objects include magnitudes and their errors; positions and their errors; classes like stars, galaxies, AGNs; types or subclasses like elliptical galaxies, spiral galaxies, type I AGN, type Ia, Ib, Ic, and II SNe, various spectral types, etc.; estimated variables like photo-z, which is my keen interest; and more). Furthermore, thousands of papers related to SDSS are available to satisfy your scientific cravings. (Here are Slog postings under the SDSS tag.)
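To give a flavor of what "a subsample via SQL" means, here is a minimal sketch in Python using the standard sqlite3 module and a tiny in-memory table. The table name, the column names (objid, ra_deg, dec_deg, obj_class, r_mag, photo_z), and the five rows are all made up for illustration; they are not the SDSS schema, and a real query would be sent to the survey's own SQL interface.

```python
import sqlite3

# Hypothetical toy catalog: invented rows and column names,
# NOT the actual SDSS schema.
rows = [
    (1, 150.10, 2.20, "GALAXY", 17.8, 0.12),
    (2, 150.35, 2.41, "STAR", 15.2, None),   # no photo-z for a star
    (3, 151.02, 2.05, "GALAXY", 19.6, 0.45),
    (4, 149.87, 1.93, "QSO", 18.9, 1.30),
    (5, 150.60, 2.75, "GALAXY", 18.4, 0.21),
]


def build_catalog(rows):
    """Create an in-memory table and load the toy rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE catalog ("
        "objid INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL, "
        "obj_class TEXT, r_mag REAL, photo_z REAL)"
    )
    conn.executemany("INSERT INTO catalog VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def bright_objects(conn, obj_class, mag_limit):
    """Select objects of one class brighter than mag_limit, brightest first."""
    query = """
        SELECT objid, r_mag, photo_z
        FROM catalog
        WHERE obj_class = ? AND r_mag < ?
        ORDER BY r_mag
    """
    return conn.execute(query, (obj_class, mag_limit)).fetchall()


conn = build_catalog(rows)
subsample = bright_objects(conn, "GALAXY", 19.0)
for objid, r_mag, photo_z in subsample:
    print(objid, r_mag, photo_z)
```

The point is only that the selection (class, magnitude cut, chosen columns) is done by the database before anything reaches your statistical tool, so the subsample arrives as a plain table.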

If you don’t want to limit yourself to ascii tables, you may like to check the quick guide/tutorial of Gator, which aggregates archives of various missions: 2MASS (Two Micron All-Sky Survey), IRAS (Infrared Astronomical Satellite), Spitzer Space Telescope Legacy Science Programs, MSX (Midcourse Space Experiment), COSMOS (Cosmic Evolution Survey), DENIS (Deep Near Infrared Survey of the Southern Sky), and USNO-B (United States Naval Observatory B1 Catalog). Probably, you also want to check NED, the NASA/IPAC Extragalactic Database. As of today, the website says that 163 million objects, 170 million multiwavelength object cross-IDs, 188 thousand associations (candidate cross-IDs), 1.4 million redshifts, and 1.7 billion photometric measurements are accessible, which seems more than enough for data mining, exploring/summarizing data, and developing streaming/massive data analysis tools.

Probably, astronomers might wonder why I’m not advertising the Chandra Data Archive (CDA) and its project oriented catalogs/databases. All I can say is that it’s not friendly to independent statisticians. It is very likely that I am the only statistician who has tried to use data from CDA directly and bothered to understand the contents. I can assure you that without astronomers’ help, the archive is just a hot potato. You don’t want to touch it. I’ve been there. Regardless of how painful it is, I’ve kept trying to touch it, since it’s hard to resist after knowing what’s in there. Fortunately, there are other archives, friendlier to data scientists, that cause much less suffering than CDA. There are plenty of things statisticians can do, not just with CDA but with other astronomical data archives: improve astronomers’ decades-old data analysis algorithms based on the Gaussian distribution, the iid assumption, or the L₂ norm; and develop robust analysis strategies that reflect the true nature of the data under more relaxed assumptions than the traditionally pursued parametric distributions with specific models (a distribution free method is more robust than one assuming a Gaussian distribution, but the latter is more efficient when the assumption holds). Archives like VizieR or SDSS provide data sets that are less painful to explore without familiarity with astronomical software/packages.
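The robustness versus efficiency remark can be illustrated with a toy comparison of the sample mean (the natural estimator under a Gaussian model) and the sample median (a simple distribution free alternative). This is only a sketch in Python with invented numbers and an arbitrary random seed, not an analysis of any archive data.

```python
import random
import statistics

# Robustness: one wild value moves the mean a lot, the median hardly at all.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]   # invented measurements
dirty = clean + [100.0]                 # same data plus one outlier

mean_shift = statistics.mean(dirty) - statistics.mean(clean)
median_shift = statistics.median(dirty) - statistics.median(clean)
print("shift of mean:  ", mean_shift)
print("shift of median:", median_shift)

# Efficiency: when the data really are Gaussian, the mean fluctuates less
# from sample to sample than the median does.
rng = random.Random(0)                  # arbitrary seed for repeatability
means, medians = [], []
for _ in range(2000):
    draw = [rng.gauss(0.0, 1.0) for _ in range(25)]
    means.append(statistics.mean(draw))
    medians.append(statistics.median(draw))

var_mean = statistics.variance(means)
var_median = statistics.variance(medians)
print("variance of sample means:  ", var_mean)
print("variance of sample medians:", var_median)
```

The first half shows what is gained by dropping the Gaussian assumption, the second half what is paid for it when the assumption happens to be true.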

Computer scientists are well aware of the UCI machine learning archive, with which they can compare their new methods against previous ones and empirically show how superior their methods are. Statisticians are used to handling well trimmed data; otherwise, we suggest strategies for how to collect data for statistical inference. Although tons of data collecting and sampling protocols exist, most of them do not match the formats, types, and nature of astronomical data, or the way the data are collected by observing the sky via complexly structured instruments. Some archives might be largely exclusive to the funded researchers and their beneficiaries. Some archives might be super hot potatoes that no statistician wants to get involved with even though they are free of charge. Overall, I’d like to warn you not to expect the well tabulated simplicity of the textbook data sets found in exploratory data analysis and machine learning books.

Someone will raise another question: why do I not discuss VOs (virtual observatories, click for slog postings) and Google Sky (click for slog postings), which I have praised in the slog many times as good resources for exploring the sky and learning astronomy? Unfortunately, for the purpose of direct statistical applications, neither VOs nor Google Sky may be as appealing as their names suggest. It is very likely that you will spend hours exploring these facilities and later end up with one of the archives or web interfaces that I mentioned above. It would be easier to talk to your nearest astronomer, who hopefully is aware of the importance of statistics and could offer you a statistically challenging data set, so that you need not worry about how to process and clean raw data sets or how to build statistically suitable catalogs/databases. Every astronomer on a survey project builds his/her catalog and finds common factors/summary statistics of the catalog from the perspective of understanding/summarizing data, the primary goal of executing statistical analyses.

I believe some astronomers want to advertise their archives and show off how public friendly they are. Such advertising comments are very welcome because I intentionally left room for them instead of listing more archives that I have heard of but have no hands-on experience with. My only wish is that more statisticians can use astronomical data from these archives so that the application sections of their papers are filled with data from these archives. As with the sunspot data, I wish that more astronomical data sets can be used to validate methodologies, algorithms, and eventually theories. I sincerely wish that this will happen soon, before I drift away from astrostatistics to make ends meet and can no longer preach about the benefits of astronomical data and their archives.

There is no single well known data repository in astronomy like the UCI machine learning archive. Nevertheless, I can assure you that astronomical data and catalogs by their nature bear various statistical problems, and many of those problems have never been properly formulated as statistical inference problems. There are so many statistical challenges residing in them. Not enough statisticians bother to look at these data because of the gigantic demand for statisticians from uncountably many data oriented scientific disciplines and the persistent shortage in supply.

read.table()

hlee — Mon, 27 Oct 2008 15:05:27 +0000

The first step of data analysis or applications is reading the data sets into a tool of choice. In recent years, I’ve been using R (see also Learning R) for that purpose, but I’ve enjoyed the freedom of using these languages and tools for the same purpose: BASIC, fortran77/90/95, C/C++, IDL, IRAF, AIPS, mongo/supermongo, MATLAB, Maple, Mathematica, SAS, SPSS, Gauss, ARC, Minitab, and recently Python and ciao, which I just began to learn. For many of them I have lost my fluency. Quick learning tends to be flash memory. Some will need brain defragmentation and recovery time before extensive scientific work. A few I don’t like to use at all. No matter what, I’m not a computer geek. I’m not good with new gadgets or new software, nor do I welcome new and allegedly versatile computing systems. But one must be if he/she wants to handle data. Until recently I believed R had such versatility when it comes to reading in data. Yet, there is nothing without exceptions.

From time to time, I have said that, among many factors, FITS format data make it difficult for statisticians and astronomers to work together. Statisticians cannot read in FITS format unless astronomers convert it into ascii or jpeg format for them, whereas astronomers do not want to waste their busy time on a chore like file format conversion, which wastes computer resources as well. A peaceful reunion happens only when the data analysis becomes intractable via the traditional methodology described in Numerical Recipes or Bevington and Robinson. They realize that (new) statistical theory needs to be found, and collaboration happens with the involvement of graduate students from both fields, who patiently do many tedious jobs while learning (I missed this part while I was a graduate student, for which I sometimes thank my advisor).

Now, let’s get back to the title. read.table()^[1] is a commonly used command in R when you read in data in ascii format. It reads in data intelligently. As I said, it has been versatile enough. Numerals are stored in numeric format, letters in character format, missing values as NA, etc. read.table() makes it easy to jump into data analysis right away. Well, now you know why I write this. I confronted a case where read.table() does not read things correctly with astronomical data “even in ascii format,” which had never happened to me since I began to use S-Plus/R.

Although I know how to fix this simple problem, which I’ll describe later, I want to point out the lack of compatibility in data formats between the two communities and in the common tools used for accessing data sets, which, I believe, is one of the biggest factors that prevent statisticians without astronomical training from participating in collaborations. I’ve mixed tools in consulting courses to assist clients from various disciplines (grad students from agriculture, horticulture, physiology, social science, and psychology were my clients) and in executing projects in electrical engineering and computational physics (these heavily rely on MATLAB), but with R, reading data was the simplest and most fundamental step, one that I didn’t have to worry about across various data sets (probably, those graduate students and professors of engineering and physics provided well trimmed and proven data sets).

When you have a long way to go to complete your mission and you stumble at your first step, I think it’s easy to lose eagerness for the future unless there’s support from your colleagues. Instead, I most likely receive discouraging comments such as “Why use R?” or “You won’t have such problems if you use other tools” (although it takes a bit of extra time to maneuver, I eventually get there). Such frustrating comments degrade eagerness further. So, from the 100% I normally begin with, only 25% eagerness is left after two discouraging moments at the initial step of a data analysis whose end is invisibly far away. I just hang on to this 25%, still big by the normal standard, and I wish for it to last until the final step without the exponential decay that happened at the beginning.

Ah, the example I promised. Click here for one example (from XAtlas) and check whether read.table() can do the job in one shot when the 3rd column is your x and the 4th column is your y. It’ll produce a beautiful spectrum if the data points are read in properly as numerals. My trick was to use awk to extract those two columns, because the rows have unequal numbers of entries, and then read the result into R. Unfortunately, even with this two-step work, read.table() in R recognized the entries as categorical data. To stop R from recognizing the entries as categorical data, between the two steps you must fix the cause that makes read.table() read what look like numerals as categorical. If you investigate the data files carefully you’ll find why; however, it’s a rather tedious job when one has a thousand entries in each data file and there are numerous data files. Without further information, this effort will be the same as writing a scanf()/READ line in C/Fortran by counting column by column to type the correct floating point format. This manifests the differences in table formatting between astronomers and statisticians, including scientists from ecometrics, econometrics, psychometrics, biometrics, bioinformatics, and other fields whose names carry a statistics related suffix.
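The two-step idea (pull out the 3rd and 4th columns, then make sure they are numeric before plotting) can be sketched as follows. This is Python rather than awk plus R, and the sample lines are invented: the post does not say what the offending entries in the XAtlas file are, so the "---" token and the short last row below are purely hypothetical stand-ins. The function name extract_xy is also just an illustrative choice.

```python
def extract_xy(lines, x_col=3, y_col=4):
    """Return (pairs, rejects) from whitespace-delimited text lines.

    Columns are 1-based, as in awk. Rows that are too short, or whose
    chosen fields do not parse as floats, are collected in `rejects`
    with their line number instead of silently turning a whole column
    into text.
    """
    pairs, rejects = [], []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if len(fields) < max(x_col, y_col):
            rejects.append((lineno, line.strip()))
            continue
        try:
            x = float(fields[x_col - 1])
            y = float(fields[y_col - 1])
        except ValueError:
            rejects.append((lineno, line.strip()))
            continue
        pairs.append((x, y))
    return pairs, rejects


# Made-up sample, NOT taken from XAtlas: ragged rows plus one
# non-numeric token ("---") in the y column.
sample = """\
# order bin wavelength flux note
1 1 1.2400 0.0031 ok
1 2 1.2425 0.0040
1 3 1.2450 ---
1 4 1.2475 0.0052 blend extra
1 5
"""

pairs, rejects = extract_xy(sample.splitlines())
print("numeric (x, y) pairs:", pairs)
print("lines needing a look:", rejects)
```

Listing the rejected lines with their line numbers is the part that saves the tedious search: it tells you which entries made the column non-numeric, so you can decide whether to treat them as missing values or fix the file.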

Except for such an artifact (or cultural difference), XAtlas is a great catalog for statisticians in functional data analysis who look for examples of non smooth curves to deal with. New strategies and statistical applications will help astronomers see such unprecedented data sets better. Perhaps, or rather with more certainty, your 25% will grow back to 100% once you see those spectra and other metrics in your own plotting windows.

[1] Click here for the explanation of the read.table() function, and
click here for the reason why read.table() is so inefficient.

FITS to ASCII

hlee — Sun, 28 Jan 2007 03:47:57 +0000

Generally, astronomical data archives are open to the public. Also, astronomy has been a leading force in developing software and hardware to handle massive data, which nowadays are in the spotlight in statistics. Although astronomical data look easy to access for statistical challenges compared to data sets of other disciplines, statistical applications to astronomical data are rarely found. What is the cause of this long engagement period?

The data format…so called FITS.

Yet, ASCII format catalogues are available (for example, VizieR, http://vizier.u-strasbg.fr/viz-bin/VizieR). Catalogues or data from VizieR are trimmed to serve astrophysical interests. Many data sets contain only a few dozen stars.

Therefore, for interesting statistical research, one has to dig into raw data sets, which in astronomy are usually stored in FITS format. It seems that not many statisticians are aware of this particular type of data format. Let’s defer a discussion of FITS (I need help from astronomers). I would only like to comment that there are ways for statisticians to access FITS format data.

1/ Bother astronomers. They will help you.

2/ Neither R nor SAS has a FITS reader, but there are free tools available. Here, I attach an email from someone at the 2006 Astrostatistics Summer School at PSU, who kindly answered my question on how to read files in FITS format.

—————————————from Patrick
The FTOOLS package I mentioned today is available at NASA-Goddard via
http://heasarc.gsfc.nasa.gov/ftools/ftools_menu.html
The tool is fdump. Its manual page is available via fhelp fdump

A more modern and perhaps better package of FITS tools is contained in the data analysis package named CIAO, built for the Chandra mission (the NASA satellite Eric and I work with). It’s available via
http://asc.harvard.edu/ciao/download/
(You do NOT need the separate “calibration database” (CALDB) that you’ll see mentioned.) The tool you want in this package is called dmlist, with a syntax like this to dump the data to an ASCII file:
dmlist mydata.fits opt=data,clean outfile=mydata.txt

To see the column names for a table use:
dmlist mydata.fits cols

To see the keywords in the header of the FITS file use
dmlist mydata.fits header

FITS files can sometimes contain multiple data tables. To see the structure of your FITS file use this:
dmlist mydata.fits blocks

Anytime you specify a FITS filename you can optionally specify a block number, e.g.:
dmlist "mydata.fits[4]" opt=data,clean outfile=mydata_part4.txt

You can even specify specific columns to dump, e.g.
dmlist "mydata.fits[4][cols x,y]" opt=data,clean outfile=mydata_part4.txt

The manual page for dmlist is at http://asc.harvard.edu/ciao/ahelp/dmlist.html or via
ahelp dmlist

If you ever need to look at images in FITS format the gold standard application is called “ds9”, included in the CIAO package.

—————————————————————

The 1st strategy looks easy and encourages collaborations between astronomers and statisticians. On the other hand, astronomically knowledgeable statisticians may work on their own to verify their theories purely out of statistical interest and eventually serve astrophysics. Overall, don’t panic over this strange data format.

**Note that this email points to the CIAO package, with which CHANDRA X-ray data are processed.
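For statisticians curious why the format looks strange, part of it is less exotic than it seems: the header of a FITS file is plain ASCII text, laid out as 80-character "cards" and padded to a multiple of 2880 bytes, ending with an END card. The sketch below, in plain Python, builds a toy header in memory and reads its keywords back. It is a simplified illustration only: the helper names make_header and read_header are made up, values are kept as text, string values containing "/" would be cut short, and the data that follow the header are not handled at all. For real work, use the tools in the email above.

```python
CARD = 80      # a FITS header card is 80 ASCII characters
BLOCK = 2880   # headers are padded to a multiple of 2880 bytes


def make_header(cards):
    """Build a padded toy header from (keyword, value_text) pairs."""
    text = "".join(
        f"{key:<8}= {value:>20}".ljust(CARD) for key, value in cards
    )
    text += "END".ljust(CARD)
    padded_len = -(-len(text) // BLOCK) * BLOCK   # round up to a full block
    return text.ljust(padded_len).encode("ascii")


def read_header(raw):
    """Return {keyword: value_text} for the cards that appear before END."""
    header = {}
    for start in range(0, len(raw), CARD):
        card = raw[start:start + CARD].decode("ascii")
        key = card[:8].strip()
        if key == "END":
            break
        if card[8:10] == "= ":
            # Drop an optional trailing "/ comment" (simplified handling).
            header[key] = card[10:].split("/", 1)[0].strip()
    return header


raw = make_header([("SIMPLE", "T"), ("BITPIX", "8"), ("NAXIS", "0")])
header = read_header(raw)
print(len(raw), header)
```

This corresponds roughly to what a header listing shows: keyword and value pairs describing what follows, before any table or image data.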