The AstroStat Slog » Stars
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders
Fri, 09 Sep 2011 17:05:33 +0000

An Instructive Challenge
Tue, 15 Jun 2010 18:38:56 +0000, vlk

This question came to the CfA Public Affairs office, and I am sharing it with y’all because I think the solution is instructive.

A student had to figure out the name of a stellar object as part of an assignment. He was given the following information about it:

  • apparent [V] magnitude = 5.76
  • B-V = 0.02
  • E(B-V) = 0.00
  • parallax = 0.0478 arcsec
  • radial velocity = -18 km/s
  • redshift = 0 km/s

He looked in all the stellar databases but was unable to locate it, so he asked the CfA for help.

Just to help you out, here are a couple of places where you can find comprehensive online catalogs:

See if you can find it!

Answer next month.

Update (2010-aug-02):
The short answer is, I could find no such star in any commonly available catalog. But that is not the end of the story. There does exist a star in the Hipparcos catalog, HIP 103389, that has approximately the right distance (21 pc), radial velocity (-16.1 km/s), and V magnitude (5.70). It doesn’t match exactly, and the B-V is completely off, but that is the moral of the story.

The thing is, catalogs are not perfect. The same objects often have very different numerical entries in different catalogs. This could be due to a variety of reasons, such as different calibrations, different analysers, or even intrinsic variations in the source. And you can bet your bottom dollar that the quoted statistical uncertainties in the quantities do not account for the observed variance. Take the B-V value, for instance. It is 0.5 for HIP 103389, but the initial problem stated that it was 0.02, which makes it an A type star. But if it were an A type star at 21 pc, it should have had a magnitude of V~1.5, much brighter than the required 5.76!
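The consistency check described above can be sketched in a few lines of Python. This is only an illustration: the parallax and apparent magnitude come from the problem statement, while the absolute magnitudes tried at the end (M_V between 0 and 1 for an early A-type star) are assumed round numbers, not catalog values.

```python
import math

def distance_pc(parallax_arcsec):
    """Distance in parsecs from a parallax in arcseconds."""
    return 1.0 / parallax_arcsec

def distance_modulus(d_pc):
    """m - M for a source at d_pc parsecs, ignoring extinction."""
    return 5.0 * math.log10(d_pc / 10.0)

parallax = 0.0478   # arcsec, from the problem statement
v_app = 5.76        # apparent V magnitude, from the problem statement

d = distance_pc(parallax)
mu = distance_modulus(d)
m_abs = v_app - mu  # E(B-V) = 0, so no extinction correction is applied

print(f"distance         = {d:.1f} pc")
print(f"distance modulus = {mu:.2f} mag")
print(f"implied M_V      = {m_abs:.2f}")

# ASSUMPTION: illustrative absolute magnitudes for an early A-type star
for m_v_assumed in (0.0, 1.0):
    print(f"assumed M_V = {m_v_assumed:.1f} -> expected V = {m_v_assumed + mu:.2f}")
```

The implied absolute magnitude comes out near 4.2, far fainter than the assumed A-type values, which is the inconsistency that flags the quoted B-V as suspect.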

I think this illustrates one of the fundamental tenets of science as it is practiced, versus how it is taught. The first thing that a practicing scientist does (especially one not of the theoretical persuasion) is to try and see where the data might be wrong or misleading. It should only be included in analysis after it passes various consistency checks and is deemed valid. The moral of the story is, don’t trust data blindly just because it is a “number”.

SDO launched
Thu, 11 Feb 2010 19:04:00 +0000, vlk

The Solar Dynamics Observatory, which promises a flood of data on the Sun, was launched today from Cape Canaveral.

SINGS
Wed, 07 Oct 2009 01:30:41 +0000, hlee

From SINGS (Spitzer Infrared Nearby Galaxies Survey): Isn’t it a beautiful Hubble tuning fork?

As a first-year graduate student in statistics, I enrolled in Prof. C. R. Rao’s “multivariate analysis” class without much thought, partly because of a rumor that he would not teach any more and partly because of his fame as the most famous statistician alive. Everything was smooth and easy for him, and he had an incredible memory for equations and proofs. I, however, grasped only the intuitive concepts, such as why a method works, not the mathematical details, theorems, and proofs. I immediately began to think about how the methods could be applied to astronomical data. After a few lessons, I desperately wanted to try multivariate analysis methods for classifying galaxy morphology.

The dream soon died because there was no data set that could be fed properly into statistical classification methods. I spent quite some time searching astronomical databases, including ADS. This was before SDSS and VizieR became as popular as they are now. Then I thought about classifying supernovae, because the patterns in their light curves tell us a lot about the history of our universe (Type Ia SNe are standard candles) and because I knew of some publicly available SN light curves. I immediately realized that individual light curves are biased from a sampling perspective, and I did not know how to correct them for multivariate analysis. I also thought about applying multivariate methods to stellar spectral types and to stars in different kinds of systems (single, binary, association, etc.). I thought about how to apply the newly learned methods to every astronomical object I had learned about, from sunspots to AGNs.

Regardless of the objects to be scrutinized under this fascinating subject of “multivariate analysis,” two factors kept discouraging me. First, I did not have enough training to develop, in a couple of weeks, new statistical models that reflect the unique challenges embedded in data with missing values, irregularities, non-iid structure, outliers, and other features that hardly translate into a statistical setting. Second, and more critically, there was no accessible repository of astronomical data for statistical learning. Without deep knowledge of astronomy and trained skills in handling astronomical data, catalogs are generally useless. The catalogs and data sets in archives differ from the data sets in machine learning repositories, which are intuitive.

Astronomers may think that analyzing toy or mock data sets is not scientific because it does not lead to the new discoveries they are always after. From a data analyst’s viewpoint, scientific advance means finding tools that summarize data in an optimal manner. As I urged in Astroinformatics, methods for retrieving information can be tried out and validated on well understood data sets that have already been astrophysically exhausted. The Pythagorean theorem was not proved just once; there are at least 39 different ways to prove it.

Seeing this nice poster image (the full resolution image of 56MB is available from the link) brought back memories of my enthusiasm for applying statistical learning methods to better knowledge discovery. As you can see, there are many different types of galaxies, and often there is no clear boundary between them. Consider classifying blurry galaxies by eye: a spiral can be classified as an irregular, for example. Although I wish for automatic classification of these astrophysical objects, developing machine learning procedures is as complicated as this tuning fork itself, because it is difficult to compose a training set for classification or to collect data from distinct manifold groups for clustering. The complex topology of astronomical objects seems to be the primary reason that statistical learning applications are scarcer here than in other fields.

Nonetheless, multivariate analysis can be useful for viewing relations from perspectives other than known physics models. It may help to develop more finely tuned physics models by taking into account latent variables found through statistical learning. Such attempts, I believe, can help astronomers design telescopes and invent efficient ways to collect and analyze data, by revealing which features matter more than others for understanding the morphology of galaxies, patterns in light curves, spectral types, etc. As such experience accumulates, new physical insights can kick in, just as scientists sorted and assembled galaxies into a tuning fork, which led to the development of various evolution models.

To make a long story short, you have two choices: one, just enjoy these beautiful pictures and appreciate the complexity of our universe, or two, let this picture of Hubble’s tuning fork inspire you toward advances in astroinformatics. Whichever path you choose, it is worth your time.

space weather
Thu, 21 May 2009 22:55:26 +0000, hlee

Among the billions of objects in our Galaxy beyond the Earth, our Sun draws the most attention from astronomers. These astronomers go by the name of solar physicists, and they enjoy the most abundant data, including 400 years of sunspot counts. Their joy stems not only from the fascinating, active, and unpredictable character of the Sun but also from its influence on our daily lives. Because of the latter, the study of conditions on the Sun is sometimes called space weather forecasting.

With my limited knowledge, I cannot lay out all the important aspects of solar physics, of climate change due to solar activity (not limited to our lower atmosphere but covering the space between the Sun and the Earth), or of the most important space weather issues of recent years. I can only emphasize that, compared to Earth climate, atmospheric science, or meteorology, contributions from statisticians to space weather are almost nonexistent. I have frequently seen crude eyeballing used in place of statistics for analyzing data and quantifying images in solar physics. Luckily, I found a few articles that discuss statistics, and my discussion focuses on these papers while leaving room for solar physicists to chip in on how space weather is handled statistically, with a view to collaborating with statisticians.

By the way, I have no intention of belittling “eyeballing” in data analysis by astronomers. Exploratory data analysis (EDA), whose counterpart is confirmatory data analysis (CDA), or statistical inference, is basically “eyeballing” with technical jargon and basics from probability theory. EDA is important for questioning every step in astronomers’ chi-square methods. Without such diagnostics and visualization, choosing the right statistical strategy for real data sets is almost impossible. I wrote “crude” because edges are drawn by hand via eyeballing instead of with “edge detection” algorithms. Another disclaimer: astronomers have developed brilliant image processing and computer vision strategies, which I am not going to present. I am focusing on small areas of statistics related to space weather and its forecasting.

Statistical Assessment of Photospheric Magnetic Features in Imminent Solar Flare Predictions by Song et al. (2009) SoPh. v. 254, p.101.

Their forte is “logistic regression,” a statistical model that is not often used in astronomy. It appears when modeling a binary response (or a categorical response such as heads or tails; agree, neutral, or disagree) with a bunch of predictors, i.e., classification with multiple features or variables (astronomers might like to replace these terms with “parameters”). The paper also discusses variable selection, finding, for example, L_{gnl} to be the most powerful predictor. The training set is carefully discussed from the solar physics perspective. Contrary to their claim that they used “logistic regression” to predict solar flares for the first time, another paper a few years earlier discussed “logistic regression” for predicting geomagnetic storms or coronal mass ejections. This objection may be wrong if flares and CMEs are exclusive events.

The Challenge of Predicting the Occurrence of Intense Storms by Srivastava (2006) J.Astrophys. Astr. v.27, pp.237-242

The probability of storm occurrence is the response in the logistic regression model, whose predictors are CME-related variables, including the latitude and longitude of the CME’s origin, and interplanetary inputs such as shock speeds, ram pressure, and solar wind related measures. Cross-validation was performed. The paper comments that the initial speed of a CME might be the most reliable predictor but offers no extensive discussion of variable or model selection.

Personally speaking, both publications[1] could be more statistically rigorous in discussing the various challenges of logistic regression, from the statistical learning/classification perspective and from the model/variable selection aspect, in order to define better behaved classifiers.
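For readers who have not met logistic regression, here is a minimal sketch in plain Python, fitted by gradient ascent on the log-likelihood. It is not the procedure of either paper: the data are synthetic, and the two predictors (one informative, one not) are hypothetical stand-ins for quantities such as magnetic features or CME speeds.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.5, n_iter=500):
    """Return weights (intercept first) maximizing the log-likelihood."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = yi - sigmoid(z)          # residual on the probability scale
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict_proba(w, x):
    """Predicted probability of the event (e.g. a flare) for predictors x."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

# Synthetic data (ASSUMPTION): the event depends on the first predictor only.
rng = random.Random(0)
true_w = [-1.0, 2.0, 0.0]  # intercept, informative, uninformative
X, y = [], []
for _ in range(400):
    x = [rng.gauss(0, 1), rng.gauss(0, 1)]
    p_event = sigmoid(true_w[0] + true_w[1] * x[0] + true_w[2] * x[1])
    X.append(x)
    y.append(1 if rng.random() < p_event else 0)

w = fit_logistic(X, y)
print("fitted weights:", [round(v, 2) for v in w])
print("P(event | x = [1, 0]) =", round(predict_proba(w, [1.0, 0.0]), 2))
```

The fitted weight of the uninformative predictor should sit near zero, which is the simplest hint of what variable selection is after; a real analysis would add standard errors, cross-validation, and a proper selection criterion.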

We often plan our days according to the weather forecast (although we grumble that forecasts are not right, almost everyone relies on the numbers and predictions from weather people). Although they may not be 100% reliable, those forecasts make our lives easier, and more reliable models are under development. On the other hand, forecasting space weather with the help of statistics is as yet hardly thought of. However, scientists and engineers understand that reliable space weather models help in planning space missions and in putting satellites into safe mode. What I do know is that with flare or CME forecasting models, fewer scientists and engineers would need to wake up in the middle of the night because of otherwise unforeseen storms from the Sun.

  1. I thought I had collected more papers under “statistics” and “space weather,” not just these two. A few more are probably buried somewhere. It is hard to believe that such a rich field has not been touched by statisticians. I would very much appreciate your forwarding any relevant papers; I will add them gradually.

[MADS] Chernoff face
Thu, 02 Apr 2009 16:00:41 +0000, hlee

I cannot remember when I first met the Chernoff face, but it hooked me instantly. I always hoped to come across multivariate data from astronomy to which this charming EDA method could be applied. Then the eagerness somehow faded without my realizing what was happening. Tragically, this was mainly due to my absent-mindedness.

After meeting Prof. Herman Chernoff unexpectedly (I did not know he is Professor Emeritus at Harvard), the urge revived, but I still did not have data. Alas, another bout of absent-mindedness: I do not understand why I did not realize until today that I already had data for trying Chernoff faces, namely XAtlas. The data and their full description are found at the XAtlas website. For Chernoff faces, the references suggested in Wiki:Chernoff face are good. I believe some folks are already familiar with Chernoff faces from a New York Times article last year, listed in the Wiki (or is that a subset characterized by baseball lovers?).

Capella is an X-ray bright star that has been observed multiple times for Chandra calibration. I list 16 ObsIDs in the figures below, one at each face, out of 18+ Capella observations (the last time I checked the Chandra Data Archive, 18 Capella observations were available). These 16 are high resolution observations from which various metrics, such as interesting line ratios and line to continuum ratios, can be extracted. I was told that optically it is hard to find any evidence that Capella experienced catastrophic changes during the Chandra mission (about 10 years in orbit), but the story in X-rays may be quite different. Even in such a dismally short time period (10 years is a flash or less for a star), Capella could have revealed short time scale high energy activity via Chandra. I just wanted to illustrate that Chernoff faces could help in visualizing such changes or any peculiarities through interpretation-friendly facial expressions (studies have confirmed babies’ ability to recognize facial expressions). So, what do you think? Do the faces look similar or different to you? Can you offer me astronomical reasons why a certain face (ObsID) differs from the rest?

[Figures: Chernoff faces of the Capella observations, drawn with faces() (left) and faces2() (right)]

p.s. To draw these Chernoff faces, check the descriptions of the R functions faces() (which yields the left figure) and faces2() (which yields the right figure). There are other variations, and other data analysis systems offer differently styled tools for drawing Chernoff faces to explore multivariate data. Requests for plots in pdf are welcome; these jpeg files look too coarse on my screen.

p.p.s. The variables used for these faces are line ratios and line to continuum ratios. The order of the input variables can change a countenance, but the impressions given by the faces will not change (a face with distinctive shapes will still look different from the other faces after the order of the metrics/variables is scrambled or a different Chernoff face tool is used). The mapping, say from an astronomical metric to the length of the lips, was not studied in this post.

p.p.p.s. Some data points are statistical outliers, and I am not sure how to explain the strange numbers (unrealistic values for line ratios). I hope astronomers can help me interpret those peculiar numbers in the line/continuum ratios. My role is to show that statistics can motivate astronomers toward new discoveries and to offer different graphics tools for enhancing visualization. I hope these faces motivate some astronomers to look into Capella in XAtlas (and beyond) in detail through different spectacles, and to find the reasons for the different facial expressions in the Capella X-ray observations. In particular, ObsID 1199 is the most questionable to me.
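The mapping from metrics to facial features mentioned in the p.p.s. can be sketched as follows. This is not how the R functions work internally; it only shows the usual first step of rescaling each variable to a common range and assigning it to a feature. The feature names, observation labels, and ratio values are all hypothetical.

```python
def rescale_columns(rows):
    """Min-max rescale each column to [0, 1]; constant columns map to 0.5."""
    out_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        if hi == lo:
            out_cols.append([0.5] * len(col))
        else:
            out_cols.append([(v - lo) / (hi - lo) for v in col])
    return [list(r) for r in zip(*out_cols)]

# Hypothetical facial features, one per input variable, in order.
FEATURES = ["face_height", "face_width", "mouth_curve", "eye_size"]

def to_faces(rows, features=FEATURES):
    """Map each observation (row of metrics) to a dict of feature values."""
    return [dict(zip(features, r)) for r in rescale_columns(rows)]

# Hypothetical line ratios for three observations (not XAtlas values).
ratios = {
    "obs_a": [0.52, 1.10, 0.31, 2.4],
    "obs_b": [0.49, 1.05, 0.33, 2.5],
    "obs_c": [0.95, 0.40, 0.30, 2.6],
}
faces = to_faces(list(ratios.values()))
for name, face in zip(ratios, faces):
    print(name, {k: round(v, 2) for k, v in face.items()})
```

Permuting the columns changes which feature carries which ratio, but an observation that is extreme in several ratios (obs_c here) stays extreme in several features, which is why the overall impression survives a reordering.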

after “Thanks to Henrietta Leavitt”
Fri, 07 Nov 2008 03:22:26 +0000, hlee

[Image: symposium flyer]

Personally, I had highly anticipated this symposium at the CfA because I am fascinated by the contributions made here about a century ago by the female computers (or astronomers), even though at that time women were not considered scientists but mere assistants for tedious jobs.

I learned more history, particularly about Ms. Henrietta Leavitt, who first speculated about the period-luminosity relation from Cepheid stars. Her work was a truly painstaking task that cannot be compared to finding a needle in a haystack. It is more like finding needles from one manufacturer in countless haystacks, each of which may or may not contain a needle from that manufacturer. The worst part is that needles are needles: not many carry identification tags as your clothing does.

However, I was disappointed for two reasons. The first is a minor disappointment but a very valuable one. George Johnson, the author of the book Miss Leavitt’s Stars (which I have not read; I did not even know it existed until today), answered my question by saying that he does not think Ms. Leavitt was exposed to statistical research. Finding a relationship between period and luminosity is closely related to simple regression analysis, and I had thought she knew enough statistics to arrive at what is now called Leavitt’s law. This disappointment actually led me to ask when statistical analysis entered astronomy, particularly for finding relationships in studies of standard candles aimed at a correct estimate of the Hubble constant.

The second reason for my disappointment is very poorly executed statistics. Obviously, it was not Ms. Leavitt who introduced such strange trends and statistical malpractice (or carelessness) in regression analysis among astronomers. Whenever speakers showed scatter plots with regression lines and data points with error bars, I kept murmuring silently, “Oh, my God, how could they do that so blindly?” There were statistical issues to be addressed before stating that the results support a certain hypothesis, instead of drawing a straight line and claiming (“see how good the slope is”) that the Hubble constant is # plus or minus $. With a high leverage point on the right and fewer than a dozen points clumped in the left corner, and without the various diagnostic tools of regression analysis, one can neither claim that the straight line is a good fit nor say that the analysis backs up the hypothesis. Perhaps these diagnostics merely supported their final estimates and uncertainties and so were omitted. However, my feeling on looking at the plots is that a simple bootstrap could show that their estimates are not as accurate as they think. Until you try, you do not know, though. I may politely email those speakers to ask for the data points behind their scatter plots. Unfortunately, I expect that no one will be willing to give me their data for such an unwelcome cause, since I have met indifference even for good causes (I might do the same in their position, so no complaints!!!).
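To make the bootstrap remark concrete, here is a minimal sketch with made-up numbers, not data from any talk: ten points clumped on the left with no real trend, plus one high leverage point on the right. It refits the slope on resampled data sets and reports a percentile interval.

```python
import random

def ols_slope(x, y):
    """Ordinary least squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def bootstrap_slopes(x, y, n_boot=2000, seed=1):
    """Slopes refitted on case-resampled data sets, sorted ascending."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xb = [x[i] for i in idx]
        yb = [y[i] for i in idx]
        if max(xb) == min(xb):  # degenerate resample, slope undefined
            continue
        slopes.append(ols_slope(xb, yb))
    return sorted(slopes)

# Hypothetical data: a trendless clump plus one high leverage point.
x = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 20.0]
y = [5.8, 4.1, 6.3, 4.9, 3.7, 6.0, 5.2, 4.4, 6.5, 3.9, 25.0]

slope_all = ols_slope(x, y)
slope_clump = ols_slope(x[:-1], y[:-1])
slopes = bootstrap_slopes(x, y)
lo = slopes[int(0.025 * len(slopes))]
hi = slopes[int(0.975 * len(slopes))]

print(f"slope, all points       = {slope_all:.2f}")
print(f"slope, clump only       = {slope_clump:.2f}")
print(f"bootstrap 95% interval  = [{lo:.2f}, {hi:.2f}]")
```

The full fit gives a slope near 1.07, entirely thanks to the single point on the right; the clump alone gives about -0.4. Resamples that happen to omit the leverage point pull the bootstrap interval far below the fitted slope, which is the kind of warning a straight line drawn through the plot hides.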

Regardless of these disappointments, which come from statistical instinct, it was a scientifically very interesting symposium, and I would like to thank those who made great efforts to put it together. It helped satisfy some of my craving to know about Ms. Leavitt and fulfilled one of my old wishes, that her work be recognized under her name. If there is a symposium this year to commemorate the centennial of Student’s t, I wish I could attend it. It is always good to know history better in order to move forward.

As an aside, during his talk G. Johnson showed pictures of the apartment building where Ms. Leavitt resided until her death, which I see every day, and of Mt. Auburn Cemetery, a very beautiful and calming place, where she was buried. I wish she had lived long enough to catch a glimpse of her great contribution to astronomical science.

“Thanks to Henrietta Leavitt”
Thu, 06 Nov 2008 10:00:17 +0000, vlk

[9/30/2008]

The CfA is celebrating the 100th anniversary of the discovery of the Cepheid period-luminosity relation on Nov 6, 2008. See for details.

[Update 10/03] For a nice introduction to the story of Henrietta Swan Leavitt, listen to this Perimeter Institute talk by George Johnson:

[Update 11/06] The full program is now available. The symposium begins at Noon today.

The Big Picture
Mon, 13 Oct 2008 17:07:03 +0000, vlk

Our hometown rag (the Boston Globe) runs an occasional series of photo collections that highlight news stories called The Big Picture. This week, they take a look at the Sun:

The pictures come from space and ground observatories, from SoHO, TRACE, Hinode, STEREO, etc. It goes without saying that the images are stunning, and some are even animated. The real kicker is that images such as these are being acquired by the hundreds, every hour upon the hour, 24/7/365.25. It is like sipping from a firehose. Nobody can sit there and look at them all, so who knows what we are missing out on. Can statistics help? Can we automate a statistically robust “interestingness” criterion to filter the data stream that humans can then follow up on?

Differential Emission Measure [Eqn]
Wed, 13 Aug 2008 17:00:13 +0000, vlk

Differential Emission Measures (DEMs) are a summary of the temperature structure of the outer atmospheres (aka coronae) of stars, and are usually derived from a select subset of line fluxes. They are notoriously difficult to estimate. Very few algorithms even bother to calculate error envelopes on them. They are also subject to numerous systematic uncertainties which can play havoc with proper interpretation. But they are nevertheless extremely useful since they allow changes in coronal structures to be easily discerned, and observations with one instrument can be used to derive these DEMs and these can then be used to predict what is observable with some other instrument.

The flux at Earth due to an atomic transition $u \to l$ from a volume element $\delta V$ at a location $\mathbf{r}$ is

$$I_{ul}(\mathbf{r}) = \frac{1}{4\pi}\,\frac{1}{d(\mathbf{r})^2}\,A(Z,\mathbf{r})\,G_{ul}(n_e(\mathbf{r}),T_e(\mathbf{r}))\,n_e(\mathbf{r})^2\,\delta V(\mathbf{r}) ,$$

where $n_e$ is the electron density and $T_e$ is the temperature of the plasma, $A(Z,\mathbf{r})$ is the abundance of element $Z$, $G_{ul}(n_e,T_e)$ is the atomic emissivity for the transition, and $d$ is the distance to the source.

We can combine the flux from all the points in the field of view that arise from plasma at the same temperature,

$$I_{ul}(T_e) = \frac{1}{4\pi} \sum_{\mathbf{r}|T_e} \frac{1}{d(\mathbf{r})^2}\,A(Z,\mathbf{r})\,G_{ul}(n_e(\mathbf{r}),T_e)\,n_e(\mathbf{r})^2\,\delta V(\mathbf{r}) .$$

Assuming that $A(Z,\mathbf{r})$ and $n_e(\mathbf{r})$ do not vary over the points in the summation,

$$I_{ul}(T_e) \approx \frac{1}{4\pi d^2}\,G_{ul}(n_e,T_e)\,A(Z)\,n_e^2\,\frac{\Delta V}{\Delta\log T_e}\,\Delta\log T_e ,$$

and hence the total line flux due to emission at all temperatures,

$$I_{ul} = \sum_{T_e} \frac{1}{4\pi d^2}\,A(Z)\,G_{ul}(n_e,T_e)\,\mathrm{DEM}(T_e)\,\Delta\log T_e .$$

The quantity

$$\mathrm{DEM}(T_e) = n_e^2\,\frac{\Delta V}{\Delta\log T_e}$$

is called the Differential Emission Measure and is a very useful summary of the temperature structure of stellar coronae. It is typically reported in units of [cm$^{-3}$] (or [cm$^{-5}$] if $\Delta V$ is written out as area $\times\,\Delta h$). Sometimes it is defined as $n_e^2\,(\Delta V/\Delta T)$ and has units [cm$^{-3}$ K$^{-1}$].
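The last sum is easy to evaluate once an emissivity curve and a DEM are tabulated on the same temperature grid. The sketch below does so with placeholder inputs: the Gaussian emissivity, the flat DEM, the abundance, and the 10 pc distance are all assumed values chosen only to make the example run, not taken from any atomic database or star.

```python
import math

PC_CM = 3.0857e18  # one parsec in cm

def line_flux(g_ul, dem, dlog_t, abundance, d_cm):
    """Sum over the temperature grid of A(Z) G_ul DEM dlogT / (4 pi d^2)."""
    if len(g_ul) != len(dem):
        raise ValueError("g_ul and dem must share the same temperature grid")
    total = sum(g * q for g, q in zip(g_ul, dem)) * dlog_t
    return abundance * total / (4.0 * math.pi * d_cm ** 2)

# Temperature grid: log10(Te) from 6.0 to 7.5 in steps of 0.1
dlog_t = 0.1
log_t = [6.0 + dlog_t * k for k in range(16)]

# ASSUMPTION: Gaussian emissivity peaking at log Te = 6.8 [erg cm^3 / s]
g_ul = [1e-24 * math.exp(-0.5 * ((lt - 6.8) / 0.15) ** 2) for lt in log_t]

# ASSUMPTION: flat DEM [cm^-3], abundance relative to hydrogen, distance
dem = [1e51] * len(log_t)
flux = line_flux(g_ul, dem, dlog_t, abundance=1e-4, d_cm=10.0 * PC_CM)

print(f"predicted line flux = {flux:.3e} erg/s/cm^2")
```

This forward direction, from DEM to line flux, is the easy one; it is what allows a DEM derived with one instrument to predict what another instrument would see.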

The expression for the line flux is an instance of Fredholm’s Equation of the First Kind and the DEM(Te) solution is thus unstable and subject to high-frequency oscillations. There is a whole industry that has grown up trying to derive DEMs from often highly unreliable datasets.
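The instability can be seen numerically even without attempting an inversion. In the sketch below, five hypothetical lines with broad Gaussian emissivities are used to compute fluxes from two DEMs, one smooth and one carrying a 50% high-frequency ripple. All shapes and numbers are assumptions in normalized units.

```python
import math

def gaussian_kernel(log_t, peak, width=0.2):
    """Hypothetical emissivity curve, Gaussian in log Te (normalized units)."""
    return [math.exp(-0.5 * ((lt - peak) / width) ** 2) for lt in log_t]

def fluxes(kernels, dem, dlog_t):
    """Line fluxes as sums of kernel * DEM * dlogT, one per line."""
    return [sum(g * q for g, q in zip(k, dem)) * dlog_t for k in kernels]

dlog_t = 0.01
log_t = [5.5 + dlog_t * k for k in range(251)]  # log Te from 5.5 to 8.0

# Five hypothetical lines with different peak temperatures
kernels = [gaussian_kernel(log_t, p) for p in (6.4, 6.6, 6.8, 7.0, 7.2)]

# A smooth DEM, and the same DEM with a 50% high-frequency oscillation
dem_smooth = [1.0 + 0.5 * math.exp(-0.5 * ((lt - 6.8) / 0.3) ** 2) for lt in log_t]
dem_wiggly = [q * (1.0 + 0.5 * math.sin(2.0 * math.pi * lt / 0.05))
              for q, lt in zip(dem_smooth, log_t)]

f_smooth = fluxes(kernels, dem_smooth, dlog_t)
f_wiggly = fluxes(kernels, dem_wiggly, dlog_t)

max_dem_change = max(abs(a - b) / a for a, b in zip(dem_smooth, dem_wiggly))
max_flux_change = max(abs(a - b) / a for a, b in zip(f_smooth, f_wiggly))

print(f"largest fractional change in DEM  = {max_dem_change:.2f}")
print(f"largest fractional change in flux = {max_flux_change:.2e}")
```

The two DEMs differ by tens of percent point by point, yet the predicted fluxes agree far better than any realistic measurement error. Line fluxes alone therefore cannot rule out the oscillating solution, which is why practical DEM reconstructions impose smoothness or some other regularization.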

Line Emission [EotW]
Wed, 14 May 2008 17:00:23 +0000, vlk

Spectral lines are a ubiquitous feature of astronomical data. This week, we explore the special case of optically thin emission from low-density and high-temperature plasma, and consider the component factors that determine the line intensity.

The flux [ergs s$^{-1}$ cm$^{-2}$ sr$^{-1}$] from an optically thin emission line that arises due to a transition between energy levels $j$ and $i$ in an ionic species $Z^{+I}$ is simply written. It is the product of the probability of the transition $A_{ji}(Z,I)$ (aka the Einstein coefficient), the number density of particles of the species that exist in the upper level of the transition $N_j(Z,I)$, the volume of the emission $dV$, and the energy of the emitted photon $hc/\lambda$, scaled down by the distance to the source ($4\pi d^2$; note that the factor $4\pi$ is due to the emission being radially symmetric).
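Written out, the product described in the paragraph above is the following. The symbol $f_{ji}$ for the line flux is a label introduced here for convenience, not the post's own notation.

$$f_{ji} = \frac{1}{4\pi d^2}\, A_{ji}(Z,I)\, N_j(Z,I)\, \frac{hc}{\lambda}\, dV$$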

But this apparently purely atomic calculation can be reformed and rewritten, after some algebra, in terms of quantities that are astrophysically more meaningful. The equations below walk you through the transformation from atomic physics to quantities that can be separated out into different hierarchies of astrophysical source properties, from things that change not at all from one source to another, to things that are likely not the same even along the line-of-sight.

[Figure: derivation of the optically thin line emission]

All of the quantities that depend only on the atomic physics can be pulled together into the emissivity of the transition, $e_{ji}(N_e,T_e,Z,I)$. This is (mostly) independent of the physical conditions at the source, and is generally treated as invariant except for changes due to the electron number density. These can therefore be calculated beforehand, and indeed, codes such as CHIANTI, SPEX, and APEC do just that. The abundance $A_Z$ (note, not the Einstein coefficient: apologies for the overlapping notation, can’t be helped for historical reasons) changes from source to source, and sometimes even within a source, but is the stablest of the factors after the emissivity. The ion balance $i(T_e,Z,I) = N_{Z,I}/N_Z$ is strongly variable, as is the so-called emission measure, $EM = N_e^2\,dV$, which btw is also a function of $T_e$. The atomic emissivity and the ion balance are sometimes combined together and the product is also confusingly referred to as the emissivity. Strictly speaking, the level population is dependent on the ion fractions and therefore the emissivity cannot be exactly separated from the ion balance. However, this dependence is weak in the density limits we are usually interested in ($N_e \sim 10^{8-12}$ cm$^{-3}$, as in the solar corona), and the two can be separated.
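One common way to arrive at this factorization, offered as a sketch since the grouping of terms in the original derivation may differ, is to expand the upper level population as a chain of ratios. Here $N_H$ is the hydrogen number density, the abundance is taken to be $A_Z = N_Z/N_H$, and $f_{ji}$ is a label for the line flux introduced here for convenience; these are assumptions of the sketch.

$$N_j(Z,I) = \frac{N_j(Z,I)}{N_{Z,I}} \cdot \frac{N_{Z,I}}{N_Z} \cdot \frac{N_Z}{N_H} \cdot \frac{N_H}{N_e} \cdot N_e$$

Multiplying by $A_{ji}\,(hc/\lambda)\,dV/(4\pi d^2)$ and regrouping then gives

$$f_{ji} = \frac{1}{4\pi d^2}\; e_{ji}(N_e,T_e,Z,I)\; A_Z\; i(T_e,Z,I)\; EM , \qquad e_{ji} = A_{ji}\,\frac{hc}{\lambda}\,\frac{N_j(Z,I)}{N_{Z,I}}\,\frac{N_H}{N_e}\,\frac{1}{N_e} , \qquad EM = N_e^2\,dV .$$

With the factor $hc/\lambda$ included, $e_{ji}$ carries units of [ergs cm$^3$/s], and since $EM$ is $N_e^2\,dV$ it carries [cm$^{-3}$], so the product divided by $4\pi d^2$ has the units of a flux.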

It is important to note that each of the terms listed above has associated model or measurement uncertainties. Often, the Einstein coefficients and the energy of the emission are not experimentally verified, and the level populations are approximate calculations due to the complexity of the level structure of the species in question. Typical ion balance calculations assume that the plasma is in thermodynamic equilibrium, which is often not a good assumption. Abundances are known to vary radically (by factors greater than 2) across the source. And finally, except at high temperatures and low density (such as stellar coronae), the assumption of zero opacity (i.e., that any emitted photon escapes to infinity without any scatterings) is not applicable, and radiative transfer effects must be included.

A brief word about the units. Astronomers tend to use cgs, not SI. So the flux usually has units [ergs/s/cm$^2$/sr], the emissivity $e_{ji}$ is in [ph cm$^3$/s] (unless the factor $hc/\lambda$ is included in the emissivity, in which case the units are [ergs cm$^3$/s]), and the emission measure is in [cm$^{-3}$].

The emission measure is a story by itself, one best left alone for another time.
