The AstroStat Slog » question for statisticians
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

[Q] Objectivity and Frequentist Statistics (vlk, Mon, 29 Sep 2008)
http://hea-www.harvard.edu/AstroStat/slog/2008/q-objectivity-frequentist/

Is there an objective method to combine measurements of the same quantity obtained with different instruments?

Suppose you have a set of N1 measurements obtained with one detector, and another set of N2 measurements obtained with a second detector. And let's say you want something as simple as an estimate of the mean of the quantity (say the intensity) being measured. Let us further stipulate that the measurement errors of the individual points are similar in magnitude and that neither instrument displays any odd behavior. How does one combine the two datasets without appealing to subjective biases about the reliability, or otherwise, of the two instruments?

We’ve mentioned this problem before, but I don’t think there’s been a satisfactory answer.

The simplest thing to do would be to pool all the measurements into one dataset with N=N1+N2 measurements and compute the mean of that. But if the numbers of points in the two datasets are very different, the simple combined sample mean is effectively a statement of bias in favor of the dataset with more measurements.
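To make the weighting explicit, here is a toy sketch in R; the numbers are entirely made up, and only the relative sample sizes matter:

```r
## Toy illustration (hypothetical numbers): pooling versus averaging detectors.
set.seed(1)
x1 <- rnorm(100, mean = 10.0, sd = 1)   # N1 = 100 measurements from detector 1
x2 <- rnorm(10,  mean = 10.5, sd = 1)   # N2 = 10 measurements from detector 2

## Simple pooled mean: weights each *measurement* equally,
## so detector 1 dominates 10:1.
mean(c(x1, x2))

## Equal weight to each *detector* instead:
mean(c(mean(x1), mean(x2)))

## Inverse-variance weighting of the two detector means
## (weights ~ N_i when the per-point errors are similar):
w <- c(length(x1), length(x2))
weighted.mean(c(mean(x1), mean(x2)), w)   # identical to the pooled mean here
```

Pooling is thus the same as inverse-variance weighting of the two detector means; the question is whether that, or equal weighting of the detectors, or something else, counts as "objective".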

In a Bayesian context, there seems to be at least a well-defined prescription: define a model, compute the posterior probability density for the model parameters from dataset 1 using some non-informative prior, then use this posterior density as the prior density in the next step, where a new posterior density is computed from dataset 2.
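In the simplest conjugate case, a normal mean with known measurement sigma, that prescription reduces to a few lines; a minimal sketch, again with made-up numbers:

```r
## Minimal sketch of the sequential-updating prescription for a normal mean
## with known measurement sigma (hypothetical numbers throughout).
set.seed(2)
sigma <- 1
x1 <- rnorm(100, 10.0, sigma)
x2 <- rnorm(10,  10.5, sigma)

## Step 1: flat (non-informative) prior + dataset 1 -> posterior N(m1, s1^2)
m1 <- mean(x1)
s1 <- sigma / sqrt(length(x1))

## Step 2: use that posterior as the prior for dataset 2
m2_data <- mean(x2)
s2_data <- sigma / sqrt(length(x2))
post_prec <- 1/s1^2 + 1/s2_data^2
post_mean <- (m1/s1^2 + m2_data/s2_data^2) / post_prec
post_sd   <- sqrt(1/post_prec)
c(post_mean, post_sd)

## Note: with a flat starting prior this reproduces the pooled
## (inverse-variance-weighted) answer -- the order of the two datasets
## does not matter, but the relative N's still set the weights.
```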

What does one do in the frequentist universe?

[Update 9/30] After considerable discussion, it seems clear that there is no way to do this without making some assumption about the reliability of the detectors. In other words, disinterested objectivity is a mirage.

loess and lowess and locfit, oh my (chasc, Fri, 25 Jul 2008)
http://hea-www.harvard.edu/AstroStat/slog/2008/question-locfit-errors/

Diab Jerius follows up on LOESS techniques with a very nice summary update and finds LOCFIT to be very useful, but there are still questions about how it deals with measurement errors and combining observations from different experiments:

A couple of weeks ago Vinay suggested using the LOESS algorithm to create smooth curves (separately) through the SSD and FPC points. LOESS has been succeeded by LOWESS and, finally, LOCFIT, which is the 800-lb gorilla of local regression fitting.

The LOCFIT algorithm uses local regression (i.e., fits over local subsets of the data) to generate smooth curves. There is an enormous body of literature on this, much of it summarized in the book

Local Regression and Likelihood, by C. Loader
ISBN 0-387-98775-4

which also serves as documentation for the LOCFIT software. The techniques seem well established and accepted by the statistical community.

LOCFIT looks to be a very elegant approach, but, unfortunately, I have still not been able to glean any information as to how one introduces experimental errors into the regressions. The voluminous research in this field certainly deals with experimental data, so I’m not quite sure what to make of this.

One way around this might be to take a Monte-Carlo approach: resample the data using the experimental errors, generate a new smoothing function, and generate a measure of the distribution of the fit functions.
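A rough sketch of that resampling idea, assuming the locfit package is installed and hypothetical x, y, yerr vectors (the smoothing parameter nn is arbitrary here):

```r
## Sketch of the Monte-Carlo resampling idea: perturb the data by the
## measurement errors, refit, and look at the spread of the fitted curves.
## x, y, yerr and nn = 0.7 are hypothetical.
library(locfit)

set.seed(3)
x    <- seq(0, 10, length.out = 50)
y    <- sin(x) + rnorm(50, sd = 0.2)
yerr <- rep(0.2, 50)

xgrid  <- seq(min(x), max(x), length.out = 200)
nsim   <- 500
curves <- matrix(NA_real_, nrow = nsim, ncol = length(xgrid))

for (i in seq_len(nsim)) {
  ysim <- y + rnorm(length(y), sd = yerr)   # perturb by measurement errors
  fit  <- locfit(ysim ~ lp(x, nn = 0.7))    # local regression fit
  curves[i, ] <- predict(fit, newdata = xgrid)
}

## Pointwise spread of the fitted curves:
band <- apply(curves, 2, quantile, probs = c(0.16, 0.5, 0.84))
```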

For those interested, I have a copy of the above book on loan.
It’s fascinating reading.

More about the actual code is available at this web site:
http://locfit.herine.net/

In addition, Ping Zhao asks: (paraphrasing) if you combine two separate sets of observations with vastly different numbers of data points in each, how do you weight them during a combined loess/lowess/locfit fit?

Comments and suggestions from statisticians are much appreciated!

Q: Lowess error bars? (vlk, Tue, 03 Jun 2008)
http://hea-www.harvard.edu/AstroStat/slog/2008/question-lowess-error-bars/

It is somewhat surprising that astronomers haven't cottoned on to Lowess curves yet. That's probably a good thing, because I think people already indulge in smoothing far too much for their own good, and Lowess makes for a very powerful hammer. But the fact that it is semi-parametric and is based on polynomial least-squares fitting does make it rather attractive.

And, of course, sometimes it is unavoidable, or so I told Brad W. When one has too many points for a regular polynomial fit, and they are too scattered for a spline, and too few to try a wavelet “denoising”, and no real theoretical expectation of any particular model function, and all one wants is “a smooth curve, damnit”, then Lowess is just the ticket.

Well, almost.

There is one major problem: how does one figure out what the error bounds are on the "best-fit" Lowess curve? Clearly, each fit at each point can produce an estimate of the error, but simply collecting those separate errors is not the right thing to do, because they are all correlated. I know how to propagate Gaussian errors when boxcar-smoothing a histogram, but this is a whole new level of complexity. Does anyone know if there is software that can calculate reliable error bands on the smooth curve? We will take any kind of error model: Gaussian, Poisson, even the (local) variances in the data themselves.
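For what it's worth, the easy part is sketched below, using R's loess() (which exposes pointwise standard errors) rather than lowess, on hypothetical data; this does not address the correlation between neighbouring points, which is the real question here:

```r
## Not a full answer (pointwise only; neighbouring points remain correlated),
## but a sketch of the easy part using R's loess(); x and y are hypothetical.
set.seed(4)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

fit  <- loess(y ~ x, span = 0.5, degree = 2)
grid <- data.frame(x = seq(min(x), max(x), length.out = 200))
pr   <- predict(fit, newdata = grid, se = TRUE)

upper <- pr$fit + 1.96 * pr$se.fit   # approximate 95% pointwise band
lower <- pr$fit - 1.96 * pr$se.fit

plot(x, y, pch = 16, col = "grey")
lines(grid$x, pr$fit, lwd = 2)
lines(grid$x, upper, lty = 2)
lines(grid$x, lower, lty = 2)
```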

Did they, or didn't they? (vlk, Tue, 20 May 2008)
http://hea-www.harvard.edu/AstroStat/slog/2008/type1a-progenitor/

Earlier this year, Peter Edmonds showed me a press release that the Chandra folks were, at the time, considering putting out, describing the possible identification of a Type Ia Supernova progenitor. What appeared to be an accreting white dwarf binary system could be discerned in 4-year-old observations, coincident with the location of a supernova that went off in November 2007 (SN2007on). An amazing discovery, but there is a hitch.

And it is a statistical hitch, involving two otherwise highly reliable and oft-used methods that give contradictory answers at nearly the same significance level! Does this mean that the chances are actually 50-50? Really, we need a bona fide statistician to take a look and point out the errors of our ways…

The first time around, Voss & Nelemans (arXiv:0802.2082) looked at how many X-ray sources there were around the candidate progenitor of SN2007on (they also looked at 4 more galaxies that hosted Type Ia SNe and that had X-ray data taken prior to the event, but didn't find any other candidates), and estimated the probability of chance coincidence with the optical position. When you expect 2.2 X-ray sources/arcmin² near the optical source, the probability of finding one within 1.3 arcsec of it by chance is tiny, in fact around 0.3%. This result has since been reported in Nature.
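For the record, a back-of-the-envelope version of that number, treating the unrelated X-ray sources as a uniform Poisson field:

```r
## Back-of-the-envelope chance-coincidence probability, assuming unrelated
## X-ray sources form a uniform Poisson field of 2.2 per arcmin^2.
dens_arcmin2 <- 2.2
r_arcsec     <- 1.3
r_arcmin     <- r_arcsec / 60

## Expected number of unrelated sources within r of the optical position:
mu <- dens_arcmin2 * pi * r_arcmin^2

## Probability of at least one such chance interloper:
1 - exp(-mu)    # ~0.003, i.e. ~0.3%
```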

However, Roelofs et al. (arXiv:0802.2097) went about getting better optical positions and doing better bore-sighting, and as a result they measured the X-ray position more accurately and also carried out Monte Carlo simulations to estimate the error on the measured location. They concluded that the actual separation, 1.18±0.27 arcsec, is too large, given the measurement error in the location, for the X-ray source to be coincident with the supernova. The probability of finding an offset in the observed range, if the two locations were actually the same, is ~1% [see Tom's clarifying comment below].

Well now, ain’t that a nice pickle?

To recap: there are so few X-ray sources in the vicinity of the supernova that anything close to its optical position cannot be a coincidence, BUT, the measured error in the position of the X-ray source is not copacetic with the optical position. So the question for statisticians now: which argument do you believe? Or is there a way to reconcile these two calculations?

Oh, and just to complicate matters, the X-ray source that was present 4 years ago had disappeared when looked for in December, as one would expect if it was indeed the progenitor. But on the other hand, a lot of things can happen in 4 years, even with astronomical sources, so that doesn’t really confirm a physical link.

The Flip Test (vlk, Thu, 01 May 2008)
http://hea-www.harvard.edu/AstroStat/slog/2008/the-flip-test/

Why is it that detection of emission lines is more reliable than that of absorption lines?

That was one of the questions that came up during the recent AstroStat Special Session at HEAD2008. Look at the iconic Figure 1 from Protassov et al. (2002), which shows the null distribution of the Likelihood Ratio Test (LRT) statistic and how it holds up for testing the existence of emission and absorption lines. The thin vertical lines are the nominal F-test cutoffs for a 5% false positive rate. The nominal F-test is too conservative in the former case (figures a and b; i.e., actual existing lines will not be recognized as such), and too anti-conservative in the latter case (figure c; i.e., non-existent lines will be flagged as real).
Fig 1 from Protassov et al. (2002)

Why the dichotomy between the two cases? David and Eric basically said during the Q&A that followed their talks that when a statistic is calibrated, we know how it is distributed; when it is not, we usually don't know how badly off it will be in any specific case.
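The "calibrate it yourself" option they were alluding to can be sketched as a straight simulation of the null distribution. The toy line-detection statistic below is purely illustrative, not the LRT setup of Protassov et al., who recommend posterior predictive calibration:

```r
## Sketch of calibrating a test statistic by simulation when its nominal
## reference distribution is suspect.  The "line strength" statistic here is
## a deliberately simple toy, not the statistic used in the paper.
set.seed(5)
nbins     <- 100
continuum <- rep(5, nbins)   # hypothetical flat null continuum (counts/bin)

## Best improvement in the Poisson log-likelihood from adding a line of free
## non-negative amplitude to any single bin (MLE mean = max(counts, model)).
line_stat <- function(counts, model) {
  ll0 <- dpois(counts, model, log = TRUE)
  ll1 <- dpois(counts, pmax(counts, model), log = TRUE)
  max(2 * (ll1 - ll0))
}

## Build the null distribution empirically:
null_T <- replicate(2000, line_stat(rpois(nbins, continuum), continuum))
cutoff <- quantile(null_T, 0.95)   # empirical 5% false-positive threshold

## Compare an observed spectrum against the calibrated cutoff:
obs <- rpois(nbins, continuum); obs[40] <- obs[40] + 8   # hypothetical line
line_stat(obs, continuum) > cutoff
```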

Here’s an interesting tidbit. A long time ago, in the infancy of my grad studenthood, I learned the following trick. When you are confronted with an emission line spectrum, and you think you see a weak line just barely above noise, how do you avoid getting fooled? You flip the spectrum over and look at it upside down! Does that feature still look like a line? Then you are probably OK with thinking it is real.

But why should that trick work? Our brains seem to be somehow rigged to discount absorption lines, so that when an emission feature is flipped over, it becomes “less significant”. This is the opposite of the usual F-test, and is in fact in line with the method recommended by Protassov et al.

Why this should be so, I have no idea. Will that trick work with trained statisticians? Or even with all astronomers? I have no answers, only questions.

Dance of the Errors (vlk, Mon, 21 Jan 2008)
http://hea-www.harvard.edu/AstroStat/slog/2008/errordance/

One of the big problems that has come up in recent years is how to represent the uncertainty in certain estimates. Astronomers usually present errors as ±stddev on the quantities of interest, but that presupposes that the errors are uncorrelated. But suppose you are estimating a multi-dimensional set of parameters that may have large correlations amongst themselves? One such case is that of Differential Emission Measures (DEM), where the "quantity of emission" from a plasma (loosely, how much stuff there is available to emit; it is the product of the volume and the densities of electrons and H) is estimated for different temperatures. See the plots at the PoA DEM tutorial for examples of how we are currently trying to visualize the error bars. Another example is the correlated systematic uncertainties in effective areas (Drake et al., 2005, Chandra Cal Workshop). This is not dissimilar to the problem of determining the significance of a "feature" in an image (Connors, A. & van Dyk, D.A., 2007, SCMA IV).

Here is a specific example that came up due to a comment by a referee on a paper with David G.-A. We had said that the O abundance is dominated by uncertainties in the DEM at low temperatures because that is where most of the emission from O is formed. The referee disputed this, saying yeah, but O is also present at higher temperatures, and since the DEM is much higher there, that should be the predominant contribution to the estimate. In effect, the referee said, “show me!” The problem is, how? The measured fluxes are:

fO7obs = 2 ± 0.75

fO8obs = 4 ± 0.88

The predicted fluxes are:

fO7pred = 1.8 ± 0.72

fO8pred = 3.6 ± 0.96

where the error bars here come from the stddev of the fluxes predicted by each DEM realization that comes out of the MCMC analysis. On the face of it, it looks like a pretty good match to the observations, though a slightly different picture emerges if one were to look at the distribution of the predicted fluxes:

mode(fO7pred)=0.76 (95% HPD interval = 0.025:2.44)

mode(fO8pred)=2.15 (95% HPD interval = 0.95:4.59)
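For reference, here is a minimal sketch of how a 95% HPD interval like those above can be read off a vector of MCMC-predicted fluxes, assuming (as the text implies) that this is how they were computed:

```r
## Minimal sketch: shortest interval containing 95% of a vector of MCMC draws.
hpd95 <- function(draws, mass = 0.95) {
  s <- sort(draws)
  n <- length(s)
  k <- floor(mass * n)
  widths <- s[(k + 1):n] - s[1:(n - k)]
  i <- which.min(widths)
  c(lower = s[i], upper = s[i + k])
}

## e.g. hpd95(fO7pred_draws), where fO7pred_draws would hold the 2000 MCMC
## predicted fluxes; demo with stand-in draws:
hpd95(rgamma(2000, 2, 1))
```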

What if one computed the flux at each temperature and did the same calculation separately? That is shown in the following plot, where the product of the DEM and the line emissivity computed at each temperature bin is shown for both O VII (red) and O VIII (blue). The histograms are for the best-fit DEM solution, and the vertical bars are stddevs on the product, which differs from the flux only by a constant. The dashed lines show the 95% highest posterior density intervals.
Figure 1

Figure 1: Fluxes from O VII and O VIII computed at each temperature from DEM solution of RST 137B. The solid histograms are the fluxes for the best-fit DEM, and the vertical bars are the stddev for each temperature bin. The dotted lines denote the 95% highest-posterior density intervals for each temperature.

But even this tells an incomplete tale. The full measure of the uncertainty goes unseen until all the individual curves are seen, as in the animated gif below which shows the flux calculated for each MCMC draw of the DEM:
Figure 2

Figure 2: Predicted flux in O VII and O VIII lines as a product of line emissivity and MCMC samples of the DEM for various temperatures. The dashed histogram is from the best-fit DEM, the solid histograms are for the various samples (the running number at top right indicates the sample sequence; only the last 100 of the 2000 MCMC draws are shown).

So this brings me to my question. How does one represent this kind of uncertainty in a static plot? We know what the uncertainty is; we just don't know how to publish it.
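One candidate, sketched below, is a pointwise 95% band across the draws with a few thinned sample curves overplotted; I am not claiming it captures everything the animation does. The matrix flux (ndraw x ntemp) and the grid logT are stand-ins, faked here so the sketch runs:

```r
## One possible static summary: a pointwise 95% band across the MCMC draws,
## plus a few thinned sample curves.  `flux` stands in for the ndraw x ntemp
## matrix of predicted fluxes; it is faked here purely for illustration.
set.seed(6)
logT <- seq(5.5, 7.5, by = 0.1)
flux <- t(replicate(2000, dnorm(logT, 6.3, 0.3) * rlnorm(1, 0, 0.3) +
                          rnorm(length(logT), 0, 0.02)))

band <- apply(flux, 2, quantile, probs = c(0.025, 0.5, 0.975))

plot(logT, band[2, ], type = "n", ylim = range(flux),
     xlab = "log T", ylab = "predicted flux")
polygon(c(logT, rev(logT)), c(band[1, ], rev(band[3, ])),
        col = "grey80", border = NA)                 # pointwise 95% band
for (i in sample(nrow(flux), 20)) lines(logT, flux[i, ], col = "grey50")
lines(logT, band[2, ], lwd = 2)                      # pointwise median
```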

Wrong Priors? (vlk, Mon, 10 Sep 2007)
http://hea-www.harvard.edu/AstroStat/slog/2007/wrong-priors/

arXiv:0709.1067v1 : Wrong Priors (Carlos C. Rodriguez)

This came through today on astro-ph, suggesting that we could be choosing priors better than we do, and in fact that we generally do a very bad job of it. I have been brought up to believe that, like points in Whose Line Is It Anyway, priors don’t matter (unless you have very little data), so I am somewhat confused. What is going on here?
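Here, just to fix ideas, is the toy beta-binomial version of that folk claim (numbers made up):

```r
## Toy beta-binomial check of the folk claim that priors stop mattering once
## there is enough data (numbers hypothetical).
post_mean <- function(k, n, a, b) (a + k) / (a + b + n)   # Beta(a,b) prior

## Two rather different priors:
flat    <- c(a = 1,  b = 1)
opinion <- c(a = 20, b = 5)     # strongly favours high success rates

## Little data: 3 successes in 5 trials
c(post_mean(3, 5, flat["a"], flat["b"]),
  post_mean(3, 5, opinion["a"], opinion["b"]))     # 0.57 vs 0.77: prior matters

## Lots of data: 600 successes in 1000 trials
c(post_mean(600, 1000, flat["a"], flat["b"]),
  post_mean(600, 1000, opinion["a"], opinion["b"])) # 0.600 vs 0.605: it doesn't
```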

Everything you wanted to know about power-laws but were afraid to ask (vlk, Fri, 08 Jun 2007)
http://hea-www.harvard.edu/AstroStat/slog/2007/astroph-07061062/

Clauset, Shalizi, & Newman (2007, arXiv:0706.1062) have a very detailed description of what power-law distributions are, how to recognize them, how to fit them, etc. They are also making available the matlab and R codes that they use to do the fitting and such.

Looks like a very handy reference text, though I am a bit uncertain about their use of the K-S test to check whether a dataset can be described with a power-law or not. It is probably fine; perhaps some statisticians would care to comment?
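To fix ideas, here is a minimal sketch of the continuous-case recipe (the MLE for the exponent plus the KS distance used in the goodness-of-fit test); this is not the authors' released code, and xmin is taken as known here:

```r
## Minimal sketch of the continuous-case power-law recipe: MLE for the
## exponent and the KS distance used for goodness of fit.  Not the authors'
## released matlab/R code; xmin is assumed known (the paper also estimates
## xmin by minimising this KS distance).
set.seed(7)
xmin <- 1
x <- xmin * runif(5000)^(-1 / 1.5)        # simulated data with alpha = 2.5

alpha_hat <- 1 + length(x) / sum(log(x / xmin))   # continuous-case MLE

## KS distance between empirical and fitted power-law CDFs:
xs      <- sort(x)
cdf_fit <- 1 - (xs / xmin)^(-(alpha_hat - 1))
cdf_emp <- seq_along(xs) / length(xs)
D <- max(abs(cdf_emp - cdf_fit))

## The p-value in Clauset et al. comes from repeating this on synthetic
## datasets generated from the fitted power law, not from ks.test() directly.
c(alpha_hat = alpha_hat, KS = D)
```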
