#### The Burden of Reviewers

*Astronomers write literally thousands of proposals each year to observe their favorite targets with their favorite telescopes. Every proposal must be accompanied by a technical justification, where the proposers demonstrate that their goal is achievable, usually via a simulation. Surprisingly, a large number of these justifications are statistically unsound. Guest Slogger Simon Vaughan describes the problem and shows what you can do to make reviewers happy (and you definitely want to keep reviewers happy).*

The feasibility analysis of any observing time proposal is one of its most important sections. For example, the XMM-Newton AO7 Policies & Procedures document says: “A realistic estimate of the observing time is a major selection criterion for the OTAC.”

The central part of the feasibility is usually a justification of the requested exposure time, using an analytical or numerical model of the observing process together with some assumptions about the expected data, to demonstrate the suitability of the expected data to answer the scientific question at hand. Telescope time is expensive and precious, so it is important not to waste it: too long an exposure wastes observing time that could be spent on other science, and too short an exposure risks wasting the entire observation if the science goals cannot be achieved.

Unfortunately, the procedure that is standard in X-ray astronomy is missing one important piece of information: a proposal will usually feature a calculation of the ‘significance’ of the desired result expected for a typical observation, but no figure will be given for the chance that the result might be missed. (The ‘significance’ itself is a random variable; for many proposals a large proportion of random realizations of the observation may yield insignificant results.) This is compounded by the fact that many proposals make use of just one random simulation in the calculation – one simulation that gives a satisfactory ‘result’. What is important is the distribution of expected data: what proportion of realizations show a ‘significant’ result?

A proposal that has a ‘failure probability’ of 0.8 is clearly a riskier proposition than one with a failure probability below 0.1. Yet it would not be difficult to produce a single, convincing-looking simulation in either case. I do not propose any particular value as a benchmark for the failure probability; the selection panel must make that judgement based on the perceived worth of the science goals of the proposal. The point is that the ‘failure probability’ is a piece of information that should be made available to proposal panels in order for them to make reasonable judgements about the best use of telescope time.

Broadly speaking there are three types of science goals for telescope proposals: (1) testing hypotheses; (2) estimating parameters of a specific model; (3) exploratory observations with no specific model to test (so-called `fishing expeditions’). The above argument is most clear for the first type, but is equally valid for the second type. A proposal to estimate some model parameter(s) will usually justify the requested exposure time on the basis that it will provide a confidence (or credible) region for the parameter that is sufficiently small for the estimate to be scientifically useful. In practice most proposal writers use either the expected values of the confidence limits, or the values obtained from a single simulation. In either case there may be a high, and unspecified, probability that the observation will in fact produce a confidence interval larger than requested (in which case the observation again does not meet the requirements of the proposal).
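For the parameter-estimation case, the chance of ending up with a too-wide interval can be worked out directly in a toy setting. The sketch below is only an illustration under loud assumptions: a single source count rate is estimated from the total counts, the background is taken as known exactly, the 90% interval is the Gaussian approximation (z = 1.645), and the rates, exposures, and the 20% target are made-up numbers. The function names are hypothetical.

```python
import math

def poisson_pmf(n, mean):
    """P(N = n) for a Poisson variable with mean > 0, evaluated in log space."""
    return math.exp(n * math.log(mean) - mean - math.lgamma(n + 1))

def interval_too_wide(n_counts, bkg_mean, max_rel_halfwidth, z=1.645):
    """True if the approximate 90% interval on the net counts is wider than requested."""
    net = n_counts - bkg_mean
    if net <= 0:
        return True  # no positive source estimate at all
    return z * math.sqrt(n_counts) / net > max_rel_halfwidth

def prob_too_wide(src_rate, bkg_rate, exposure, max_rel_halfwidth):
    """Probability that the observation yields an interval wider than requested."""
    bkg_mean = bkg_rate * exposure
    total_mean = (src_rate + bkg_rate) * exposure
    # Sum over all plausible count values (mean plus ten standard deviations).
    n_max = int(total_mean + 10 * math.sqrt(total_mean) + 10)
    return sum(poisson_pmf(n, total_mean)
               for n in range(n_max + 1)
               if interval_too_wide(n, bkg_mean, max_rel_halfwidth))

# Assumed numbers: 0.01 ct/s source, 0.002 ct/s background, 20% target half-width.
for exposure in (8000, 12000, 16000):
    p_fail = prob_too_wide(0.01, 0.002, exposure, 0.2)
    print(f"{exposure:6d} s: P(interval too wide) = {p_fail:.3f}")
```

In this toy setup, an exposure chosen so that the expected interval only just meets the target leaves roughly an even chance that the real interval misses it, which is the risk described above.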

The solution to this problem is to include, as standard, a ‘power’ calculation in the feasibility section of proposals. A power calculation for an X-ray telescope proposal would be relatively straightforward. Instead of simulating one dataset at a given exposure time, one would generate a larger number of simulations and ensure that the fraction of non-detections (or too-wide confidence intervals) was sufficiently small. Of course, in order to perform such a calculation one must have a well-defined hypothesis to test (e.g. a new source of given brightness, or a spectral line at a particular location and strength). This should be true of almost all proposals except perhaps the exploratory ‘fishing expeditions’. If a hypothesis is not completely specified (i.e. has parameters with uncertain values), one might be able to perform simulations using a plausible distribution of parameter values (‘predictive’ simulations?). If such power calculations were included as standard in most X-ray telescope proposals, the selection panel would have a valuable piece of additional information on which to base their judgements. Power calculations are a staple of experimental design in fields such as medical research, where sample sizes are set at a level that gives a reasonable probability of detecting the effect being sought, if it is real.
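As a rough illustration of the procedure just described, here is a minimal sketch of a simulated power calculation for the simplest possible case: detecting a source of assumed count rate against a known background, using a plain count threshold. The rates, the significance level `alpha`, and every function name are assumptions made up for the example; a real proposal would substitute its own model and its own test.

```python
import math
import random

def poisson_draw(rng, mean):
    """Draw one Poisson variate (Knuth's method; fine for means below a few hundred)."""
    limit = math.exp(-mean)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def detection_threshold(bkg_mean, alpha):
    """Smallest count k with P(N > k | background only) <= alpha."""
    k = 0
    pmf = math.exp(-bkg_mean)
    cdf = pmf
    while 1.0 - cdf > alpha:
        k += 1
        pmf *= bkg_mean / k
        cdf += pmf
    return k

def simulated_power(src_rate, bkg_rate, exposure, alpha=0.001, n_sim=2000, seed=1):
    """Fraction of simulated observations in which the source is detected."""
    rng = random.Random(seed)
    threshold = detection_threshold(bkg_rate * exposure, alpha)
    total_mean = (src_rate + bkg_rate) * exposure
    detections = sum(poisson_draw(rng, total_mean) > threshold for _ in range(n_sim))
    return detections / n_sim

# Assumed rates (ct/s); the 'failure probability' is simply 1 - power.
for exposure in (5e3, 1e4, 2e4, 4e4):
    power = simulated_power(src_rate=1e-3, bkg_rate=1e-3, exposure=exposure)
    print(f"{exposure:7.0f} s: power = {power:.2f}, failure probability = {1 - power:.2f}")
```

The number a panel would want to see is the last column: how often an observation of this length fails to deliver the claimed detection.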

Unfortunately, there is no mechanism for properly reviewing the success of completed observing proposals, and so systematic over- or under-estimates in the exposure times may not be immediately apparent. The fact that so many observations do provide interesting results may tell us more about the richness of Nature (providing unplanned results), or the ingenuity of the observers to make use of the available data, than the design goals of the proposal.

## vlk:

I have a Golden Rule that I apply first to any proposal I am called upon to review. Granting that the observation will produce exactly the data that the proposal calls for, are they sufficient to achieve the goals of the observation? Often it turns out not to be so.

04-17-2008, 4:45 pm

## aneta:

Has anybody checked how often the predicted confidence claimed in the proposal agreed with the performed observations? This might be a statistically interesting study to do.

But what about unexpected results, discoveries that are hard to predict? The observations that bring exactly what we have planned for in the simulations are very good, but may not be as exciting as the ones that carry the discoveries. How many of those have we had recently? Plenty of X-ray jets, unpredicted spectral features, interesting morphology of X-ray gas, etc. How can the panel foresee those in the final decision?

04-17-2008, 9:49 pm

## Simon Vaughan:

Of course we all rely on serendipity to bring us what we never dreamed of, but the panel must surely judge a proposal based on the scientific case. Any proposal may produce an exciting serendipitous discovery, but there’s no way to judge which ones are more likely to – except perhaps for ‘fishing expeditions’ which are open searches for *any* new result, where panels might favour the least explored class of objects.

It would indeed be interesting to compare the results of feasibility calculations to those from the real data, but I don’t see how this is possible since proposals are always kept secret. Maybe panels should be asked to log a one line description of the ‘significance’ (or similar) that is claimed in the proposals, which could later be compared to the result from the real data by the mission teams. They could then compare the statistics without revealing any private information on specific proposals. (Trouble is, no panel or mission team wants the extra work, and publication bias means that the ‘significance’ of the intended result may never be published if it is not as high as advertised in the proposal!)

Again, medical trials are ahead of us (in this respect). It is now standard (at least in the UK) to log the details of a medical trial (science goal, sample size, duration, etc) before it is performed. That way there’s a record of every trial, whether or not the results are published. The number of non-results can then be estimated. I should point out that what concerns me is the lack of a result, not an interesting null result. If a predicted feature is not confirmed, at high ‘significance’, that is all well and good, and potentially interesting. But if the data are inadequate to tell either way then the proposal was a waste of telescope time.

04-18-2008, 4:47 am

## vlk:

As Simon says, proposals are never made public, so it is difficult to check post facto how well they held up. But if anyone is curious enough to do the hard yards, do literature searches, comparative proposalology and the like, a good starting point is the Chandra list of accepted targets: http://cxc.harvard.edu/target_lists/index.html

A word of warning also, that proposals should not be held to the same standard as manuscripts submitted to journals (because proposals are usually written in a tearing hurry, and the space limits can be crippling, and it is simply not fair to expect the same analysis effort spent on simulated data as would be on real data), so some sloppiness is acceptable.

04-19-2008, 12:15 am

## Simon Vaughan:

I agree with Vinay that proposals usually are written in a hurry, but I’m not convinced this is a valid excuse for a sloppy feasibility study. Telescope time is very expensive: an order of magnitude estimate might be something like $1 per second of exposure for a major mission like CXO or XMM. Individual proposals are therefore worth ~$10,000 to $1,000,000. On the one hand we have a duty to the taxpayers (and fellow observers) not to waste this precious resource by proposing observations that have a high chance of failure (by routinely hiding or ignoring this fact). On the other hand, given the high oversubscription rates of these missions, one can understand the reluctance of individual proposers to spend their own time performing lengthy calculations. The self-interests of individual proposers may be resulting in large amounts of wasted or far-from-optimal observations. This could be rectified if there was a requirement to include a ‘power’ calculation in the feasibility study for each proposal. That might simultaneously increase the time proposers spend on each proposal, increase the quality (or achievability) of the average proposal, and reduce the burden on reviewers.

But then the bureaucrats would worry the oversubscription rate was dropping! (You cannot please everyone all the time.)

04-19-2008, 7:35 am

## Simon Vaughan:

Sorry – my spellchecking went a bit mad on the last post!

04-19-2008, 7:37 am

## hlee:

What would be the most difficult challenge in implementing the power calculation in, say, FAKEIT while avoiding Monte Carlo simulation? Because of expense and ethics, sample size and time are chosen in biostatistics to control the levels of type I and type II errors by scrutinizing models. If a tool is developed and implemented, such sloppiness and the burden on panels would be of no concern. I wonder how this problem can be laid out in a more statistical sense.
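For one very restricted case the power can be written down without any Monte Carlo at all, which may help frame the question. The sketch below assumes a pure counting experiment: known background rate, an assumed source rate, and detection declared when the total counts exceed a threshold set by the type I error `alpha`. The power is then just a Poisson tail probability, and the shortest exposure meeting a type II error `beta` can be found by stepping along a grid. All names and numbers are hypothetical, and nothing here covers spectral fitting or any more realistic test.

```python
import math

def poisson_cdf(k, mean):
    """P(N <= k) for a Poisson variable with mean > 0, summed in log space."""
    return sum(math.exp(n * math.log(mean) - mean - math.lgamma(n + 1))
               for n in range(k + 1))

def exact_power(src_rate, bkg_rate, exposure, alpha):
    """Exact power of a count-threshold detection test at the given exposure."""
    bkg_mean = bkg_rate * exposure
    total_mean = (src_rate + bkg_rate) * exposure
    # Smallest threshold k whose false-alarm probability is at most alpha.
    k = 0
    while 1.0 - poisson_cdf(k, bkg_mean) > alpha:
        k += 1
    return 1.0 - poisson_cdf(k, total_mean)

def minimum_exposure(src_rate, bkg_rate, alpha, beta, step=1000.0, max_exposure=2e5):
    """Shortest exposure on a grid with type I error <= alpha and type II error <= beta."""
    exposure = step
    while exposure <= max_exposure:
        if 1.0 - exact_power(src_rate, bkg_rate, exposure, alpha) <= beta:
            return exposure
        exposure += step
    return None  # target not reachable within max_exposure

# Assumed rates (ct/s), alpha = 0.001, beta = 0.1 (i.e. power of at least 0.9).
t_min = minimum_exposure(1e-3, 1e-3, alpha=1e-3, beta=0.1)
print(f"shortest grid exposure with power >= 0.9: {t_min:.0f} s")
```

Because counts are discrete, the power is a slightly saw-toothed function of exposure, so the grid search returns the first exposure that meets the target rather than assuming a smooth curve.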

04-22-2008, 1:18 am

## vlk:

I think focusing on FAKEIT is a red herring. All it does is generate counts by drawing from a Poisson distribution, given the astrophysical model and the instrument model (ARF+RMF). It doesn’t carry out either significance tests or power tests. Those are tests that the astronomer chooses to apply _using_ the simulated counts from FAKEIT (or other sources).
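To make that description concrete, here is a toy version of the forward-folding step: model flux times effective area times exposure, redistributed into detector channels, then one Poisson draw per channel. This is only a sketch under stated assumptions and is not how XSPEC or Sherpa are implemented; the three-bin ARF and RMF are invented numbers, and `generate_fake_dataset` is a hypothetical name.

```python
import math
import random

def poisson_draw(rng, mean):
    """Draw one Poisson variate (Knuth's method; adequate for small means)."""
    limit = math.exp(-mean)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def generate_fake_dataset(model_flux, arf, rmf, exposure, rng):
    """Toy forward fold: flux -> expected counts per channel -> Poisson counts.

    model_flux: photons / cm^2 / s in each energy bin
    arf:        effective area (cm^2) in each energy bin
    rmf:        rmf[i][j] = probability that a photon in energy bin i lands in channel j
    """
    n_channels = len(rmf[0])
    expected = [0.0] * n_channels
    for flux, area, row in zip(model_flux, arf, rmf):
        for j, prob in enumerate(row):
            expected[j] += flux * area * exposure * prob
    counts = [poisson_draw(rng, mu) for mu in expected]
    return expected, counts

# Made-up three-bin example; none of these numbers describe a real instrument.
model_flux = [2e-4, 1e-4, 5e-5]
arf = [300.0, 500.0, 200.0]
rmf = [[0.9, 0.1, 0.0],
       [0.1, 0.8, 0.1],
       [0.0, 0.2, 0.8]]
expected, counts = generate_fake_dataset(model_flux, arf, rmf,
                                         exposure=1e3, rng=random.Random(42))
print("expected counts per channel:", expected)
print("one simulated realization:  ", counts)
```

Any significance or power test is then something applied to `counts`, repeated over many realizations, which is the distinction being drawn here.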

04-22-2008, 6:27 pm

## hlee:

1. If vlk’s comment is an answer to “FAKEIT”: please advise me of any modules in Sherpa or XSPEC that treat exposure time as a variable and are used in writing proposals. I would like to investigate providing a guideline for suitable exposure times with given type I and type II errors. I didn’t mean that FAKEIT does tests. I thought that since it considers time as a variable and people use FAKEIT to get exposure times for their proposals, FAKEIT could be a starting point for further power studies.

2. If vlk’s comment is an answer to “to implement the power calculation”: I’m disappointed, because the comment sounds like one is selling a drug before the completion of a clinical trial. Consider a prescription drug known for curing a disease. Unfortunately, there were a few reports (or some doubts) that the drug could have killed/deteriorated patients. How could we tell the drug actually killed/could kill a person? Since FAKEIT simulates spectra according to exposure time, we could build an empirical power function along exposure time. Instead of quitting the clinical trials, it would be nice to attempt a study with some simple source models by utilizing the ready implemented modules (packages).

I guess Simon Vaughan has the sketch to resolve the problem. Can it be shared???

04-22-2008, 11:30 pm

## vlk:

Hyunsook, I think you are mistakenly conflating the tool with the analysis. All FAKEIT is, is a glorified random number generator. I don’t see how you can implement power calculations into it when it is inherently unknowable what context it will be used in. Proposers can be endlessly inventive in terms of the types of tests they carry out, and each proposal comes with its own specific analyses and tests. Furthermore, it is just one tool among many, and not everybody uses Sherpa or XSPEC.

Perhaps you meant to give examples of how one goes about doing a power calculation. That will be useful, certainly. But in that case, all you need to say in your prescription is “generate dataset from model here” (“generate_fake_dataset()”) and it will be understood what is meant.

04-23-2008, 1:03 am

## David van Dyk:

This general problem is analogous to the “sample size calculations” that go into almost any proposal for funding of a medical study: “If my drug has an effect of X, how many individuals must be in my study so that I have an XX% chance of getting a positive result?” So lots of people have studied it, but with very different models than are used in astronomy.

Simulations are an easy way to proceed when mathematical calculations are difficult. Simulate 100 data sets WITH the supposed feature and a given exposure time. Run a statistical analysis of each and see what percent of the time you get a positive detection. This is (as Simon says) a power calculation. I’m surprised it is not standard. The NIH (US National Institutes of Health, the main US funding agency for health studies) would never fund something without a power calculation!
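One practical footnote to the 100-simulation recipe: the power estimated this way carries its own sampling error, which is worth quoting alongside it. A minimal sketch, assuming each simulation is an independent detect/no-detect trial so the number of detections is binomial; `power_estimate` is a name invented for illustration.

```python
import math

def power_estimate(n_detected, n_sim):
    """Estimated power and its binomial standard error."""
    power = n_detected / n_sim
    std_err = math.sqrt(power * (1.0 - power) / n_sim)
    return power, std_err

# Same underlying detection fraction (80%), different numbers of simulations.
for n_sim in (100, 1000, 10000):
    power, err = power_estimate(round(0.8 * n_sim), n_sim)
    print(f"{n_sim:6d} simulations: power = {power:.2f} +/- {err:.3f}")
```

With 100 simulations and 80 detections the standard error is 0.04, which is usually fine for telling a failure probability near 0.2 from one near 0.8.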

With only one simulation you could be very misled. If that were not the case, the exposure time is probably too long. You are interested in using just enough exposure time to see the feature (to save/share resources, I assume). Thus the power should not be 100%.

05-07-2008, 4:40 am