Comments on: The Burden of Reviewers
http://hea-www.harvard.edu/AstroStat/slog/2008/the-burden-of-reviewers/
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

By: David van Dyk (Wed, 07 May 2008 08:40:54 +0000)

This general problem is analogous to the “sample size calculations” that go into almost any proposal for funding of a medical study: “If my drug has an effect of X, how many individuals must be in my study so that I have an XX% chance of getting a positive result?” So lots of people have studied it, but with very different models than are used in astronomy.

Simulations are an easy way to proceed when mathematical calculations are difficult. Simulate 100 data sets WITH the supposed feature and a given exposure time. Run a statistical analysis of each and see what percent of the time you get a positive detection. This is (as Simon says) a power calculation. I’m surprised it is not standard. The NIH (the US National Institutes of Health, the main US health-studies funding agency) would never fund something without a power calculation!

With only one simulation you could be very misled. (And if a single simulation could not mislead you, the exposure time is probably too long.) You are interested in using just enough exposure time to see the feature (to save/share resources, I assume). Thus the power should not be 100%.
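A minimal sketch of this Monte Carlo power calculation, with a toy Poisson source-plus-background excess test standing in for a real spectral simulation and fit (the count rates, the 3-sigma threshold, and the 20 ks example are illustrative assumptions, not anything from the comment):

import numpy as np

rng = np.random.default_rng(42)

def detect_feature(exposure_s, src_rate=0.002, bkg_rate=0.01, z_threshold=3.0):
    """Simulate one data set WITH the supposed feature and test for it.

    Rates are in counts/s and stand in for a real spectral model folded
    through the instrument response; the 'detection' is a simple
    excess-over-background test at z_threshold sigma.
    """
    observed = rng.poisson((src_rate + bkg_rate) * exposure_s)
    expected_bkg = bkg_rate * exposure_s
    z = (observed - expected_bkg) / np.sqrt(expected_bkg)   # Gaussian approx.
    return z > z_threshold

def power(exposure_s, n_sim=100):
    """Fraction of simulated data sets yielding a positive detection."""
    return np.mean([detect_feature(exposure_s) for _ in range(n_sim)])

print(power(20_000))   # estimated power for a 20 ks exposure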

By: vlk (Wed, 23 Apr 2008 05:03:36 +0000)

Hyunsook, I think you are mistakenly conflating the tool with the analysis. All FAKEIT is, is a glorified random number generator. I don’t see how you can implement power calculations into it when it is inherently unknowable what context it will be used in. Proposers can be endlessly inventive in terms of the types of tests they carry out, and each proposal comes with its own specific analyses and tests. Furthermore, it is just one tool among many, and not everybody uses Sherpa or XSPEC.

Perhaps you meant to give examples of how one goes about doing a power calculation. That will be useful, certainly. But in that case, all you need to say in your prescription is “generate dataset from model here” (“generate_fake_dataset()”) and it will be understood what is meant.

By: hlee (Wed, 23 Apr 2008 03:30:07 +0000)

1. If vlk’s comment is an answer to “FAKEIT,” please advise me of any modules in Sherpa or XSPEC that treat exposure time as a variable and are used in writing proposals. I would like to investigate providing a guideline for suitable exposure times for given type I and type II errors. I didn’t mean that FAKEIT does tests. I thought that, since it treats exposure time as a variable and people use it to get exposure times for their proposals, FAKEIT could be a starting point for further power studies.

2. If vlk’s comment is an answer to “to implement the power calculation,” I’m disappointed, because the comment sounds like one is selling a drug before the completion of a clinical trial. Consider a prescription drug known for curing a disease. Unfortunately, there were a few reports (or some doubts) that the drug could have killed or harmed patients. How could we tell whether the drug actually killed, or could kill, a person? Since FAKEIT simulates spectra according to exposure time, we could build an empirical power function along exposure time. Instead of quitting the clinical trials, it would be nice to attempt a study with some simple source models by utilizing the already-implemented modules (packages).

I guess Simon Vaughan has the sketch to resolve the problem. Can it be shared???
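A rough sketch of the empirical power function idea, assuming some simulate_and_test(exposure_s) routine (for example a FAKEIT run followed by the proposer’s own detection test) that returns True when the feature is recovered; the exposure grid and the 80% power target are arbitrary choices:

import numpy as np

def empirical_power(exposure_s, simulate_and_test, n_sim=100):
    """Monte Carlo estimate of the detection probability at one exposure."""
    return np.mean([simulate_and_test(exposure_s) for _ in range(n_sim)])

def minimum_exposure(simulate_and_test, target_power=0.8,
                     grid=np.arange(5_000, 105_000, 5_000)):
    """Scan a grid of exposure times (seconds) and return the shortest one
    whose empirical power reaches the target, or None if none does."""
    for t in grid:
        if empirical_power(t, simulate_and_test) >= target_power:
            return t
    return None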

By: vlk (Tue, 22 Apr 2008 22:27:09 +0000)

I think focusing on FAKEIT is a red herring. All it does is generate counts by drawing from a Poisson distribution, given the astrophysical model and the instrument model (ARF+RMF). It doesn’t carry out either significance tests or power tests. Those are tests that the astronomer chooses to apply _using_ the simulated counts from FAKEIT (or other sources).
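Schematically, that count-generation step is just forward-folding the model through the response and drawing Poisson deviates; a sketch under simplified assumptions (the units, energy grids, and binning conventions of real ARF/RMF files are ignored here):

import numpy as np

rng = np.random.default_rng()

def fake_counts(model_flux, arf, rmf, exposure_s, de):
    """Toy FAKEIT-style simulation.

    model_flux : photon flux density per energy bin [photons/cm^2/s/keV]
    arf        : effective area per energy bin [cm^2]
    rmf        : redistribution matrix, shape (n_channels, n_energy_bins)
    exposure_s : exposure time [s]
    de         : energy bin widths [keV]
    """
    # expected counts per detector channel, then a Poisson draw around them
    expected = exposure_s * rmf @ (arf * model_flux * de)
    return rng.poisson(expected)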

By: hlee (Tue, 22 Apr 2008 05:18:40 +0000)

What would be the most difficult challenge in implementing the power calculation in, say, FAKEIT while avoiding Monte Carlo simulation? Because of expense and ethics, in biostatistics the sample size and study duration are chosen to control the levels of type I and type II errors by scrutinizing models. If such a tool were developed and implemented, this sloppiness and the burden on panels would be no concern. I wonder how this problem can be laid out in a more statistical sense.
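For the “avoiding Monte Carlo” part, the closest analogue to the biostatistics sample-size formulas is a closed-form Gaussian approximation; a crude sketch (the source/background rate parametrization and the single-bin excess test are simplifying assumptions, not a real spectral analysis):

from scipy.stats import norm

def required_exposure(src_rate, bkg_rate, alpha=0.0013, beta=0.2):
    """Closed-form exposure estimate controlling type I and type II errors.

    Gaussian approximation: declare a detection when the excess src_rate*t
    exceeds z_(1-alpha) times the noise sqrt((src_rate+bkg_rate)*t), and
    require probability 1-beta of that happening. Solving
        src_rate*t >= (z_a + z_b) * sqrt((src_rate+bkg_rate)*t)
    for t gives the expression below. Rates are counts/s; alpha=0.0013 is
    roughly a one-sided 3-sigma threshold, beta=0.2 is 80% power.
    """
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(1 - beta)
    return (z_a + z_b) ** 2 * (src_rate + bkg_rate) / src_rate ** 2

print(required_exposure(src_rate=0.002, bkg_rate=0.01))   # seconds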

By: Simon Vaughan (Sat, 19 Apr 2008 11:37:36 +0000)

Sorry – my spellchecking went a bit mad on the last post!

By: Simon Vaughan (Sat, 19 Apr 2008 11:35:46 +0000)

I agree with Vinay that proposals usually are written in a hurry, but I’m not convinced this is a valid excuse for a sloppy feasibility study. Telescope time is very expensive: an order-of-magnitude estimate might be something like $1 per second of exposure for a major mission like CXO or XMM. Individual proposals are therefore worth ~$10,000 to $1,000,000. On the one hand, we have a duty to the taxpayers (and fellow observers) not to waste this precious resource by proposing observations that have a high chance of failure (and by routinely hiding or ignoring this fact). On the other hand, given the high oversubscription rates of these missions, one can understand the reluctance of individual proposers to spend their own time performing lengthy calculations. The self-interest of individual proposers may be resulting in large amounts of wasted or far-from-optimal observations. This could be rectified if there were a requirement to include a ‘power’ calculation in the feasibility study for each proposal. That might simultaneously increase the time proposers spend on each proposal, increase the quality (or achievability) of the average proposal, and reduce the burden on reviewers.

But then the bureaucrats would worry the oversubscription rate was dropping! (You cannot please everyone all the time.)

By: vlk (Sat, 19 Apr 2008 04:15:14 +0000)

As Simon says, proposals are never made public, so it is difficult to check post facto how well they held up. But if anyone is curious enough to do the hard yards, do literature searches, comparative proposalology and the like, a good starting point is the Chandra list of accepted targets: http://cxc.harvard.edu/target_lists/index.html

A word of warning also: proposals should not be held to the same standard as manuscripts submitted to journals (proposals are usually written in a tearing hurry, the space limits can be crippling, and it is simply not fair to expect the same analysis effort on simulated data as on real data), so some sloppiness is acceptable.

By: Simon Vaughan (Fri, 18 Apr 2008 08:47:53 +0000)

Of course we all rely on serendipity to bring us what we never dreamed of, but the panel must surely judge a proposal based on the scientific case. Any proposal may produce an exciting serendipitous discovery, but there’s no way to judge which ones are more likely to do so – except perhaps for ‘fishing expeditions’, which are open searches for *any* new result, where panels might favour the least-explored class of objects.

It would indeed be interesting to compare the results of feasibility calculations to those from the real data, but I don’t see how this is possible since proposals are always kept secret. Maybe panels should be asked to log a one line description of the ‘significance’ (or similar) that is claimed in the proposals, which could later be compared to the result from the real data by the mission teams. They could then compare the statistics without revealing any private information on specific proposals. (Trouble is, no panel or mission team wants the extra work, and publication bias means that the ‘significance’ of the intended result may never be published if it is not as high as advertised in the proposal!)

Again, medical trials are ahead of us (in this respect). It is now standard (at least in the UK) to log the details of a medical trial (science goal, sample size, duration, etc) before it is performed. That way there’s a record of every trial, whether or not the results are published. The number of non-results can then be estimated. I should point out that what concerns me is the lack of a result, not an interesting null result. If a predicted feature is not confirmed, at high ‘significance’, that is all well and good, and potentially interesting. But if the data are inadequate to tell either way then the proposal was a waste of telescope time.

By: aneta (Fri, 18 Apr 2008 01:49:43 +0000)

Has anybody checked how often the predicted confidence claimed in a proposal agreed with the performed observations? This might be a statistically interesting study to do.

But what about unexpected results, discoveries that are hard to predict? The observations that bring exactly what we have planned for in the simulations are very good, but may not be as exciting as the ones that carry the discoveries. How many of those have we had recently? Plenty of X-ray jets, unpredicted spectral features, interesting morphology of X-ray gas etc…. How can the panel foresee those in the final decision?
