#### [Q] Objectivity and Frequentist Statistics

Is there an objective method to combine measurements of the same quantity obtained with different instruments?

Suppose you have a set of *N_1* measurements obtained with one detector, and another set of *N_2* measurements obtained with a second detector. And let’s say you wanted something as simple as an estimate of the mean of the quantity (say the intensity) being measured. Let us further stipulate that the measurement errors of each of the points are similar in magnitude and that neither instrument displays any odd behavior. How does one combine the two datasets without appealing to subjective biases about the reliability or otherwise of the two instruments?

We’ve mentioned this problem before, but I don’t think there’s been a satisfactory answer.

The simplest thing to do would be to simply pool all the measurements into one dataset with *N = N_1 + N_2* measurements and compute the mean that way. But if the number of points in each dataset is very different, the simple combined sample mean is actually a statement of bias in favor of the dataset with more measurements.
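A quick numerical sketch of that bias (the sample sizes and means below are made up for illustration):

```python
# Hypothetical: detector 1 yields 90 points averaging 4.8,
# detector 2 yields 12 points averaging 6.5.
n1, mean1 = 90, 4.8
n2, mean2 = 12, 6.5

# The pooled mean weights each *measurement* equally...
pooled = (n1 * mean1 + n2 * mean2) / (n1 + n2)

# ...whereas weighting each *detector* equally gives a different answer.
equal_weight = 0.5 * (mean1 + mean2)
# pooled sits near 5.0, close to detector 1's mean; equal_weight is 5.65.
```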

In a Bayesian context, there seems to be at least a well-defined prescription: define a model, compute the posterior probability density for the model parameters using dataset 1 using some non-informative prior, use this posterior density as the prior density in the next step, where a new posterior density is computed from dataset 2.
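For the simplest illustrative case, a Gaussian mean with known noise variance, that prescription can be sketched as follows (the conjugate-normal model and all numbers here are assumptions for illustration, not part of the original problem):

```python
import statistics

def update_normal(prior_mean, prior_var, data, noise_var):
    """Conjugate-normal update: posterior (mean, variance) for an
    unknown mean, given Gaussian data with known noise variance."""
    n = len(data)
    xbar = statistics.fmean(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + n * xbar / noise_var)
    return post_mean, post_var

# Step 1: dataset 1 with a very diffuse ("non-informative") prior.
m1, v1 = update_normal(0.0, 1e6, [4.9, 5.1, 5.0], noise_var=0.1)
# Step 2: re-use that posterior as the prior for dataset 2.
m2, v2 = update_normal(m1, v1, [5.3, 5.2], noise_var=0.1)
```

The order of the two steps does not matter; the final posterior is the same as processing all five points at once, which is what makes the prescription look well defined.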

What does one do in the frequentist universe?

**[Update 9/30]** After considerable discussion, it seems clear that there is no way to do this without making *some* assumption about the reliability of the detectors. In other words, disinterested objectivity is a mirage.

## hlee:

I wonder who would answer this question. Is there any die-hard frequentist statistician reading the slog? I don’t believe so. Also, as last time, there is no well-defined statistical question to be answered. Let me describe why by providing some rephrased questions.

1. As you described, if the objective of your statistical study is comparing two means while having two allegedly different sample variances, the two-sample t-test immediately comes to mind, where you can find ways to pool the two variances accordingly. (As a matter of fact, there are different formulas for pooling variances, but nowadays books do not discuss this matter in detail because of statistical packages.)

2. The reason I used “allegedly” is that although it is claimed that, empirically, the variances from the two instruments are different, your words suggest the claim has not been tested. Therefore, another way to phrase your question is as a test of whether the two sample variances are the same or not: a simple F-test.

3. I know that I have oversimplified your question with the t-test and F-test, although I know what grand scientific issues lie behind it. Probably I’ll get a book and write down the list of contents, for each of which I can state statistical objectives and methodologies in a more elaborate fashion to reflect your statistically somewhat vague question. Instead, since you mentioned Bayesian methods (although you didn’t specify exact models), I’ll assume there is a likelihood function. Having the likelihood, my corresponding vague questions are: “are you interested in estimating the likelihood function?” “is this a mixture problem?” “do you wish to perform a likelihood ratio test to know whether your model is a model of two different uncertainties?” “do you want to compare models?” “is your likelihood legitimate?” “what is the obstacle to estimating the likelihood, and how does one overcome it?” “how does one get consistent estimators under such likelihoods? (instead of consistent, it could be other adjectives that statisticians tend to use)” or “what would be realistic methods for estimating the likelihood function and its parameters?”

These questions may seem very redundant, but considering the number of papers published under “LIKELIHOOD FUNCTION,” please understand that statisticians can only answer statistically formulated questions.
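For reference, the two-sample (Welch) t statistic and the variance-ratio F statistic that points 1 and 2 refer to can be sketched in pure Python (a minimal sketch returning the test statistics only, not p-values):

```python
import statistics

def welch_t(x, y):
    """Welch's two-sample t statistic for comparing two means
    without assuming equal variances."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (mx - my) / (vx / len(x) + vy / len(y)) ** 0.5

def f_stat(x, y):
    """F statistic (ratio of sample variances) for testing whether
    two sample variances are the same."""
    return statistics.variance(x) / statistics.variance(y)
```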

09-29-2008, 6:28 pm

## vlk:

I’m afraid you are answering the wrong question. I am not interested in *comparing* the means. I am interested in *combining* them. So neither the two-sample t-test nor the F-test seems to be appropriate.

This is a very simple question. There are two datasets {X1_i, i=1..N1} and {X2_i, i=1..N2}. They are measurements of the same thing, using different instruments. There may well be residual unaccounted-for offsets between the two, but we do not necessarily know which is the correct one. How then do you combine the two datasets to get the combined mean, variance, skew, or whatever else may be of interest, in such a manner that the calculated summary statistic exhibits no bias towards either of the datasets? In other words, what is the objective way to combine {X1} and {X2} without making any underlying assumption about the validity of either instrument?

The Bayesian case was given only as a metaphorical example. Please do not assume that there is a likelihood function.

09-29-2008, 6:42 pm

## aneta:

You are asking a hard question… I was thinking that this could be just a simple addition of counts, but if we have two detectors, then for a given source intensity the number of counts detected in one will be different from the number detected in the second detector.

I don’t think you can combine the counts in this case. One needs to think about the source intensity and the probability of detecting the observed number of counts in each case separately. Then there may be a method to look at the joint probability. What is the question that you want to answer with the analysis?

Bayesian approach seems the easier one…

09-29-2008, 9:58 pm

## vlk:

I was hoping to avoid giving too many particulars about the motivating problem, because I really want a general answer. But, if it helps, here’s the original problem.

Two detectors were used to measure the response of the HRMA to X-rays, the FPC and the SSD. They both have different QEs and responses, of course, but there are raytrace programs that can do a very good job of emulating them and can get the answer right to within about 10% (see e.g., Fig 4.2 in the Chandra POG). The task though is to estimate the HRMA effective area based on these measurements. The FPC has about 12 measurements at 9 different energies. The SSD has about 100 measurements at 100 different energies. The two measurements agree to about 10%, but the statistical errors are about 1% for the FPC and 2-5% for the SSD data points, so there is definitely some systematic error there. There is no reason to discount either of the instruments. A correction to the raytrace HRMA EA must be estimated as some kind of “average” between the two (the red curve in the lower panel of Fig 4.2). A Bayesian calculation will be far too complicated because unless it is done right and done completely end-to-end (not gonna happen), there is no point in even trying it.

How do Frequentists choose a weighting to combine such datasets, and how do they choose it objectively?
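The textbook frequentist device here is inverse-variance weighting of the individual points. A minimal sketch (the two values and errors below are hypothetical, loosely echoing the ~1% FPC and 2-5% SSD errors):

```python
def ivw_mean(values, sigmas):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1.0 / s ** 2 for s in sigmas]
    wsum = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / wsum
    return mean, wsum ** -0.5

# Hypothetical EA-correction factors at one energy:
# an FPC point with a 1% error and an SSD point with a 3% error.
mean, err = ivw_mean([1.00, 1.10], [0.01, 0.03])
```

Of course, this already encodes an assumption: that the quoted statistical errors capture each instrument’s reliability, which is precisely what the unmodeled systematic offset calls into question.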

09-30-2008, 1:04 am

## hlee:

In my opinion, there was no statistical question, and I only wrote opinions to show how unclear your problem is. Pooling datasets needs far more information than the statement that there are two sets of measurements of the same thing and you wish to know how to combine them objectively. Without actual data, a priori knowledge, or proper assumptions, no one can set up statistical strategies. Without knowing the truth or developing test statistics, how do you know whether there is bias or not before seeking unbiased methods? Are simulation studies available for your experiments? A method may be unbiased for samples drawn from one parametric family of distributions but biased when the samples were drawn from a different family. Whether Bayesian or frequentist, we need more information to establish statistical strategies for your problem. If one set of data came from an invalid instrument, how would you know whether the instrument malfunctioned or you observed something peculiar? (I guess experienced eyes will spot the anomaly before this argument.) I hope your question can be articulated in a more statistically quantifiable and descriptive way. Otherwise, just combine the datasets and use plug-in estimators of the quantities of interest, while crossing your fingers that the results are unbiased and robust.

I regret commenting; I should not have done it, as last time. Instead of deleting it, I just babbled again. As there is no answer to “which came first, the hen or the egg,” I don’t think there will be an answer to your question, because there is neither a model nor an assumption.

09-30-2008, 1:16 am

## vlk:

See response to Aneta’s comment below for details. When two separate experiments purporting to measure the exact same quantity differ by greater than the statistical error, I think it is safe to say that there does exist some bias. We don’t know which experiment is wrong, but that is not the question being posed.

I gave enough detail that a Bayesian strategy can be proposed to deal with it. But using a Bayesian analysis is not feasible in this case for computational reasons. Is there a Frequentist strategy akin to the Bayesian one I sketched out that can be applied? I’ve suspected that there isn’t, and that any strategy that is used must necessarily reflect a choice about the relative reliability of the two experiments.

btw, this is going slightly off-message, but certainly there is an answer to “which came first, hen or egg”. The egg, of course, as a mutation in a proto-hen that then produced the first hen.

09-30-2008, 1:32 am

## Simon Vaughan:

Perhaps I’ve missed the point, but what about maximum likelihood fitting? This is frequentist, right? (Wasn’t it Fisher’s idea?) If you have likelihoods for each model-data pair then you can combine them to give one likelihood for the model fitting the two datasets simultaneously. In combining the likelihoods you will account for the difference in response and/or number of data points in each dataset. Or, were you instead after some textbook statistic with a known reference distribution to use as a quick and painless first pass? I.e. without forward-fitting anything?

09-30-2008, 7:21 am

## vlk:

> In combining the likelihoods you will account for the difference in … number of data points in each dataset

Aye, there’s the rub. Indeed, people have been doing simultaneous fits to multiple spectra, and in a sense it is similar to the Bayesian prior-propagation strategy I outlined above. But the problem is, what to do when there is a great disparity in the number of bins but not in the measurement errors? Your fit will invariably be driven by the spectrum with more bins. That seems wrong to me when your a priori expectation is to give equal weight to both spectra. We know how to deal with that in Bayesian analysis, but what do Frequentists do?
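The effect is easy to exhibit: for Gaussian data, the joint maximum-likelihood estimate of a common mean reduces to inverse-variance weighting, so the spectrum with more bins dominates. A minimal sketch with made-up numbers:

```python
import statistics

def joint_mle_mean(d1, d2, s1, s2):
    """MLE of a common mean from two Gaussian datasets with per-point
    errors s1 and s2; equivalent to inverse-variance weighting."""
    w1 = len(d1) / s1 ** 2
    w2 = len(d2) / s2 ** 2
    return (w1 * statistics.fmean(d1) + w2 * statistics.fmean(d2)) / (w1 + w2)

# 100 bins at one level, 10 bins at another, identical per-bin errors:
# the combined estimate lands close to the larger dataset's mean.
m = joint_mle_mean([5.0] * 100, [7.0] * 10, 1.0, 1.0)
```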

09-30-2008, 9:42 am

## Simon Vaughan:

On what basis are you concluding that the two spectra should have equal weight? If one dataset has 100 data points and S/N ~10 per datum, and the other has similar S/N but 10 data points, doesn’t the first dataset contain more information (in either the Shannon or Fisher sense)? Is this because you are more concerned about a systematic error in one (or both) datasets, which is independent of the number of data points?

10-03-2008, 10:08 am

## vlk:

Well, it could be because the first dataset oversamples the RMF, and the second integrates over the passband. They could be different telescopes entirely, with different exposure times and only a partial overlap in wavelength range. And also, as you say, I am concerned about possible systematic errors in one or both datasets, which have nothing to do with the number of data points.

Anyway, after considerable offline discussion, it came down to exactly that point. If you believe that there is some reason to believe one dataset over the other, find a weight that uses that information. Without that information, the only thing to do is lump them all together and pray.

10-03-2008, 2:37 pm

## Paul B:

Sorry I’m a bit slow in responding but I thought I’d point out that (if I understand your question correctly) there is a large amount of literature on this — meta-analysis. Meta-analysis is the study of how to combine different studies of the same quantity, usually in the context of clinical/medical trials where different research groups have published different results with different sample sizes, error bars etc. Most of the advances in meta-analysis have come from the Bayesian perspective (as you may expect) but in short, if you want to combine them then you’ll have to make some assumptions. It is an extremely active research field, especially with the proliferation of studies on the same topic — combining them in some coherent manner is definitely challenging but potentially very powerful.
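One standard meta-analytic device for exactly this tension is the random-effects model, which widens each study’s weight by an estimated between-study variance. A minimal sketch of the DerSimonian-Laird estimator (illustrative; not something proposed in the thread):

```python
def random_effects(effects, ses):
    """DerSimonian-Laird random-effects pooling: estimate the
    between-study variance tau^2 and the re-weighted pooled effect."""
    w = [1.0 / s ** 2 for s in ses]
    sw = sum(w)
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sw
    # Cochran's Q heterogeneity statistic around the fixed-effect mean.
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)
    # Re-weight with the extra between-study variance added in.
    w_re = [1.0 / (s ** 2 + tau2) for s in ses]
    pooled = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
    return pooled, tau2
```

When the studies disagree by more than their quoted errors, tau^2 comes out positive and the weighting moves toward equal weights, which is roughly the behavior the systematic-offset problem above seems to call for.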

10-07-2008, 1:13 pm

## vlk:

Thanks, Paul, that’s exactly the kind of situation here! Is there anything out there that is of relevance to (and comprehensible to) astronomers?

10-07-2008, 2:29 pm

## Paul Baines:

I don’t know of any classic review paper, but:

Sutton, A.J., and Higgins, J.P.T. (2007) Recent developments in meta-analysis, *Statistics in Medicine*, Vol. 27(5), pp. 625-650

seems like a decent place to start. It has some basics, some history, and plenty of references for anyone interested. There is also a section in the red book (*Bayesian Data Analysis* by Gelman et al.) that has an introduction and an example. Meta-analysis is essentially hierarchical modelling, though; the meta-analysis literature is just more concerned with the specific issues involved in combining different studies. Hope that helps…

10-08-2008, 12:43 am

## brianISU:

I am not sure if it was directly stated above, but one method is to use a finite mixture model. This problem even simplifies things, since the proportion of observations coming from each distribution is known (meaning we won’t need to estimate the proportions as well). Then likelihood methods can be applied to estimate the parameters. With these estimates, any function of them will also be a maximum likelihood estimate, and interval estimates can also be obtained.

10-11-2008, 12:39 pm

## vlk:

Hmm.. I am not sure I understand, could you give some sort of example? From what I understand, one would resort to a mixture model to explain multimodality in the posterior distribution, essentially assigning explanatory power for the different data to different components.

A simple example that Hyunsook came up with to explain the problem: suppose you ask a hundred men and ten women how bright the day was, and suppose the hundred men said it was 5±0.3 (on a scale of 0-10), and the ten women said it was 7±1.0. How bright *was* the day then?

10-13-2008, 1:14 pm

## brianISU:

Let me first comment on how I understand the problem, just to make sure that my recommendation still fits. I understand the problem to be: we have n1 observations from one detector and n2 observations from another. I am also assuming that these observations come from two independent distributions, so the n1 observations come from distribution f1 and, similarly, the n2 observations come from f2. Also, I am assuming that the interest is only in these two detectors and not in a population of detectors (otherwise I would need a different model). So, with these two sets of observations, you would like an estimate of some mean given all the observations. Is that right?

So, under the assumptions above, let N = n1 + n2. Also let pi1 = n1/N and pi2 = n2/N (or some other weighting scheme). Then the finite mixture distribution is:

f(y|theta) = pi1*f1 + pi2*f2 (where theta is a parameter vector containing parameters from f1 and f2)

and the expected value is just the weighted sum of the component expected values.
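Numerically, with hypothetical sample sizes and component means (note that with pi_i = n_i/N this reduces to the pooled mean, i.e., the larger sample again dominates):

```python
# Hypothetical sample sizes and component means.
n1, n2 = 12, 100
mu1, mu2 = 1.00, 1.10

N = n1 + n2
pi1, pi2 = n1 / N, n2 / N

# E[Y] for the mixture is the weighted sum of the component means.
mixture_mean = pi1 * mu1 + pi2 * mu2
```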

Does this help? If not, I can discuss a system reliability example.

10-13-2008, 2:19 pm

## brianISU:

I forgot to add that one can obtain MLEs of theta and plug these into the mean of the finite mixture distribution.

10-13-2008, 2:20 pm

## vlk:

Ah. That’s an interesting problem, and is perhaps related to the background estimation problem that Hyunsook is worrying about. But in this case, f1 is the same as f2, i.e., both detectors are observing the same thing. There is usually no problem if the detectors are perfectly calibrated with each other, but trouble arises when the statistical error is small in some sense compared to the difference between the two and there is no a priori reason to believe one detector over the other.
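Hyunsook’s brightness example from the earlier comment can be worked through numerically with inverse-variance weights, treating each group’s answer as a single estimate with its quoted uncertainty:

```python
# Two group estimates of the same quantity: 5 +- 0.3 and 7 +- 1.0.
estimates = [5.0, 7.0]
sigmas = [0.3, 1.0]

weights = [1.0 / s ** 2 for s in sigmas]
combined = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
# The tighter estimate dominates: combined comes out near 5.17,
# independent of how many people were in each group.
```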

10-13-2008, 4:30 pm