]]>Instead of “confidence interval,” let’s say “uncertainty interval”

I’ve noticed that there are rapidly growing interests and attentions in data mining and machine learning among astronomers but the level of execution is yet rudimentary or partial because there has been no comprehensive tutorial style literature or book for them. I recently introduced a machine learning book written by an engineer. Although it’s a very good book, it didn’t convey the foundation of machine learning built by statisticians. In the quest of searching another good book so as to satisfy the astronomers’ pursuit of (machine) learning methodology with the proper amount of statistical theories, the first great book came along is **The Elements of Statistical Learning**. It was chosen for this writing not only because of its fame and its famous authors (Hastie, Tibshirani, and Friedman) but because of my personal story. In addition, the 2nd edition, which contains most up-to-date and state-of-the-art information, was released recently.

First, the book website:

The Elements of Statistical Learning byHastie, Tibshirani, and Friedman

You’ll find examples, R codes, relevant publications, and plots used in the text books.

Second, I want to tell how I learned about this book before its first edition was published. Everyone has a small moment of meeting very famous people. Mine is shaking hands with President Clinton, in 2000. I still remember the moment vividly because I really wanted to tell him that ice cream was dripping on his nice suit but the top of the line guards blocked my attempt of speaking/pointing icecream dripping with a finger afterward the hand shaking. No matter what context is, shaking hands with one of the greatest presidents is a memorable thing. Yet it was not my cherishing moment because of icecreaming dripping and scary bodyguards. My most cherishing moment of meeting famous people is the half an hour conversation with late Prof. Leo Breinman (click for my two postings about him), author of probability textbook, creator of CART, and the most forefront pioneer in machine learning.

The conclusion of that conversation was a book soon to be published after explaining him my ideas of applying statistics to astronomical data and his advices to each problems. I was not capable to understand every statistics so that his answer about this new coming book at that time was the most relevant and apt one.

This conversation happened during the 3rd Statistical Challenges in Modern Astronomy (SCMA). Not long passed since I began my graduate study in statistics but had an opportunity to assist the conference organizer, my advisor Dr. Babu and to do some chores during the conference. By accident, I read the book by Murtagh about multivariate data analysis, so I wanted to speak to him. Except that, I have no desire to speak renown speakers and attendees. Frankly, I didn’t have any idea who’s who at the conference and a few years later, I realized that the conference dragged many famous people and the density of such people was higher than any conference I attended. Who would have imagine that I could have a personal conversation with Prof. Breiman, at that time. I have seen enough that many famous professors train people during conferences. Getting a chance for chatting some seconds are really hard and tall/strong people push someone small like me away always.

The story goes like this: a sunny perfect early summer afternoon, he was taking a break for a cigar and I finished my errands for the session. Not much to do until the end of session, I decided to take some fresh air and I spotted him enjoying his cigar. Only the worst was that I didn’t know he was the person of CART and the founder of statistical machine learning. Only from his talk from the previous session, I learned he was a statistician, who did data mining on galaxies. So, I asked him if I can join him and ask some questions related to some ideas that I have. One topic I wanted to talk about classification of SN light curves, by that time from astronomical text books, there are Type I & II, and Type I has subcategories, Ia, Ib, and Ic. Later, I heard that there is Type III. But the challenge is observations didn’t happen with equal intervals. There were more data mining topics and the conversation went a while. In the end, he recommended me a book which will be published soon.

Having such a story, a privilege of talking to late Prof. Breiman through an very unique meeting, SCMA, before knowing the fame of the book, this book became one of my favorites. The book, indeed, become popular, around that time, almost only book discussing statistical learning; therefore, it was an excellent textbook for introducing statistics to engineerers and machine learning to statisticians. In the mean time, statistical learning enjoyed popularity in many disciplines that have data sets and urging for learning with the aid of machine. Now books and journals on machine learning, data mining, and knowledge discovery (KDD) became prosperous. I was so delighted to see the 2nd edition in the market to bridge the gap over the years.

I thank him for sharing his cigar time, probably his short free but precious time for contemplation, with me. I thank his patience of spending time with such an ignorant girl with a foreign english accent. And I thank him for introducing a book which will became a bible in the statistical learning community within a couple of years (I felt proud of myself that I access the book before people know about it). Perhaps, astronomers cannot have many joys from this book that I experienced from how I encounter the book, who introduced the book, whether the book was used in a course, how often book is referred, etc. But I assure that it’ll narrow the gap in the notions how astronomers think about data mining (preprocessing, pipelining, and bulding catalogs) and how statisticians treat data mining. The newly released 2nd edition would help narrowing the gap further and assist astronomers to coin brilliant learning algorithms specific for astronomical data. [The END]

—————————– Here, I patch my scribbles about the book.

What distinguish this book from other machine learning books is that not only authors are big figures in statistics but also fundamentals of statistics and probability are discussed in all chapters. Most of machine learning books only introduce elementary statistics and probability in chapter 2, and no basics in statistics is discussed in later chapters. Generally, empirical procedures, computer algorithms, and their results without presenting basic theories in statistics are presented.

You might want to check the book’s website for data sets if you want to try some ideas described there

The Elements of Statistical Learning

In addition to its historical footprint in the field of statistical learning, I’m sure that some astronomers want to check out topics in the book. It’ll help to replace some data analysis methods in astronomy celebrating their centennials sooner or later with state of the art methods to cope with modern data.

This new edition reflects some evolutions in statistical learning whereas the first edition has been an excellent harbinger of the field. Pages quoted from the 2nd edition.

[p.28] Suppose in fact that our data arose from a statistical model $Y=f(X)+e$ where the random error e has E(e)=0 and is independent of X. Note that for this model, f(x)=E(Y|X=x) and in fact the conditional distribution Pr(Y|X) depends on X

onlythrough the conditional mean f(x).

The additive error model is a useful approximation to the truth. For most systems the input-output pairs (X,Y) will not have deterministic relationship Y=f(X). Generally there will be other unmeasured variables that also contribute to Y, including measurement error. The additive model assumes that we can capture all these departures from a deterministic relationship via the error e.

How statisticians envision “model” and “measurement errors” quite different from astronomers’ “model” and “measurement errors” although in terms of “additive error model” they are matching due to the properties of Gaussian/normal distribution. Still, the dilemma of hen or eggs exists prior to any statistical analysis.

[p.30] Although somewhat less glamorous than the learning paradigm, treating supervised learning as a problem in function approximation encourages the geometrical concepts of Euclidean spaces and mathematical concepts of probabilistic inference to be applied to the problem. This is the approach taken in this book.

Strongly recommend to read chapter 3, Linear Methods for Regression: In astronomy, there are so many important coefficients from regression models, from Hubble constant to absorption correction (temperature and magnitude conversion is another example. It seems that these relations can be only explained via OLS (ordinary least square) with the homogeneous error assumption. Yet, books on regressions and linear models are not generally thin. As much diversity exists in datasets, more amount of methodology, theory and assumption exists in order to reflect that diversity. One might like to study the statistical properties of these indicators based on mixture and hierarchical modeling. Some inference, say population proportion can be drawn to verify some hypotheses in cosmology in an indirect way. Understanding what regression analysis and assumptions and how statistician efforts made these methods more robust and interpretable, and reflecting reality would change forcing E(Y|X)=aX+b models onto data showing correlations (not causality).

]]>When I was teaching statistics, despite undergraduate courses, there were both undergraduate and graduate students of various fields except astrophysics majors. I wondered why they were not encouraged to take some basic statistics whereas they were encouraged to take some computer science courses. As there are many astronomers good at programming and designing tools, I’m sure that recommending students to take statistics courses will renovate astronomical data analysis procedures (beyond Bevington’s book) and hind theories (statistics and mathematics per se, not physics laws).

Here’s an interesting lecture for developing a curriculum for the new era in computer science and why the basic probability theory and statistics is important to raise versatile computer scientists. It could be a bit out dated now because I saw it several months ago.

About a little more than the half way through the lecture, he emphasizes that probability course partaking the computer science curriculum. I wonder any astronomy professor has similar arguments and stresses for any needs of basic probability theories to be learned among young future astrophysicists in order to prevent many statistics misuses appearing in astronomical literature. Particularly confusions between fitting (estimating) and inference (both model assessment and uncertainty quantification) are frequently observed in literature where authors claim their superior statistics and statistical data analysis. I personally sometimes attribute this confusion to the lack of distinction between what is random and what is deterministic, or strong believe in their observed and processed data absent from errors and probabilistic nature.

Many introductory books introduce very interesting problems many of which have some historical origins to introduce probability theories (many anecdotes). One can check out the very basics, **probability axioms**, ** and ****measurable function** from wikipedia. With examples, probability is high school or lower level math that you already know but with jargon you’ll like to recite lexicons many times so that you are get used to foundations, basics, and their theories.

We often mention **measurable** to discuss random variables, uncertainties, and distributions without verbosity. “Assume **measurable space** … ” saves multiple paragraphs in an article and changes the structure of writing. This short adjective implies so many assumptions depending on statistical models and equations that you are using for best fits and error bars.

Consider a LF, that is truncated due to observational limits. The common practice I saw is drawing a histogram in a way that the adaptive binning makes the overall shape reflecting a partial bell shape curve. Thanks to its smoothed look, scientists impose a gaussian curve to partially observed data and find parameter estimates that determine the shape of this gaussian curve. There is no imputation step to fake unobserved points to comprise the full probability space. The parameter space of gaussian curves frequently does not coincide with the physically feasible space; however, such discrepancy is rarely discussed in astronomical literature and subsequent biased results look like a taboo.

Although astronomers emphasize the importance of uncertainties, factorization nor stratification of uncertainties has never been clear (model uncertainty, systematic uncertainty or bias, statistical uncertainties or variance). Hierarchical relationships or correlations among these different uncertainties are never addressed in a full measure. Basics of probability theory and the understanding of random variables would help to characterize uncertainties both in mathematical sense and astrophysical sense. This knowledge also assist appropriate quantification of these characterized uncertainties.

Statistical models are rather simple compared to models of astrophysics. However, statistics is the science of understanding uncertainties and randomness and therefore, some strategies of transcribing from complicated astrophysical models into statistical models, in order to reflect the probabilistic nature of observed (or parameters, for Bayesian modeling), are necessary. Both raw or processed data manifest the behavior of random variables. Their underlying processes determine not only physics models but also statistical models written in terms of random variables and the link functions connecting physics and uncertainties. To my best understanding, bridging and inventing statistical models for astrophysics researches seem tough due to the lack of awareness of basics of probability theory.

Once I had a chance to observe a Decadal survey meeting, which covered so diverse areas in astronomy. They discussed new projects, advancing current projects, career developments, and a little bit about educating professional astronomers apart from public reach (which often receives more importance than university curriculum. I also believe that wide spread public awareness of astronomy is very important). What I missed while I observing the meeting is that interdisciplinary knowledge transferring efforts to broaden the field of astronomy and astrophysics nor curriculum design ideas. Because of its long history, I thought astronomy is a science of everything. Marching a path for a long time made astronomy more or less the most isolated and exclusive science.

Perhaps asking astronomy majors taking multiple statistics courses is too burdensome; therefore being taught by faculty who are specialized in (statistical) data analysis organizes a data analysis course and incorporates several hours of basic probability is more realistic and what I anticipate. With a few hours of bringing fundamental notions in random variables and probability, the claims of “statistical rigorous methods and powerful results” will become more appropriate. Currently, statistics is science but in astronomy literature, it looks more or less like an adjective that modify methods and results like “powerful”, “superior”, “excellent”, “better”, “useful,” and so on. Basics of probability is easily incorporated into introduction of algorithms in designing experiments and optimization methods, which are currently used in a brute force fashion^{[1]}.

Occasionally, I see gems from arxiv written by astronomers. Their expertise in astronomy and their interest in statistics has produced intriguing accounts for statistically rigorous data analysis and inference procedures. Their papers includes explanation of fundamentals of statistics and probability more appropriate to astronomers than statistics textbooks for scientists and engineers of different fields. I wish more astronomers join this venture knowing basics and diversities of statistics to rectify many unconscious misuses of statistics while they argue that their choice of statistics is the most powerful one thanks to plausible results.

- What I mean by a brute force fashion is that trying all methods listed in the software manual, and then later, stating that the method A gave most plausible values that matches with data in a scatter plot

My astronomer neighbor mentioned this book a while ago and quite later I found intriguing quotes.

GIGO:Garbage in; garbage out. Fancy statistical methods will not rescue garbage data.Course notes of Raymond J. Carroll (2001)

I often see a statement like *data were grouped/binned to improve statistics.* This seems hardly true unless the astronomer knows the true underlying population distribution from which those realizations (either binned or unbinned) are randomly drawn. Nonetheless, smoothing/binning (modifying sample) can help hypothesis testing to infer the population distribution. This validation step is often ignored, though. For the righteous procedures of statistics application, I hope astronomers adopt the concepts in the design of experiments to collect good quality data without wasting resources. What I mean by wasting resources is that, due to the instrumental and atmospheric limitations, indefinite exposure is not necessary to collect good quality image. Instead of human eye inspection, machine can do the job. I guess that minimax type optimal points exist for operating telescopes, feature extraction/detection, or model/data quality assessment. Clarifying the sources of uncertainty and stratifying them for testing, sampling, and modeling purposes as done in analysis of variance is quite unexplored in astronomy. Instead, more efforts go to salvaging garbage and so far, many gems are harvested by tremendous amount of efforts. But, I’m afraid that it could get as much painful as gold miners’ experience during the mid 19th century gold rush.

Interval Estimates(p.51)

A common error is to specify a confidence interval in the form (estimate – k*standard error, estimate+k*standard error). This form is applicable only when an interval estimate is desired for the mean of a normally distributed random variable. Even then k should be determined for tables of the Student’s t-distribution and not from tables of the normal distribution.

How to get appropriate degrees of freedom seems most relevant to avoid this error when estimates are the coefficients of complex curves or equation/model itself. The t-distribution with a large d.f. (>30) is hardly discernible from the z-distribution.

Desirable estimators are impartial,consistent, efficient, robust, and minimum loss. Interval estimates are to be preferred to point estimates; they are less open to challenge for they convey information about the estimate’s precision.

Every Statistical Procedure Relies on Certain Assumptionsfor correctness.

What I often fail to find from astronomy literature are these assumptions. Statistics is not elixir to every problem but works only on certain conditions.

Know your objectives in testing. Know your data’s origins. Know the assumptions you feel comfortable with. Never assign probabilities to the true state of nature, but only to the validity of your own predictions. Collecting more and better data may be your best alternative

Unfortunately, the last sentence is not an option for astronomers.

From

Guidelines for a Meta-Analysis

Kepler was able to formulate his laws only because (1) Tycho Brahe has made over 30 years of precise (for the time) astronomical observations and (2) Kepler married Brahe’s daughter and thus gained access to his data.

Not exactly same but it reflects some reality of contemporary. Without gaining access to data, there’s not much one can do and collecting data is very painstaking and time consuming.

From

Estimating Coefficient

…Finally, if the error terms come from a distribution that is far from Gaussian, a distribution that is truncated, flattened or asymmetric, the p-values and precision estimates produced by the software may be far from correct.

Please, double check numbers from your software.

To quote Green and Silverman (1994, p. 50), “There are two aims in curve estimation, which to some extent conflict with one another, to maximize goodness-of-fit and to minimize roughness.

Statistically significant findings should serve as a motivation for further corroborative and collateral research rather than as a basis for conclusions.

To be avoided are a recent spate of proprietary algorithms available solely in software form that guarantee to find a best-fitting solution. In the worlds of John von Neumann, “With four parameters I can fit an elephant and with five I can make him wiggle his trunk.” Goodness of fit is no guarantee of predictive success, …

If physics implies wiggles, then there’s nothing wrong with an extra parameter. But it is possible that best fit parameters including these wiggles might not be the ultimate answer to astronomers’ exploration. It can be just a bias due to introducing this additional parameter for wiggles in the model. Various statistical tests are available and caution is needed before reporting best fit parameter values (estimates) and their error bars.

]]>First of all, I’d like to ask how you would like to estimate the chance of having nukes in a country? What this 80% implies here? But, before getting to the question, I’d like to discuss computing the chance of e coli infection, first.

From the frequentists perspective, computing the chance of e coli infection is investigating sample of lettuce and counts species that are infected: n is the number of infected species and N is the total sample size. 1% means one among 100. Such percentage reports and their uncertainties are very familiar scene during any election periods for everyone. From Bayesian perspective, Pr(p|D) ~ L(D|p) pi(p), properly choosing likelihoods and priors, one can estimate the chance of e coli infection and uncertainty. Understanding of sample species and a prior knowledge helps to determine likelihoods and priors.

How about the chance that country A has nukes? Do we have replicates of country A so that a committee investigate each country and count ones with nukes to compute the chance? We cannot do that. Traditional frequentist approach, based on counting, does not work here to compute the chance. Either using **fiducial likelihood** approach or **Bayesian ** approach, i.e. carefully choosing a likelihood function adequately (priors are only for Bayesian) allows one to compuate such probability of interest. In other words, those computed chances highly depend on the choice of model and are very subjective.

So, here’s my concern. It seems like that astronomers want to know the chance of their spectral data being described by a model (A*B+C)*D (each letter stands for one of models such as listed in Sherpa Models). This is more like computing the chance of having nukes in country A, not counting frequencies of the event occurrence. On the other hand, p-value from goodness of fit tests, LRTs, or F-tests is a number from the traditional frequentists’ counting approach. In other words, p-value accounts for, under the null hypothesis (the (A*B+C)*D model is the right choice so that residuals are Gaussian), how many times one will observe the event (say, reduced chi^2 >1.2) if the experiments are done N times. The problem is that we only have one time experiment and that one spectrum to verify the (A*B+C)*D is true. Goodness of fit or LRT only tells the goodness or the badness of the model, not the statistically and objectively quantified chance.

In order to know the chance of the model (A*B+C)*D, like A has nuke with p%, one should not rely on p-values. If you have multiple models, one could compute pairwise relative chances i.e. odds ratios, or Bayes factors. However this does not provide the uncertainty of the chance (astronomers have the tendency of reporting uncertainties of any point estimates even if the procedure is statistically meaningless and that quantified uncertainty is not statistical uncertainty, as in using delta chi^2=1 to report 68% confidence intervals). There are various model selection criteria that cater various conditions embedded in data to make a right model choice among other candidate models. In addition, post-inference for astronomical models is yet a very difficult problem.

In order to report the righteous chance of (A*B+C)*D requires more elaborated statistical modeling, always brings some fierce discussions between frequentists and Bayesian because of priors and likelihoods. Although it can be very boring process, I want astronomers to leave the problem to statisticians instead of using inappropriate test statistics and making creative interpretation of statistics.

Please, keep this question in your mind when you report probability: what kind of chance are you computing? The chance of e coli infection? Or the chance that A has nukes? Make sure to understand that p-values from data analysis packages does not tell you that the chance the model (A*B+C)*D is (one minus p-value)%. You don’t want to report one minus p-value from a chi-square test statistic as the chance that A has nukes.

]]>The biggest challenge for a statistician to use astronomical data is the lack of mercy for nonspecialists’ accessing data including format, quantification, and qualification^{[2]} ; and data analysis systems. IDL is costly although it is used in many disciplines and other tools in astronomy are hardly utilized for different projects.^{[3]} In that regards, I welcome astronomers using python to break such exclusiveness in astronomical data analysis systems.

Even if data and software issues are resolved, there’s another barrier to climb. **Validation.** If you have a catalog, you’ll see variables of measures, and their errors typically reflecting the size of PSF and its convolution to those metrics. If a model of gaussian assumption applied, in order to tabulate power law index, King’s, Petrosian’s, or de Vaucouleurs’ profile index, and numerous metrics, I often fail to find any validation of gaussian assumptions, gaussian residuals, spectral and profile models, outliers, and optimal binning. Even if a data set is publicly available, I also fail to find how to read in raw data, what factors must be considered, and what can be discarded because of unexpected contamination occurred like cosmic rays and charge over flows. How would I validate raw data that are read into a data analysis system is correctly processed to match values in catalogs? How would I know all entries in catalog are ready for further scientific data analysis? Are those sources real? Is p-value appropriately computed?

I posted an article about Chernoff faces applied to Capella observations from Chandra. Astronomers already processed the raw data and published a catalog of X-ray spectra. Therefore, I believe that the information in the catalog is validated and ready to be used for scientific data analysis. I heard that repeated Capella observation is for the calibration. Generally speaking, in other fields, targets for calibration are almost time invariant and exhibit consistency. If Capella is a same star over the 10 years, the faces in my post should look almost same, within measurement error; but as you saw, it was not consistent at all. Those faces look like observations were made toward different objects. So far I fail to find any validation efforts, explaining why certain ObsIDs of Capella look different than the rest. Are they real Capella? Or can I use this inconsistent facial expression as an evidence that Chandra calibration at that time is inappropriate? Or can I conclude that Capella was a wrong choice for calibration?

Due to the lack of quantification procedure description from the raw data to the catalog, what I decided to do was accessing the raw data and data processing on my own to crosscheck the validity in the catalog entries. The benefit of this effort is that I can easily manipulate data for further statistical inference. Although reading and processing raw data may sound easy, I came across another problem, **lack of documentation** for nonspecialists to perform the task.

A while ago, I talked about read.table() in R. There are slight different commands and options but without much hurdle, one can read in ascii data in various styles easily with read.table() for **exploratory data analysis** and **confirmatory data analysis** with R. From my understanding, statisticians do not spend much time on reading in data nor collecting them. We are interested in methodology to extract information of the population based on sample. While the focus is methodology, all the frustrations with astronomical data analysis softwares occur prior to investigating the best method. The level of frustration reached to the extend of terminating my eagerness for more investigation about inference tools.

In order to assess those Capella observations, thanks to its on-site help, I evoke ciao. Beforehand, I’d like to disclaim that I exemplify ciao to illustrate the culture difference that I experienced as a statistician. It was used to discuss why I think that astronomical data analysis systems are short of documentations and why that astronomical data processing procedures are lack of validation. I must say that I confront very similar problems when I tried to learn astronomical packages such as IRAF and AIPS. Ciao happened to be at handy when writing this post.

In order to understand X-ray data, not only image data files, one also needs effective area (arf), redistribution matrix (rmf), and point spread function (psf). These files are called by calibration data files. If the package was developed for general users, like read.table() I expect there should be a homogenized/centralized data including calibration data reading function with options. Instead, there were various kinds of functions one can use to read in data but the description was not enough to know which one is doing what. What is the functionality of these commands? Which one only stores names of data file? Which one reconfigures the raw data reflecting up to date calibration file? Not knowing complete data structures and classes within ciao, not getting the exact functionality of these data reading functions from ahelp, I was not sure the log likelihood that I computed is appropriate or not.

For example, there are five different ways to associate an arf: read_arf(), load_arf(), set_arf(), get_arf(), and unpack_arf() from ciao. Except unpack_arf(), I couldn’t understand the difference among these functions for accessing an arf^{[4]} Other softwares including XSPEC that I use, in general, have a single function with options to execute different level of reading in data. Ciao has an extensive web documentation without a tutorial (see my post). So I read all **ahelp “commands”** a few times. But I still couldn’t decide which one to use for my work to read in arfs and rmfs (I happened to have many calibration data files).

arf | rmf | psf | pha | data | |
---|---|---|---|---|---|

get | get_arf | get_rmf | get_psf | get_pha | get_data |

set | set_arf | set_rmf | set_psf | set_pha | set_data |

unpack | unpack_arf | unpack_rmf | unpack_psf | unpack_pha | unpack_data |

load | load_arf | load_rmf | load_psf | load_pha | load_data |

read | read_arf | read_rmf | read_psf | read_pha | read_data |

[Note that above links may not work since ciao documentation website evolves quickly. Some might be routed to different links so please, check this website for other data reading commands: cxc.harvard.edu/sherpa/ahelp/index_alphabet.html].

So, I decide to seek for a help through **cxc help desk** several months back. Their answers are very reliable and prompt. My question was “what are the difference among *read_xxx(), load_xxx(), set_xxx(), get_xxx(), and unpack_xxx(),* where *xxx can be data, arf, rmf, and psf?*” The answer to this question was that

You can find detailed explanations for these Sherpa commands in the “ahelp” pages of the Sherpa website:

http://cxc.harvard.edu/sherpa/ahelp/index_alphabet.html

This is a good answer but a big cultural shock to a statistician. It’s like having an answer like “check http://www.r-project.org/search.html and http://cran.r-project.org/doc/FAQ/R-FAQ.html” for IDL users to find out the difference between read.table() and scan(). Probably, for astronomers, all above various data reading commands are self explanatory like R having read.table(), read.csv(), and scan(). Disappointingly, this answer was not I was looking for.

Well, thanks to this embezzlement, hesitation, and some skepticism, I couldn’t move to the next step of implementing fitting methods. At the beginning, I was optimistic when I found out that Ciao 4.0 and up is python compatible. I thought I could do things more in statistically rigorous ways since I can fake spectra to **validate** my fitting methods. I was thinking about modifying the indispensable chi-square method that is used twice for point estimation and hypothesis testing that introduce bias (a link made to a posting). My goal was make it less biased and robust, less sensitive iid Gaussian residual assumptions. Against my high expectation, I became frustrated at the first step, reading and playing with data to get a better sense and to develop a quick intuition. I couldn’t even make a baby step to my goal. I’m not sure if it a good thing or not, but I haven’t been completely discouraged. Also, time helps gradually to overcome this culture difference, **the lack of documentation.**

What happens in general is that, if a predecessor says, use “set_arf(),” then the apprentice will use “set_arf()” without doubts. If you begin learning on your own purely relying on documentations, I guess at some point you have to make a choice. One can make a lucky guess and move forward quickly. Sometimes, one can land on miserable situation because one is not sure about his/her choice and one cannot trust the features appeared after these processing. I guess it is natural to feel curiosity about what each of these commands is doing to your data and what information is carried over to the next commands in analysis procedures. It seems righteous to know what command is best for the particular **data processing and statistical inference** given the data. What I found is that such comparison across commands is missing in documentations. This is why I thought astronomical data analysis systems are short of mercy for nonspecialists.

Another thing I observed is that there seems no documentation nor standard procedure to create the repeatable data analysis results. My observation of astronomers says that with the same raw data, the results by scientist A and B are different (even beyond statistical margins). There are experts and they have knowledge to explain why results are different on the same raw data. However, not every one can have luxury of consulting those few experts. I cannot understand such exclusiveness instead of standardizing the procedures through validations. I even saw that the data that A analyzed some years back can be different from this year’s when he/she writes a new proposal. I think that the time for recreating the data processing and inference procedure to explain/justify/validate the different results or to cover/narrow the gap could have not been wasted if there are standard procedures and its documentation. This is purely a statistician’s thought. As the comment in *where is ciao X?*^{[5]} not every data analysis system has to have similar design and goals.

Getting lost while figuring out basics (handling, arf, rmf, psf, and numerous case by case corrections) prior to applying any simple statistics has been my biggest obstacle in learning astronomy. The lack of documenting validation methods often brings me frustration. I wonder if there’s any astronomers who lost in learning statistics via R, minitab, SAS, MATLAB, python, etc. As discussed in **where is ciao X?** I wish there is a centralized tutorial that offers basics, like how to read in data, how to do manipulate datum vector and matrix, how to do arithmetics and error propagation adequately not violating assumptions in statistics (I don’t like the fact that the point estimate of background level is subtracted from observed counts, random variable when the distribution does not permit such scale parameter shifting), how to handle and manipulate fits format files from Chandra for various statistical analysis, how to do basic image analysis, how to do basic spectral analysis, and so on with references^{[6]}

- This is quite an overdue posting. Links and associated content can be outdated.
- For the classification purpose, data with clear distinction between response and predictor variables so called a training data set must be given. However, I often fail to get processed data sets for statistical analysis. I first spend time to read data and question what is outlier, bias, or garbage. I’m not sure how to clean and extract numbers for statistical analysis and every sub-field in astronomy have their own way to clean to be fed into statistics and scatter plots. For example, image processing is still executed case by case via trained eyes of astronomers. On the other hand, in medical imaging diagnosis specialists offer training sets with which scientists in computer vision develop algorithms for classification. Such collaboration yields accelerated, automatic but preliminary diagnosis tools. A small fraction of results from these preliminary methods still can be ambiguous, i.e. false positive or false negative. Yet, when such ambiguous cancerous cell images at the decision boundaries occur, specialists like trained astronomers scrutinize those images to make a final decision. As medical imaging and its classification algorithms resolve the issue of expert shortage under overflowing images, I wish astronomers adopt their strategies to confront massive streaming images and to assist sparse trained astronomers
- Something I like to see is handling background statistically in high energy astrophysics. When simulating a source, background can be simulated as well via Makov Random field, kriging, and other spatial statistics methods. In reality, background is subtracted once in measurement space and the random nature of background is not interactively reflected. Regardless of available statistical methodology to reflect the nature of background, it is difficult to implement it for trial and validation because those tools are not compatible for adding statistical modules and packages.
- A Sherpa expert told me there is an FAQ (I failed to locate previously) on this matter. However, from data analysis perspective like a distinction between data.structure, vector, matrix, list and other data types in R, the description is not sufficient for someone who wants to learn ciao and to perform scientific (both deterministic or stochastic) data analysis via scripting i.e. handling objects appropriately. You might want to read comparing commands in Sharpa from Shepa FAQ
- I know there is ciaox. Apart from the space between ciao and X, there is another difference that astronomers do not care much compared to statisticians: the difference between X and x. Typically, the capital letter is for random variable and lower case letters for observation or value
- By the way, there are ciao workshop materials available that could function as tutorials. Please, locate them if needed.

[arXiv:math.ST:0907.4728]

A survey of cross validation procedures for model selectionby Sylvain Arlot

Nonetheless, I’ll not review the paper itself except some quotes:

-CV is a popular strategy for model selection, and algorithm selection.

-Compared to the resubstitution error, CV avoids overfitting because the training sample is independent from the validation sample.

-A noticed in the early 30s by Larson (1931), training an algorithm and evaluating its statistical performance on the same data yields an overoptimistic results.

There are books on statistical resampling methods covering more general topics, not limited to model selection. Instead, I decide to do a little search how CV is used in astronomy. These are the ADS search results. More publications than I expected.

- Kernel regression for determining photometric redshifts from Sloan broad-band photometry [arXiv:0706.2704]

Wang, D.; Zhang, Y. X.; Liu, C.; Zhao, Y. H.

Monthly Notices of the Royal Astronomical Society, Volume 382, Issue 4, pp. 1601-1606 (2007) - STECKMAP: STEllar Content and Kinematics from high resolution galactic spectra via Maximum A Posteriori [arXiv:0507002]

Ocvirk, P.; Pichon, C.; Lançon, A.; Thiébaut, E.

Monthly Notices of the Royal Astronomical Society, Volume 365, Issue 1, pp. 74-84 (2006) - STECMAP: STEllar Content from high-resolution galactic spectra via Maximum A Posteriori [arXiv:0505209]

Ocvirk, P.; Pichon, C.; Lançon, A.; Thiébaut, E.

Monthly Notices of the Royal Astronomical Society, Volume 365, Issue 1, pp. 46-73 (2006) - Automated Detection of Classical Novae with Neural Networks [arXiv:0604236]

Feeney, S. M et al.

The Astronomical Journal, Volume 130, Issue 1, pp. 84-94 (2005) - Estimation of regularization parameters in multiple-image deblurring[arxiv:0405545]

Vio, R.et al.

Astronomy and Astrophysics, v.423, p.1179-1186 (2004) - Machine learning and image analysis for morphological galaxy classification

de la Calleja, Jorge and Fuentes, Olac

Monthly Notices of the Royal Astronomical Society, Volume 349, Issue 4, pp. 87-93 (2004) - Ensembles of Classifiers for Morphological Galaxy Classification

Bazell, D.; Aha, David W.

The Astrophysical Journal, Volume 548, Issue 1, pp. 219-223.(2001) - Bayesian image reconstruction with space-variant noise suppression

Nunez, J.; Llacer, J.

Astronomy and Astrophysics Supplement, v.131, p.167-180 (1998) - Estimating the sun’s rotation from solar oscillations by regularisation

Thompson, A. M.

Astronomy and Astrophysics (ISSN 0004-6361), vol. 265, no. 1, p. 289-295. (1992)

One can easily grasp that many adopted CV under the machine learning context. The application of CV, and bootstrapping is not limited to machine learning. As Arlot’s title, CV is used for model selection. When it come to model selection in high energy astrophysics, not CV but reduced chi^2 measures and fitted curve eye balling are the standard procedure. Hopefully, a renovated model selection procedure via CV or other statistically robust strategy soon challenge the reduced chi^2 and eye balling. On the other hand, I doubt that it’ll come soon. Remember, eyes are the best classifier so it won’t be a easy task.

]]>因为有了统计

我可以把天上的星星重新梳理

Because of statistics

I can rearrange the stars in the skies above

Indeed. Especially so when the PSF is broad and the stars overlap.

(via)

]]>I would however change the emphasis away from Gaussians to Poisson.

**PS:** Apparently I can’t embed videos here. Anyway, his thesis is that mathematical education over the past century has concentrated entirely on taking students to the pinnacle of being able to do Calculus and leave them there high and dry. Instead, he suggests that the ultimate goal of mathematical education should be statistics, which is of more practical use to more people. I don’t think he means to suggest that Calculus should not be taught — I couldn’t even understand propagation of errors without Calculus — but rather that the emphasis must shift to a more practical goal. I was brought up to believe that all the world is a partial differential equation, but even I can see that this is a sensible suggestion.

**A Fast Thresholded Landweber Algorithm for Wavelet-Regularized Multidimensional Deconvolution**

Vonesch and Unser (2008)

IEEE Trans. Image Proc. vol. 17(4), pp. 539-549

Quoting the authors, I also like to say that __the recovery of the original image from the observed is an ill-posed problem__. They traced the efforts of wavelet regularization in deconvolution back to a few relatively recent publications by astronomers. Therefore, I guess the topic and algorithm of this paper could drag some attentions from astronomers.

They explain the wavelet based reconstruction procedure in a simple term. The matrix-vector product w_{x}= Wx yields the coefficients of x in the wavelet basis, and W^{T}Wx reconstructs the signal from these coefficients.

Their assumed model is

y=Hx_{orig}+ b,

where y and x_{orig} are vectors containing uniform samples of the original and measured signals; b represents the measurement error. H is a square (block) circulant matrix that approximates the convolution with the PSF. Then, the problem of deconvolution is to find an estimate that maximizes the cost function

J(x) = J_{data}(x)+ λ J_{reg}(x)

They described that “__this functional can also interpreted as a (negative) log-likelihood in a Bayesian statistical framework, and deconvolution can then be seen as a maximum a posteriori (MAP) estimation problem.__” Also the description of the cost function is applicable to the frequently appearing topic in regression or classification problems such as ridge regression, quantile regression, LASSO, LAR, model/variable selection, state space models from time series and spatial statistics, etc.

The observed image is the d-dimensional covolution of an origianl image (the characteristic function of the object of interest) with the

impulse response (or PSF).of the imaging system.

The notion of regularization or penalizing the likelihood seems not well received among astronomers based on my observation that often times the chi-square minimization (the simple least square method) without penalty is suggested and used in astronomical data analysis. Since image analysis with wavelets popular in astronomy, the fast algorithm for wavelet regularized variational deconvolution introduced in this paper could bring faster results to astronomers and could offer better insights of the underlying physical processes by separating noise and background more in a model according fashion, not simple background subtraction.

]]>