I’ve noticed rapidly growing interest in data mining and machine learning among astronomers, but the level of execution is still rudimentary or partial because there has been no comprehensive tutorial-style book for them. I recently introduced a machine learning book written by an engineer. Although it is a very good book, it does not convey the foundations of machine learning built by statisticians. In the search for another good book that satisfies astronomers’ pursuit of (machine) learning methodology with the proper amount of statistical theory, the first great book that came along is **The Elements of Statistical Learning**. I chose it for this post not only because of its fame and its famous authors (Hastie, Tibshirani, and Friedman) but also because of my personal story. In addition, the 2nd edition, which contains up-to-date, state-of-the-art material, was released recently.

First, the book website:

The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman

You’ll find examples, R code, relevant publications, and plots used in the textbook.

Second, I want to tell how I learned about this book before its first edition was published. Everyone has a small moment of meeting a very famous person. Mine is shaking hands with President Clinton in 2000. I still remember the moment vividly because I really wanted to tell him that ice cream was dripping on his nice suit, but the top-of-the-line guards blocked my attempt to speak or point a finger at the dripping ice cream after the handshake. Whatever the context, shaking hands with one of the greatest presidents is a memorable thing. Yet it was not my most cherished moment, because of the dripping ice cream and the scary bodyguards. My most cherished moment of meeting a famous person is the half-hour conversation with the late Prof. Leo Breiman (see my two postings about him), author of a probability textbook, creator of CART, and the foremost pioneer in machine learning.

The conclusion of that conversation, after I explained my ideas for applying statistics to astronomical data and he gave his advice on each problem, was a book soon to be published. I was not capable of understanding all the statistics involved, so his answer about this forthcoming book was the most relevant and apt one at the time.

This conversation happened during the 3rd Statistical Challenges in Modern Astronomy (SCMA) conference. Not long had passed since I began my graduate study in statistics, but I had an opportunity to assist the conference organizer, my advisor Dr. Babu, and to do some chores during the conference. By accident, I had read the book by Murtagh on multivariate data analysis, so I wanted to speak to him. Apart from that, I had no desire to speak to renowned speakers and attendees. Frankly, I had no idea who was who at the conference, and only a few years later did I realize that the conference had drawn many famous people and that the density of such people was higher than at any other conference I have attended. Who would have imagined that I could have a personal conversation with Prof. Breiman at that time? I have seen often enough that famous professors draw a train of people during conferences. Getting a chance to chat for even a few seconds is really hard, and tall, strong people always push someone small like me away.

The story goes like this: on a perfect, sunny early summer afternoon, he was taking a break for a cigar and I had finished my errands for the session. With not much to do until the end of the session, I decided to get some fresh air and spotted him enjoying his cigar. The worst part was that I didn’t know he was the person behind CART and the founder of statistical machine learning. I had learned only from his talk in the previous session that he was a statistician who did data mining on galaxies. So I asked him if I could join him and ask some questions related to ideas that I had. One topic I wanted to talk about was the classification of supernova (SN) light curves. According to the astronomy textbooks of that time, there are Type I and Type II, and Type I has subcategories Ia, Ib, and Ic. Later, I heard that there is a Type III. The challenge is that the observations were not made at equal intervals. There were more data mining topics, and the conversation went on for a while. In the end, he recommended a book that would be published soon.

With such a story, the privilege of talking to the late Prof. Breiman through a very unique meeting, SCMA, before knowing the fame of the book, this book became one of my favorites. The book indeed became popular; around that time it was almost the only book discussing statistical learning, and therefore it was an excellent textbook for introducing statistics to engineers and machine learning to statisticians. In the meantime, statistical learning has enjoyed popularity in many disciplines that have data sets and an urge to learn from them with the aid of machines. Books and journals on machine learning, data mining, and knowledge discovery (KDD) now prosper. I was delighted to see the 2nd edition on the market to bridge the gap over the years.

I thank him for sharing his cigar time, probably his short but precious free time for contemplation, with me. I thank him for his patience in spending time with such an ignorant girl with a foreign English accent. And I thank him for introducing a book that would become a bible in the statistical learning community within a couple of years (I felt proud of myself for getting to know the book before other people knew about it). Perhaps astronomers cannot get from this book the many joys that I experienced from how I encountered it, who introduced it, whether it was used in a course, how often it is referred to, etc. But I assure you that it’ll narrow the gap between how astronomers think about data mining (preprocessing, pipelining, and building catalogs) and how statisticians treat data mining. The newly released 2nd edition should help narrow the gap further and assist astronomers in devising brilliant learning algorithms specific to astronomical data. [The END]

—————————– Here, I patch my scribbles about the book.

What distinguishes this book from other machine learning books is not only that the authors are big figures in statistics but also that the fundamentals of statistics and probability are discussed in all chapters. Most machine learning books introduce elementary statistics and probability only in chapter 2 and discuss no statistical basics in later chapters. Generally, they present empirical procedures, computer algorithms, and their results without the underlying statistical theory.

You might want to check the book’s website for data sets if you want to try some of the ideas described there.

The Elements of Statistical Learning

In addition to its historical footprint in the field of statistical learning, I’m sure that some astronomers will want to check out the topics in the book. It’ll help to replace some data analysis methods in astronomy, which will celebrate their centennials sooner or later, with state-of-the-art methods that cope with modern data.

This new edition reflects some of the evolution in statistical learning, whereas the first edition was an excellent harbinger of the field. The pages quoted below are from the 2nd edition.

[p.28] Suppose in fact that our data arose from a statistical model $Y=f(X)+e$ where the random error e has E(e)=0 and is independent of X. Note that for this model, f(x)=E(Y|X=x) and in fact the conditional distribution Pr(Y|X) depends on X only through the conditional mean f(x).

The additive error model is a useful approximation to the truth. For most systems the input-output pairs (X,Y) will not have a deterministic relationship Y=f(X). Generally there will be other unmeasured variables that also contribute to Y, including measurement error. The additive model assumes that we can capture all these departures from a deterministic relationship via the error e.
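To make the quoted statement concrete, here is a minimal simulation sketch of the additive error model. The function `f`, the noise level, and the sample size are my own made-up choices for illustration, not anything from the book. It draws pairs from Y=f(X)+e and checks that averaging Y over points with X near some x0 lands close to f(x0), which is what f(x)=E(Y|X=x) says.

```python
import random


def f(x):
    # Hypothetical "true" regression function, chosen only for illustration.
    return x * x


def simulate(n, sigma=0.5, seed=0):
    """Draw (x, y) pairs from Y = f(X) + e with E(e) = 0 and e independent of X."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.uniform(0.0, 3.0)
        e = rng.gauss(0.0, sigma)
        pairs.append((x, f(x) + e))
    return pairs


def local_mean(pairs, x0, half_width=0.05):
    """Average y over points with x near x0: a crude estimate of E(Y | X = x0)."""
    ys = [y for x, y in pairs if abs(x - x0) <= half_width]
    return sum(ys) / len(ys)


pairs = simulate(100_000)
for x0 in (0.5, 1.5, 2.5):
    # Print x0, the true f(x0), and the local average of the simulated y values.
    print(x0, f(x0), round(local_mean(pairs, x0), 3))
```

The local average is only a rough stand-in for the conditional mean; it is used here because it needs no modeling assumptions beyond the ones in the quote.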

How statisticians envision “model” and “measurement errors” is quite different from astronomers’ “model” and “measurement errors”, although in terms of the “additive error model” the two match thanks to the properties of the Gaussian (normal) distribution. Still, the chicken-or-egg dilemma exists prior to any statistical analysis.

[p.30] Although somewhat less glamorous than the learning paradigm, treating supervised learning as a problem in function approximation encourages the geometrical concepts of Euclidean spaces and mathematical concepts of probabilistic inference to be applied to the problem. This is the approach taken in this book.

I strongly recommend reading chapter 3, Linear Methods for Regression. In astronomy, there are so many important coefficients from regression models, from the Hubble constant to absorption corrections (temperature and magnitude conversion is another example). It seems that these relations are explained only via OLS (ordinary least squares) with the homogeneous error assumption. Yet books on regression and linear models are generally not thin. The more diversity exists in data sets, the more methodology, theory, and assumptions exist to reflect that diversity. One might like to study the statistical properties of these indicators based on mixture and hierarchical modeling. Some inferences, say on a population proportion, can be drawn to verify hypotheses in cosmology in an indirect way. Understanding what regression analysis and its assumptions are, and how statisticians’ efforts have made these methods more robust, more interpretable, and closer to reality, would change the habit of forcing E(Y|X)=aX+b models onto data showing correlations (not causality).
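As a small illustration of what the homogeneous error assumption does, here is a sketch of a straight-line fit that can either weight all points equally (OLS) or weight each point by the inverse of its error variance (weighted least squares). Weighted least squares is my own choice of the simplest alternative; the data and the error bars are made up, and `fit_line` is a hypothetical helper, not a function from any package.

```python
def fit_line(x, y, sigma=None):
    """Fit y = a*x + b by least squares and return (a, b).

    sigma: optional per-point error standard deviations. With sigma=None all
    points get equal weight (OLS); otherwise each point is weighted by
    1/sigma^2 (weighted least squares).
    """
    if sigma is None:
        w = [1.0] * len(x)
    else:
        w = [1.0 / (s * s) for s in sigma]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    a = sxy / sxx
    b = ybar - a * xbar
    return a, b


# Made-up data: the last point is far off the trend and has a large quoted error.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 14.0]
sigma = [0.2, 0.2, 0.2, 0.2, 5.0]

print("OLS (equal weights):", fit_line(x, y))
print("WLS (1/sigma^2 weights):", fit_line(x, y, sigma))
```

With equal weights the poorly measured point pulls the slope up; with the weights it barely matters. This is only the first step away from OLS, and chapter 3 goes much further.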

]]>**2010-apr-30:** Aneta has set up a blogspot site to deal with simple Sherpa techniques and tactics: http://pysherpa.blogspot.com/

On Help:

- In general, to get help, use: `ahelp "something"` (note the quotes)
- Even more useful, type: `? wildcard` to get a list of all commands that include the `wildcard`
- You can also do a form of autocomplete: type TAB after writing half a command to get a list of all possible completions.

Data I/O:

- To read in your PHA file, use: `load_pha()`
- Often for Chandra spectra, the background is included in that same file. In any case, to read it in separately, use: `load_bkg()`
  - Q: should it be loaded in to the same dataID as the source?
  - A: Yes.
  - A: When the background counts are present in the same file, they can be read in separately and assigned to the background via `set_bkg('src',get_data('bkg'))`, so counts from a different file can be assigned as background to the current spectrum.
- To read in the corresponding ARF, use: `load_arf()`
  - Q: `load_bkg_arf()` for the background: should it be done before or after `load_bkg()`, or does it matter?
  - A: does not matter
- To read in the corresponding RMF, use: `load_rmf()`
  - Q: `load_bkg_rmf()` for the background, and same question as above
  - A: same answer as above; does not matter.
- To see the structure of the data, type: `print(get_data())` and `print(get_bkg())`
- To select a subset of channels to analyze, use: `notice_id()`
- To subtract background from source data, use: `subtract()`
- To not subtract, to undo the subtraction, etc., use: `unsubtract()`
- To plot useful stuff, use: `plot_data(), plot_bkg(), plot_arf(), plot_model(), plot_fit(),` etc.
- (Q: how in god’s name does one avoid plotting those damned error bars? I know error bars are necessary, but when I have a spectrum with 8192 bins, I don’t want it washed out with silly 1-sigma Poisson bars. And while we are asking questions, how do I change the units on the y-axis to counts/bin? A: rumors say that `plot_data(1,yerr=0)` should do the trick, but it appears to be still in the development version.)

Fitting:

- To fit model to the data, command it to: `fit()`
- To get error bars on the fit parameters, use: `projection()` (or `covar()`, but why deliberately use a function that is guaranteed to underestimate your error bars?)
- Defining models appears to be much easier now. You can use syntax like: `set_source(ModelName.ModelID+AnotherModel.ModelID2)` (where you can distinguish between different instances of the same type of model using the ModelID, e.g., `set_source(xsphabs.abs1*powlaw1d.psrc+powlaw1d.pbkg)`)
- To see what the model parameter values are, type: `print(get_model())`
- To change statistic, use: `set_stat()` (options are various `chisq` types, `cstat`, and `cash`)
- To change the optimization method, use: `set_method()` (options are `levmar, moncar, neldermead, simann, simplex`)
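For those wondering what the `cash` and `cstat` options compute, here is a plain Python sketch of the two Poisson-likelihood statistics in their commonly quoted forms. It does not call Sherpa at all; the function names and the counts below are my own illustration, and I am assuming Sherpa’s definitions match these textbook forms.

```python
import math


def cash(data, model):
    """Cash statistic: 2 * sum(m - d * ln m) over bins."""
    return 2.0 * sum(m - d * math.log(m) for d, m in zip(data, model))


def cstat(data, model):
    """C-statistic: 2 * sum(m - d + d * (ln d - ln m)), taking 0 * ln 0 as 0."""
    total = 0.0
    for d, m in zip(data, model):
        total += m - d
        if d > 0:
            total += d * (math.log(d) - math.log(m))
    return 2.0 * total


counts = [3, 0, 5, 2]              # made-up observed counts per bin
model_a = [2.5, 0.5, 4.0, 3.0]     # made-up predicted counts (must be > 0)
model_b = [3.0, 1.0, 5.0, 2.0]

print("cash :", cash(counts, model_a), cash(counts, model_b))
print("cstat:", cstat(counts, model_a), cstat(counts, model_b))
```

The two statistics differ only by a term that depends on the data alone, so they rank models identically; `cstat` has the convenience of being zero when the model reproduces the counts exactly.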

`v1:2007-dec-18`

`v2:2008-feb-20`

`v3:2010-apr-30`

]]>When I was teaching statistics, although the courses were undergraduate courses, there were both undergraduate and graduate students from various fields, but no astrophysics majors. I wondered why they were not encouraged to take some basic statistics whereas they were encouraged to take some computer science courses. As there are many astronomers good at programming and designing tools, I’m sure that recommending that students take statistics courses will renovate astronomical data analysis procedures (beyond Bevington’s book) and the theories behind them (statistics and mathematics per se, not physical laws).

Here’s an interesting lecture on developing a curriculum for the new era in computer science and on why basic probability theory and statistics are important for raising versatile computer scientists. It could be a bit outdated now because I saw it several months ago.

A little more than halfway through the lecture, he emphasizes that a probability course should be part of the computer science curriculum. I wonder whether any astronomy professor makes similar arguments and stresses the need for young future astrophysicists to learn basic probability theory, in order to prevent the many misuses of statistics that appear in the astronomical literature. In particular, confusion between fitting (estimating) and inference (both model assessment and uncertainty quantification) is frequently observed in papers whose authors claim superior statistics and statistical data analysis. I sometimes attribute this confusion to the lack of distinction between what is random and what is deterministic, or to a strong belief that the observed and processed data are free of errors and of any probabilistic nature.

Many introductory books present very interesting problems, many with historical origins, to introduce probability theory (with many anecdotes). One can check out the very basics, **probability axioms** and **measurable function**, on Wikipedia. With examples, probability is high school or lower level math that you already know, but because of the jargon you’ll want to go over the vocabulary many times until you get used to the foundations, the basics, and their theories.

We often mention **measurable** to discuss random variables, uncertainties, and distributions without verbosity. “Assume **measurable space** … ” saves multiple paragraphs in an article and changes the structure of writing. This short adjective implies so many assumptions depending on statistical models and equations that you are using for best fits and error bars.

Consider a luminosity function (LF) that is truncated due to observational limits. The common practice I have seen is to draw a histogram in such a way that adaptive binning makes the overall shape resemble part of a bell-shaped curve. Thanks to its smoothed look, scientists impose a Gaussian curve on the partially observed data and find the parameter estimates that determine the shape of this Gaussian curve. There is no imputation step that fills in the unobserved points to make up the full probability space. The parameter space of the Gaussian curves frequently does not coincide with the physically feasible space; however, such discrepancies are rarely discussed in the astronomical literature, and the resulting biased results seem to be taboo.
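A toy version of this truncation problem can be sketched as follows. Everything here is an assumption made for illustration: the data are simulated from a Gaussian with known parameters, the observational limit is a hard cut, and the fit is a coarse grid search over the truncated-Gaussian likelihood rather than anything used in practice. The point is only to compare the plain average of the observed values with an estimate that accounts for the cut.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_loglike(mu, sigma, n, sx, sxx, cut):
    """Log-likelihood of n values from N(mu, sigma^2) observed only above `cut`.

    sx and sxx are the sum and the sum of squares of the observed values.
    Constant terms that do not depend on (mu, sigma) are dropped.
    """
    ss = sxx - 2.0 * mu * sx + n * mu * mu          # sum of (x - mu)^2
    p_obs = 1.0 - normal_cdf((cut - mu) / sigma)    # P(X > cut)
    return -0.5 * ss / (sigma * sigma) - n * math.log(sigma) - n * math.log(p_obs)


rng = random.Random(1)
true_mu, true_sigma, cut = 0.0, 1.0, -0.5           # made-up truth and limit
full = [rng.gauss(true_mu, true_sigma) for _ in range(8000)]
obs = [x for x in full if x > cut]                  # only these are "observed"

n = len(obs)
sx = sum(obs)
sxx = sum(x * x for x in obs)
naive_mu = sx / n                                   # ignores the truncation

# Coarse grid search: mu in [-1.5, 1.5], sigma in [0.5, 2.0], step 0.01.
best = max(
    (truncated_loglike(m / 100.0, s / 100.0, n, sx, sxx, cut), m / 100.0, s / 100.0)
    for m in range(-150, 151)
    for s in range(50, 201)
)
_, mle_mu, mle_sigma = best

print("naive mean of observed values:", round(naive_mu, 3))
print("truncated-likelihood estimate:", mle_mu, mle_sigma)
```

The naive mean sits well above the true value because the faint end is missing, while the truncated likelihood divides by the probability of being observed and so does not need the missing points to be imputed.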

Although astronomers emphasize the importance of uncertainties, neither the factorization nor the stratification of uncertainties has ever been made clear (model uncertainty, systematic uncertainty or bias, statistical uncertainty or variance). Hierarchical relationships or correlations among these different uncertainties are never fully addressed. The basics of probability theory and an understanding of random variables would help to characterize uncertainties in both the mathematical and the astrophysical sense. This knowledge would also assist in quantifying these characterized uncertainties appropriately.

Statistical models are rather simple compared to astrophysical models. However, statistics is the science of understanding uncertainty and randomness, and therefore some strategies for transcribing complicated astrophysical models into statistical models are necessary in order to reflect the probabilistic nature of the observations (or of the parameters, in Bayesian modeling). Both raw and processed data manifest the behavior of random variables. Their underlying processes determine not only the physics models but also the statistical models, written in terms of random variables and the link functions connecting physics and uncertainties. To the best of my understanding, bridging to and inventing statistical models for astrophysical research seems tough because of the lack of awareness of the basics of probability theory.

Once I had a chance to observe a Decadal Survey meeting, which covered very diverse areas of astronomy. They discussed new projects, advancing current projects, career development, and a little bit about educating professional astronomers, apart from public outreach (which often receives more attention than the university curriculum; I also believe that widespread public awareness of astronomy is very important). What I missed while observing the meeting were interdisciplinary knowledge transfer efforts to broaden the field of astronomy and astrophysics, and ideas for curriculum design. Because of its long history, I had thought astronomy was a science of everything. Marching along one path for a long time has made astronomy more or less the most isolated and exclusive science.

Perhaps asking astronomy majors to take multiple statistics courses is too burdensome; it is more realistic, and what I anticipate, that faculty who specialize in (statistical) data analysis organize a data analysis course and incorporate several hours of basic probability. With a few hours spent on the fundamental notions of random variables and probability, claims of “statistically rigorous methods and powerful results” will become more appropriate. Statistics is a science, but in the astronomy literature it currently looks more or less like an adjective that modifies methods and results, like “powerful”, “superior”, “excellent”, “better”, “useful,” and so on. The basics of probability are easily incorporated into an introduction to algorithms for designing experiments and to optimization methods, which are currently used in a brute force fashion^{[1]}.

Occasionally, I see gems on arXiv written by astronomers. Their expertise in astronomy and their interest in statistics have produced intriguing accounts of statistically rigorous data analysis and inference procedures. Their papers include explanations of the fundamentals of statistics and probability that are more appropriate for astronomers than statistics textbooks written for scientists and engineers in other fields. I wish more astronomers would join this venture, knowing the basics and the diversity of statistics, to rectify the many unconscious misuses of statistics that occur when people argue that their choice of statistic is the most powerful one thanks to plausible results.

- What I mean by a brute force fashion is trying all the methods listed in the software manual and then stating that method A gave the most plausible values, those that match the data in a scatter plot.

There are quite a few websites dedicated to Python, as you already know. Some of them talk only to astronomers. A tiny fraction of those websites are for statisticians, but I haven’t met any statistician who prefers only Python. We take the gist of various languages. So I’ll leave it to a general website aggregation, such as AstroPy (I think this website is extremely useful for astronomers), to enrich your bookmarks under the “python” tab regardless of your profession. Instead, I’ll discuss some Python libraries and modules that can be useful for those practicing astrostatistics and can make their work easier. I must say that I intentionally omitted a few modules because I was not sure whether they are public or how sensitive their copyrights are. If you have modules that can be introduced publicly, let me know. I’ll be happy to add them. If my description is improper and you want it taken down, also let me know.

Over the past few years, Python has become the most common and versatile scripting language for both communities, and therefore, I believe, it will accelerate many collaborations. Much of my time is spent finding out how to read, manipulate, and handle raw data and images. Most of the astronomers’ tactics are quite unfamiliar, sometimes incomprehensible, to me (see my read.table() and data analysis system and its documentation). One scripting language, thanks to its being open and free to all communities, promises to narrow the gap and enable prosperous and efficient collaborations: **Python**.

The first posting on this slog was about __Python.__ I thought that kicking off with a computer language that is relatively new and open to many communities could motivate me and others to do more diverse interdisciplinary work. After a few years, unfortunately, I haven’t achieved that goal. Yet I still think that the libraries and modules introduced below will be useful for your transition from other programming languages, or for writing your own pro bono wrappers for better communication with others.

I’ll take numpy, scipy, and RPy for granted. For plotting, matplotlib seems to be the most common.

**Reading astronomical data** (click links to download libraries, modules, and tutorials)

- First, start with Using Python for Interactive Data Analysis (in pdf). It is quite a useful manual, particularly for IDL users. It compares the pros and cons of Python and IDL.
- IDLsave. Simply put, a .save file becomes readable without IDL. This is a brilliant small module.
- PyRAF (I was really frustrated with IRAF and spent many sleepless nights. Apart from data reduction, I don’t remember much statistics in IRAF except simple statistics for Gaussian populations. I guess PyRAF does a better job). And there’s PyFITS for handling FITS format data.
- APLpy (the Astronomical Plotting Library in Python) is a Python module aimed at producing publication-quality plots of astronomical imaging data in FITS format (this introduction is copied from the APLpy site).

**Statistics, Mathematics, or data science**

Because of RPy, introducing smaller modules may not seem worthwhile, but quite a few modules and libraries for statistics that do not rely on R are available.

- MDP (Modular toolkit for Data Processing). Multivariate data analysis methods like PCA, ICA, FA, etc. have become very popular in the astronomical community.
- pywavelets (Beyond the Fourier transform, various transformation methodologies are often used, and the wavelet transform ranks at the top).
- PyIMSL (see my post, PyIMSL)
- PyMC. I introduced this module a century ago. It may lack versatility or robustness because of its parametric distribution objects, but I very much liked the tutorial, from which one can expand and devise one’s own working MCMC algorithm.
- PyBUGS (I introduced this Python wrapper in BUGS, but the link to PyBUGS no longer works. I hope it revives.)
- SAGE (Software for Algebra and Geometry Experimentation) is a free open-source mathematics software system licensed under the GPL (Link to the online tutorial).
- python_statlib: descriptive statistics for the Python programming language.
- PYSTAT. Nice website, but the product is not available yet. Be aware! It is not PhyStat!!!

**Module for AstroStatistics**

**import inference** (Unfortunately, the links to examples and tutorial are not available currently)

Without clear objectives, it is not easy to pick up a new language. If you are used to working with one from the alphabet soup, you will most likely stick to your choice. Switching languages only happens when your instructor specifically asks you to use a preferred language, or when analysis {modules, libraries, tools} are only available in that language. Somehow, thanks to its object-oriented style, Python makes transition and communication easier than other languages do. Furthermore, scripting languages are more intuitive and easier to read.

]]>As a part of introducing nonparametric statistics, I wanted to write about applications of computational geometry from the perspective of nonparametric two- and three-dimensional density estimation. Also, the following article came along when I had just begun to collect statistical applications in astronomy (my [ArXiv] series). This [arXiv] paper, in fact, prompted me to investigate Voronoi tessellations in astronomy in general.

[arxiv/astro-ph:0707.2877]

Voronoi Tessellations and the Cosmic Web: Spatial Patterns and Clustering across the Universe

by Rien van de Weygaert

Since then, quite some time has passed. In the meantime, I have found more publications in astronomy that specifically use **tessellation** as a main tool for nonparametric density estimation and data analysis. Nonetheless, topics in spatial statistics generally tend to be unrecognized or almost ignored in analyzing astronomical spatial data (I mean data points with coordinate information). Many seem to use statistics only partially or not at all. Some might want to know how often **Voronoi tessellation** is applied in astronomy. Here, I list the results of my ADS search, limited to “tessellation” in title keywords:

- [arxiv/astro-ph:0110259] Detecting Clusters of Galaxies in the Sloan Digital Sky Survey I : Monte Carlo Comparison of Cluster Detection Algorithms, by Kim, R.S.J. et al. (2002) AJ, 123, pp.20-36.
- [arxiv/astro-ph:0906.1905] The VOISE Algorithm: a Versatile Tool for Automatic Segmentation of Astronomical Images, by *Guio, P. and Achilleos, N.* (2009)
- Using Voronoi Techniques to determine the shapes of photon sources, by *Wilkinson and Meurs*, Irish Astronomical Journal, 1998, 25(1), 37
- High-order 3D Voronoi tessellation for identifying isolated galaxies, pairs and triplets, by *Elyiv, A.; Melnyk, O.; Vavilova, I.* 2009..MNRAS..394..1409E
- 3-D Voronoi’s Tessellation as a Tool for Identifying Galaxy Groups, by *Melnyk, Olga V.; Elyiv, Andrii A.; Vavilova, Iryna B.* 2007..IAUS..235..223M
- Adaptive binning of X-ray data with weighted Voronoi tessellations, by *Diehl, Steven; Statler, Thomas S.* 2006..MNRAS..368..497D
- Adaptive spatial binning of integral-field spectroscopic data using Voronoi tessellations, by *Cappellari, M. and Copin, Y.* 2003..MNRAS..342..345C
- Adaptive Spatial Binning of 2D Spectra and Images Using Voronoi Tessellations, by *Cappellari, M.; Copin, Y.* 2002..ASPC..282..515CA
- Finding galaxy clusters using Voronoi tessellations, by *Ramella, M.; Boschin, W.; Fadda, D.; Nonino, M.* 2001..A&A…368..776R
- The Forest Method as a New Parallel Tree Method with the Sectional Voronoi Tessellation, by *Yahagi, Hideki; Mori, Masao; Yoshii, Yuzuru* 1999..ApJS..124..1
- Cluster Identification via Voronoi Tesselation ..1999..ASPC..176..108
- The accuracy of parameters determined with the core-sampling method: Application to Voronoi tessellations 1997..A&AS..123..495
- Dynamical Voronoi tessellation. V. Thickness and incompleteness, by *Zaninetti, L* 1995..A&AS..109..71
- Fragmenting the Universe. 3: The constructions and statistics of 3-D Voronoi tessellations, by *van de Weygaert, Rien* 1994..A&A..283..361
- Dynamical Voronoi tessellation. IV. The distribution of the asteroids, by *Zaninetti, L* 1993..A&A..276..255
- Quasi-periodic structures in the large-scale galaxy distribution and three-dimensional Voronoi tessellation 1991..MNRAS..250..519
- Dynamical Voronoi tessellation. III – The distribution of galaxies, by *Zaninetti, L* 1991..A&A..246..291
- Dynamical Voronoi tessellation. II – The three-dimensional case, by *Zaninetti, L* 1990..A&A..233..293
- Dynamical Voronoi tessellation. I – The two-dimensional case, by *Zaninetti, L* 1989..A&A..224..345

Then the topic was forgotten for a while until this recent [arXiv] paper, which reminded me of my old intention to introduce **tessellation** for density estimation and for understanding large scale structures or clusters (astronomers’ jargon, not the term in machine or statistical learning).

[arxiv:stat.ME:0910.1473]

Moment Analysis of the Delaunay Tessellation Field Estimator

by M.N.M van Lieshout

Looking at the plots in the papers by van de Weygaert or van Lieshout, one can immediately understand, without mathematical jargon and abstraction, what **Voronoi and Delaunay tessellations** are (Delaunay tessellation is also called Delaunay triangulation (wiki). Perhaps you will want to check out wiki:Delaunay Tessellation Field Estimator as well). Voronoi tessellations have been adopted in many scientific and engineering fields to describe spatial distributions. Astronomy is no exception. Voronoi tessellation has been used for field interpolation.

van de Weygaert described Voronoi tessellations as follows:

- the asymptotic frame for the ultimate matter distribution,
- the skeleton of the cosmic matter distribution,
- a versatile and flexible mathematical model for weblike spatial pattern, and
- a natural asymptotic result of an evolution in which low-density expanding void regions dictate the spatial organization of the Megaparsec universe, while matter assembles in high-density filamentary and wall-like interstices between the voids.

van Lieshout derived explicit expressions for the mean and variance of the Delaunay Tessellation Field Estimator (DTFE) and showed that for stationary Poisson processes, the DTFE is asymptotically unbiased with a variance that is proportional to the squared intensity.
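To give a feel for how a tessellation turns point positions into a density estimate, here is a deliberately simplified one-dimensional sketch. It is not the estimator from either paper: in 1D the Voronoi cell of a point is just the interval between the midpoints to its two neighbours, and the intensity is estimated as one point per cell length. As far as I can tell, the DTFE value at a data point reduces to the same expression in 1D, but the 2D and 3D cases in the papers need a proper Delaunay triangulation. The function name and the sample points are made up, and the points are assumed to be distinct.

```python
def voronoi_intensity_1d(points):
    """Estimate intensity (points per unit length) at each interior point.

    In one dimension the Voronoi cell of a point runs between the midpoints
    to its two neighbours, so its length is (x[i+1] - x[i-1]) / 2 and the
    estimate is the reciprocal of that length. The two end points have
    unbounded cells and are skipped. Points are assumed to be distinct.
    """
    xs = sorted(points)
    estimates = []
    for i in range(1, len(xs) - 1):
        cell_length = 0.5 * (xs[i + 1] - xs[i - 1])
        estimates.append((xs[i], 1.0 / cell_length))
    return estimates


# Made-up positions: a dense clump near 1.0 and sparse points elsewhere.
sample = [0.0, 0.9, 1.0, 1.1, 1.2, 3.0, 5.0]
for position, intensity in voronoi_intensity_1d(sample):
    print(position, round(intensity, 2))
```

Small cells mean high density and large cells mean low density, with no bin width or kernel bandwidth to choose, which is the nonparametric appeal of the approach.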

We’ve observed voids and filaments of cosmic matter with patterns for which a theory hasn’t been discovered. In general, those patterns are manifested via observed galaxies, both directly and indirectly. Individual observed objects, I believe, can be matched to the points that generate Voronoi polygons. Each object represents its polygon, and investigating the polygons’ distributional properties helps to understand the formation rules and theories of those patterns. For that matter, probably, various topics in stochastic geometry, not just Voronoi tessellation, can be adopted.

There is a plethora of information available on Voronoi tessellation, such as the website of the International Symposium on Voronoi Diagrams in Science and Engineering. Two recent meeting websites are ISVD09 and ISVD08. Also, the following review paper is interesting.

Centroidal Voronoi Tessellations: Applications and Algorithms (1999) Du, Faber, and Gunzburger in SIAM Review, vol. 41(4), pp. 637-676

By the way, you may have noticed my preference for Voronoi tessellation over Delaunay. It is owing to the characteristic of the centroidal Voronoi tessellation that each observation is the center of its Voronoi cell, as opposed to the property of the Delaunay triangulation that multiple simplices are associated with one observation/point. However, from the perspective of understanding the distribution of observations as a whole, both approaches offer summaries and insights in a nonparametric fashion, which is what I value most.

]]>
[arXiv:stat.ME:0910.2585]

Variable Selection and Updating In Model-Based Discriminant Analysis for High Dimensional Data with Food Authenticity Applications

by *Murphy, Dean, and Raftery*

Classifying or clustering spectra (or semi-supervised learning on them) is a very challenging problem, from collecting statistical-analysis-ready data to reducing the dimensionality without sacrificing the complex information in each spectrum. The challenge is not only how to estimate spiky (non-differentiable) curves via statistically well-defined procedures of estimating equations, but also how to transform the data so that they match the regularity conditions in statistics.

Another reason that classification and clustering of astrophysical spectroscopic data is more difficult is that the observed lines, and their intensities and FWHMs on top of the continuum, are related to atomic databases and to latent variables/hyperparameters (distance, rotation, absorption, column density, temperature, metallicity, types, system properties, etc.). Frequently it becomes a very challenging mixture problem to separate lines from each other and from the continuum (boundary and identifiability issues). This complexity appears only in astronomical spectroscopic data because we only get indirect or uncontrolled data ruled by physics, as opposed to the meat species spectra in the paper. Spectroscopic data outside astronomy are rather smooth, are observed in a controlled wavelength range, and carry no worries about correcting for recession/radial velocity/redshift/extinction/lensing/etc.

Although the part most relevant to astronomers, i.e. spectroscopic data processing, is not discussed in this paper, the most important part, the application of statistical learning to complex curves (spectral data), is well described. Some astronomers with appropriate data may like to try the variable selection strategy and to check out the classification methods in statistics. If it works out, it might save space for storing spectral data and time for collecting high resolution spectra. Please keep in mind that it is not necessary to use the same variable selection strategy. Astronomers can create better working versions for classification and clustering purposes, like hardness ratios, which are often used to reduce the dimensionality of spectral data since low total count spectra are not informative over the full energy (wavelength) range. Curse of dimensionality!
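As a small illustration of how a hardness ratio collapses a spectrum into a single number, here is a minimal sketch using the common definition (H - S) / (H + S), where S and H are the counts in a soft and a hard band. The function name, the 2 keV band boundary, and the photon energies are assumptions chosen for illustration only.

```python
def hardness_ratio(energies_kev, split_kev=2.0):
    """Reduce a list of photon energies (keV) to (H - S) / (H + S)."""
    soft = sum(1 for e in energies_kev if e < split_kev)
    hard = sum(1 for e in energies_kev if e >= split_kev)
    total = soft + hard
    if total == 0:
        raise ValueError("no counts: the hardness ratio is undefined")
    return (hard - soft) / total

# Hypothetical low-count spectrum: five photons, three soft and two hard.
photons = [0.5, 1.2, 1.8, 3.0, 5.5]
print(hardness_ratio(photons))
```

With so few counts, the ratio itself is very noisy, which is why its uncertainty needs as much care as its value.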

]]>From SINGS (Spitzer Infrared Nearby Galaxies Survey): Isn’t it a beautiful Hubble tuning fork?

As a first-year graduate student in statistics, I enrolled in Prof. C.R. Rao’s “multivariate analysis” class without thinking much, because of the rumor that he wouldn’t teach any more and because of his fame as the most famous statistician alive. Everything was smooth and easy for him, and he had an incredible memory for equations and proofs. However, I only grasped intuitive concepts, like why a method works, not the details of the mathematics, theorems, and proofs. Instantly, I began to think about how the methods could be applied to astronomical data. After a few lessons, I desperately wanted to try out multivariate analysis methods to classify galactic morphology.

The dream died shortly afterward because there was no data set that could be properly fed into statistical methods for classification. I spent quite some time searching astronomical databases, including ADS. This was before SDSS or VizieR became as popular as they are now. Then I thought about applying the methods to classify supernovae, because understanding the patterns of their light curves tells a lot about the history of our universe (Type Ia SNe are standard candles) and because I knew of some publicly available SN light curves. Immediately, I realized that individual light curves are biased from the sampling perspective, and I did not know how to correct them for multivariate analysis. I also thought about applying multivariate analysis methods to stellar spectral types and to stars in different mechanical systems (single, binary, association, etc.). I thought about how to apply the newly learned methods to every astronomical object that I had learned about, from sunspots to AGNs.

Regardless of the target objects to be scrutinized under this fascinating subject of “multivariate analysis,” two factors kept discouraging me. One was that I didn’t have enough training to develop, in a couple of weeks, new statistical models that reflect the unique statistical challenges embedded in the data: missing values, irregularities, non-iid observations, outliers, and other features that are hard to transcribe into a statistical setting. The other, which was more critical, was that there was no accessible astronomical database repository for statistical learning. Without deep knowledge of astronomy and trained skills in handling astronomical data, catalogs are generally useless. The catalogs and data sets in archives are different from the data sets in machine learning data repositories, which are intuitive.

Astronomers might think that analyzing toy/mock data sets is not scientific because it does not lead to any new discovery of the kind they always make. From a data analyst’s viewpoint, scientific advances mean finding tools that summarize data in an optimal manner. As I demanded in Astroinformatics, methods for retrieving information can be attempted and validated with well understood, astrophysically devastated data sets. The Pythagorean theorem was not proved just once; there are 39 different ways to prove it.

Seeing this nice poster image (the full-resolution image of 56MB is available from the link) brought back memories of my enthusiasm for applying statistical learning methods to better knowledge discovery. As you can see, there are many different types of galaxies and oftentimes there is no clear boundary between them. Consider classifying blurry galaxies by eye: a spiral can be classified as an irregular, for example. Although I wish for automatic classification of these astrophysical objects, it is difficult to compose a training set for classification or to collect data from distinct manifold groups for clustering, so machine learning procedures are as complicated to develop as the complexity this tuning fork shows. The complex topology of astronomical objects seems to be the primary reason that statistical learning applications are lacking compared to other fields.

Nonetheless, multivariate analysis can be useful for viewing relations from different perspectives, apart from known physics models. It may help to develop more finely tuned physics models by taking into account latent variables found through statistical learning processes. Such attempts, I believe, can assist astronomers in designing telescopes and in inventing efficient ways to collect/analyze data, by revealing which features are more significant than others for understanding the morphological shapes of galaxies, patterns in light curves, spectral types, etc. When such experiences accumulate, different insights into physics can kick in, just as scientists scrambled and assembled galaxies into a tuning fork, which led to the development of various evolution models.

To make a long story short, you have two choices: one, just enjoy these beautiful pictures and apprehend the complexity of our universe, or two, let this picture of Hubble’s tuning fork inspire you toward advances in astroinformatics. Whichever path you choose, it’s worth your time.

]]>- Space Weather Research Lab at NJIT
- SEEDS — Solar Eruptive Event Detection System at George Mason University.
- CACTUS: a software package for ‘Computer Aided CME Tracking’
- SRON in the Netherlands

These seem quite informative, and I believe more statisticians and data scientists (signal and image processing, machine learning, computer vision, and data mining) could easily collaborate with solar physicists. All the complexity, as a matter of fact, comes from processing the data to be fed into (machine, statistical) learning algorithms and from defining the objectives of learning. Once these are settled, one can easily apply numerous methods in the field to these time-varying solar images.

I’m writing this short posting because I finally found the interesting articles that I collected for my previous post on Space Weather. After finding them and scanning through them, I realized that, methodology-wise, they only made baby steps. You’ll see that a limited number of keywords are repeated, although there is a humongous community of scientists and engineers in knowledge discovery and data mining.

Note that the objectives of these studies are quite similar. They describe machine learning for the purpose of automating the detection of features of interest on the Sun and possibly forecasting relevant phenomena that affect our own atmosphere due to the associated solar activity.

- *Automated Prediction of CMEs Using Machine Learning of CME – Flare Associations* by Qahwaji et al. (2008) in Solar Phy. vol. 248, pp. 471-483.
- *Automatic Short-Term Solar Flare Prediction using Machine Learning and Sunspot Associations* by Qahwaji and Colak (2007) in Solar Phy. vol. 241, pp. 195-211

Space weather is defined by the U.S. National Space Weather Program (NSWP) as “conditions on the Sun and in the solar wind, magnetosphere, ionosphere, and thermosphere that can influence the performance and reliability of space-borne and ground-based technological systems and can endanger human life or health.”

Personally, I think the section on “jackknife” needs to be replaced with “cross-validation.”

- *Automatic Detection and Classification of Coronal Mass Ejections* by Qu et al. (2006) in Solar Phy. vol. 237, pp. 419-431.
- *Automatic Solar Filament Detection Using image Processing Techniques* by Qu et al. (2005) in Solar Phy. vol. 228, pp. 119-135
- *Automatic Solar Flare Tracking Using Image-Processing Techniques* by Qu et al. (2004) in Solar Phy. vol. 222, pp. 137-149
- *Automatic Solar Flare Detection Using MLP, RBF, and SVM* by Qu et al. (2003) in Solar Phy. vol. 217, pp. 157-172.

I’d like to add a survey paper on another type of learning method, beyond the Support Vector Machine (SVM) used in almost all the articles above. Luckily, this survey paper happened to address my concern about the “practices of background subtraction” in high energy astrophysics.

A Survey of Manifold-Based Learning methods by Huo, Ni, Smith

[Excerpt] What is Manifold-Based Learning?

It is an emerging and promising approach in nonparametric dimension reduction. The article reviewed principal component analysis, multidimensional scaling (MDS), generative topological mapping (GTM), locally linear embedding (LLE), ISOMAP, Laplacian eigenmaps, Hessian eigenmaps, and local tangent space alignment (LTSA). Apart from these revisits and comparisons, this survey paper is useful for understanding the danger of background subtraction. Homogeneity does not mean that a constant background can be subtracted; doing so often causes negative source observations.

More collaborations among multiple disciplines are desired in this relatively new field. For me, it is one of the best data and information science fields of the 21st century, and any progress will be beneficial to humankind.

- I must acknowledge him for his kindness and patience. He was my Wikipedia for questions while I was studying the Sun.

http://adsabs.harvard.edu/abs/2009MNRAS.395.1733W.

Title: Compressed sensing imaging techniques for radio interferometry

Authors: Wiaux, Y. et al.

Abstract: Radio interferometry probes astrophysical signals through incomplete and noisy Fourier measurements. The theory of compressed sensing demonstrates that such measurements may actually suffice for accurate reconstruction of sparse or compressible signals. We propose new generic imaging techniques based on convex optimization for global minimization problems defined in this context. The versatility of the framework notably allows introduction of specific prior information on the signals, which offers the possibility of significant improvements of reconstruction relative to the standard local matching pursuit algorithm CLEAN used in radio astronomy. We illustrate the potential of the approach by studying reconstruction performances on simulations of two different kinds of signals observed with very generic interferometric configurations. The first kind is an intensity field of compact astrophysical objects. The second kind is the imprint of cosmic strings in the temperature field of the cosmic microwave background radiation, of particular interest for cosmology.

As discussed, reconstructing images from noisy observations is typically considered an ill-posed problem or an inverse problem. Owing to my own lack of understanding of image reconstruction from radio interferometry observations, which are samples in Fourier space inverted via the inverse Fourier transform, I cannot judge how good this new adaptation of compressed sensing for radio astronomical imagery is. I think, however, that **compressed sensing** will take over from many traditional image reconstruction tools because of their shortcomings in handling sparsely represented large data/images.

Please check my old post on compressed sensing for more references on the subject, like the Rice University repository, in addition to the references in Wiaux et al. It’s a new and exciting field with countless applications, already enjoying wide popularity in many scientific and engineering fields. My thought is that well developed compressed sensing algorithms might resolve bandwidth issues in satellite observations/communication by transmitting more images within fractional temporal intervals for improved image reconstruction.
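To make the idea of recovering a sparse signal from too few measurements a little more concrete, here is a minimal sketch of iterative soft thresholding (ISTA), one simple way to minimize the l1-penalized least squares objective that appears in many compressed sensing formulations. It is not the algorithm of Wiaux et al. The random Gaussian matrix merely stands in for the incomplete Fourier sampling of an interferometer, and the toy signal, the penalty `lam`, and all function names are my own assumptions.

```python
import random

def soft_threshold(v, t):
    """Shrink v toward zero by t (the proximal operator of t * |v|)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def matvec(A, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rmatvec(A, r):
    """Compute A^T r without forming the transpose."""
    return [sum(A[i][j] * r[i] for i in range(len(A)))
            for j in range(len(A[0]))]

def objective(A, y, x, lam):
    """0.5 * ||A x - y||^2 + lam * ||x||_1"""
    resid = [p - q for p, q in zip(matvec(A, x), y)]
    return 0.5 * sum(r * r for r in resid) + lam * sum(abs(v) for v in x)

def ista(A, y, lam, n_iter=500):
    # The squared Frobenius norm bounds the largest eigenvalue of A^T A,
    # so this step size is conservative but never too large.
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * len(A[0])
    for _ in range(n_iter):
        resid = [p - q for p, q in zip(matvec(A, x), y)]
        grad = rmatvec(A, resid)
        x = [soft_threshold(xi - step * g, step * lam)
             for xi, g in zip(x, grad)]
    return x

random.seed(0)
m, n = 20, 40                         # fewer measurements than unknowns
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
x_true = [0.0] * n
x_true[3], x_true[17], x_true[30] = 2.0, -1.5, 1.0   # sparse toy signal
y = matvec(A, x_true)                 # noiseless toy measurements
lam = 0.1
x_hat = ista(A, y, lam)
print(objective(A, y, [0.0] * n, lam), objective(A, y, x_hat, lam))
```

The soft threshold is what encodes the sparsity prior: small coefficients are set exactly to zero at every iteration, while the gradient step keeps the estimate consistent with the measurements.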

]]>The need for source separation methods in astronomy has led to various adaptations of the available decomposition methods. It is not difficult to locate those applications in journals of various fields, including astronomical journals. However, they most likely promote one dimension reduction method of their choice over others, to emphasize that their strategy works better. I rarely come across a paper that gathers and summarizes the component separation methods applicable to astronomical data. In that regard, the following paper seems a useful overview of methods for reducing dimensionality for astronomers.

[arxiv:0805.0269]

Component separation methods for the Planck mission

S.M.Leach et al.

Check its appendix for method description.

Various libraries/modules are available through software/data analysis systems, so one can try various dimension reduction methods conveniently. The only concern I have is the challenge of interpretation after these computational/mathematical/statistical analyses, namely *how to assign a physical interpretation to images/spectra produced by decomposition*. I think this is a big open question.
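For readers who have never looked inside such a module, here is a minimal sketch of the simplest dimension reduction step: finding the first principal component of a data set by power iteration on the sample covariance matrix. It is written in plain Python with made-up toy data; the function name and the data are assumptions for illustration, not any method from Leach et al.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Leading eigenvector of the sample covariance, by power iteration.

    Assumes the data are not constant and that the all-ones start vector
    is not orthogonal to the leading eigenvector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data lying exactly along the direction (1, 2).
rows = [(t, 2.0 * t) for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
pc1 = first_principal_component(rows)
print(pc1)
```

The output is only a direction in data space. Nothing in the computation says whether that direction corresponds to a foreground, an instrument effect, or a physical component, which is exactly the interpretation problem raised above.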