The AstroStat Slog » machine learning
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

[Book] The Elements of Statistical Learning, 2nd Ed.
http://hea-www.harvard.edu/AstroStat/slog/2010/book-the-elements-of-statistical-learning-2nd-ed/
Thu, 22 Jul 2010 13:25:44 +0000 hlee

This was written more than a year ago, and I forgot to post it.

I’ve noticed rapidly growing interest in data mining and machine learning among astronomers, but the level of execution is still rudimentary or partial because there has been no comprehensive, tutorial-style book written for them. I recently introduced a machine learning book written by an engineer. Although it is a very good book, it did not convey the foundation of machine learning built by statisticians. In the quest for another good book that satisfies astronomers’ pursuit of (machine) learning methodology with the proper amount of statistical theory, the first great book that came along is The Elements of Statistical Learning. It was chosen for this post not only because of its fame and its famous authors (Hastie, Tibshirani, and Friedman) but also because of a personal story. In addition, the 2nd edition, which contains the most up-to-date, state-of-the-art material, was released recently.

First, the book website:

The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman

You’ll find examples, R code, relevant publications, and the plots used in the textbook.

Second, I want to tell how I learned about this book before its first edition was published. Everyone has a small moment of meeting someone very famous. Mine was shaking hands with President Clinton in 2000. I still remember the moment vividly because I really wanted to tell him that ice cream was dripping on his nice suit, but the top-of-the-line bodyguards blocked my attempt to point out the dripping ice cream after the handshake. No matter the context, shaking hands with one of the great presidents is a memorable thing. Yet it is not my most cherished moment, because of the dripping ice cream and the scary bodyguards. My most cherished moment of meeting someone famous is a half-hour conversation with the late Prof. Leo Breiman (click for my two postings about him), author of a probability textbook, creator of CART, and one of the foremost pioneers in machine learning.

The conclusion of that conversation, after I explained my ideas for applying statistics to astronomical data and he offered advice on each problem, was a recommendation for a book soon to be published. I was not yet capable of understanding all the statistics involved, so his pointer to this forthcoming book was the most relevant and apt answer he could have given at that time.

This conversation happened during the 3rd Statistical Challenges in Modern Astronomy (SCMA). Not long after I began my graduate study in statistics, I had an opportunity to assist the conference organizer, my advisor Dr. Babu, and to do some chores during the conference. Having by chance read Murtagh’s book on multivariate data analysis, I wanted to speak to him; beyond that, I had no desire to approach the renowned speakers and attendees. Frankly, I had no idea who was who at the conference, and only a few years later did I realize that it drew many famous people, at a density higher than any conference I have attended since. Who would have imagined, at that time, that I could have a personal conversation with Prof. Breiman? I have seen often enough that famous professors draw trains of people at conferences; getting a chance to chat for even a few seconds is really hard, and tall, strong people always push someone small like me away.

The story goes like this: on a sunny, perfect early-summer afternoon, he was taking a break for a cigar and I had finished my errands for the session. With not much to do until the end of the session, I decided to take some fresh air, and I spotted him enjoying his cigar. The only catch was that I didn’t know he was the man behind CART and a founder of statistical machine learning; from his talk in the previous session I knew only that he was a statistician who did data mining on galaxies. So I asked him whether I could join him and ask some questions related to ideas I had. One topic I wanted to discuss was the classification of SN light curves; in the astronomical textbooks of that time, there are Types I & II, and Type I has subcategories Ia, Ib, and Ic (later I heard there was once a Type III). The challenge is that the observations do not arrive at equal intervals. There were more data mining topics, and the conversation went on for a while. In the end, he recommended a book that would be published soon.

Having such a story, the privilege of talking to the late Prof. Breiman at a very unique meeting, SCMA, before knowing the book’s fame, this book became one of my favorites. The book did indeed become popular; around that time it was almost the only book discussing statistical learning, and therefore an excellent textbook for introducing statistics to engineers and machine learning to statisticians. In the meantime, statistical learning gained popularity in many disciplines rich in data sets and eager for learning with the aid of machines, and books and journals on machine learning, data mining, and knowledge discovery (KDD) flourished. I was delighted to see the 2nd edition on the market to bridge the gap over the years.

I thank him for sharing his cigar time, probably his short but precious free time for contemplation, with me. I thank him for his patience in spending time with such an ignorant girl with a foreign English accent. And I thank him for introducing a book that would become a bible in the statistical learning community within a couple of years (I felt proud of myself for having access to the book before people knew about it). Perhaps astronomers cannot share the joys I drew from how I encountered the book, who introduced it, whether it was used in a course, how often it is cited, and so on. But I am sure it will narrow the gap between how astronomers think about data mining (preprocessing, pipelining, and building catalogs) and how statisticians treat data mining. The newly released 2nd edition should narrow the gap further and help astronomers coin brilliant learning algorithms specific to astronomical data. [The END]

—————————– Here, I patch in my scribbles about the book.

What distinguishes this book from other machine learning books is not only that the authors are big figures in statistics but also that the fundamentals of statistics and probability are discussed throughout all chapters. Most machine learning books introduce elementary statistics and probability in Chapter 2 and never return to statistical basics in later chapters; generally, empirical procedures, computer algorithms, and their results are presented without the underlying statistical theory.

You might want to check the book’s website for the data sets if you want to try some of the ideas described there:
The Elements of Statistical Learning
In addition to its historical footprint in the field of statistical learning, I’m sure some astronomers will want to check out topics in the book. It will help replace some astronomical data analysis methods, soon to celebrate their centennials, with state-of-the-art methods that can cope with modern data.

This new edition reflects some of the evolution in statistical learning, whereas the first edition was an excellent harbinger of the field. Page numbers refer to the 2nd edition.

[p.28] Suppose in fact that our data arose from a statistical model $Y = f(X) + \varepsilon$, where the random error $\varepsilon$ has $E(\varepsilon) = 0$ and is independent of $X$. Note that for this model, $f(x) = E(Y|X=x)$, and in fact the conditional distribution $\Pr(Y|X)$ depends on $X$ only through the conditional mean $f(x)$.
The additive error model is a useful approximation to the truth. For most systems the input-output pairs $(X, Y)$ will not have a deterministic relationship $Y = f(X)$. Generally there will be other unmeasured variables that also contribute to $Y$, including measurement error. The additive model assumes that we can capture all these departures from a deterministic relationship via the error $\varepsilon$.

How statisticians envision “model” and “measurement errors” is quite different from how astronomers envision their “model” and “measurement errors,” although under the additive error model the two views match, thanks to the properties of the Gaussian/normal distribution. Still, the chicken-or-egg dilemma exists prior to any statistical analysis.
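
To make the additive error model concrete, here is a minimal simulation sketch (the function f and the noise level are invented for illustration): it checks numerically that averaging Y at a fixed x recovers f(x), i.e., that $f(x) = E(Y|X=x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # a hypothetical "true" regression function, chosen only for illustration
    return np.sin(2 * np.pi * x)

# draw many Y's at a fixed x; the additive errors have mean zero
x0 = 0.3
e = rng.normal(loc=0.0, scale=0.5, size=100_000)
y = f(x0) + e

# the empirical conditional mean approaches f(x0)
print(f"f(x0)      = {f(x0):.4f}")
print(f"mean(Y|x0) = {y.mean():.4f}")
```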

[p.30] Although somewhat less glamorous than the learning paradigm, treating supervised learning as a problem in function approximation encourages the geometrical concepts of Euclidean spaces and mathematical concepts of probabilistic inference to be applied to the problem. This is the approach taken in this book.

I strongly recommend reading Chapter 3, Linear Methods for Regression. In astronomy there are many important coefficients obtained from regression models, from the Hubble constant to absorption corrections (temperature and magnitude conversions are another example), and it often seems these relations are explained only via OLS (ordinary least squares) with the homogeneous error assumption. Yet books on regression and linear models are generally not thin: as much diversity as exists in data sets, so much methodology, theory, and assumption exists to reflect that diversity. One might like to study the statistical properties of these indicators based on mixture and hierarchical modeling; some inference, say on a population proportion, could then verify hypotheses in cosmology in an indirect way. Understanding what regression analysis assumes, and how statisticians’ efforts have made these methods more robust, more interpretable, and more reflective of reality, would change the habit of forcing E(Y|X)=aX+b models onto data that show correlation (not causality).
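
As a concrete companion to this paragraph, here is a minimal OLS sketch on synthetic data under exactly the homogeneous (homoscedastic) Gaussian error assumption discussed above; the slope, intercept, and noise scale are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic data: Y = a*X + b + e with homoscedastic Gaussian errors
a_true, b_true = 2.0, -1.0              # hypothetical coefficients
x = rng.uniform(0, 10, size=200)
y = a_true * x + b_true + rng.normal(scale=1.5, size=x.size)

# OLS via least squares on the design matrix [x, 1]
X = np.column_stack([x, np.ones_like(x)])
(a_hat, b_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")
```

Chapter 3 then shows what to do when this textbook setting fails, via shrinkage (ridge, lasso) and related extensions.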

More on Space Weather
http://hea-www.harvard.edu/AstroStat/slog/2009/more-on-space-weather/
Tue, 22 Sep 2009 17:03:11 +0000 hlee

Thanks to a Korean solar physicist[1] I was able to gather the following websites and some relevant information on space weather forecasting in action, not limited to literature or toy data.


These seem quite informative, and I believe more statisticians and data scientists (in signal and image processing, machine learning, computer vision, and data mining) could easily collaborate with solar physicists. All the complexity, as a matter of fact, lies in processing the data to be fed into (machine/statistical) learning algorithms and in defining the objectives of the learning. Once those are settled, one can readily apply numerous methods from the field to these time-varying solar images.

I’m writing this short post because I finally found the interesting articles I had collected for my previous post on Space Weather. After finding them and scanning through, I realized that methodology-wise they have made only baby steps. You’ll see a limited number of keywords repeated, even though there is a humongous community of scientists and engineers in knowledge discovery and data mining.

Note that the objectives of these studies are quite similar. They describe machine learning for the purpose of automating the detection of features of interest on the Sun and, possibly, forecasting the related phenomena that affect our own atmosphere due to the associated solar activity.

  1. Automated Prediction of CMEs Using Machine Learning of CME-Flare Associations by Qahwaji et al. (2008) in Solar Phys. vol. 248, pp. 471-483.
  2. Automatic Short-Term Solar Flare Prediction Using Machine Learning and Sunspot Associations by Qahwaji and Colak (2007) in Solar Phys. vol. 241, pp. 195-211.

    Space weather is defined by the U.S. National Space Weather Program (NSWP) as “conditions on the Sun and in the solar wind, magnetosphere, ionosphere, and thermosphere that can influence the performance and reliability of space-borne and ground-based technological systems and can endanger human life or health”

    Personally, I think the “jackknife” section needs to be replaced with “cross-validation.”

  3. Automatic Detection and Classification of Coronal Mass Ejections by Qu et al. (2006) in Solar Phys. vol. 237, pp. 419-431.
  4. Automatic Solar Filament Detection Using Image Processing Techniques by Qu et al. (2005) in Solar Phys. vol. 228, pp. 119-135.
  5. Automatic Solar Flare Tracking Using Image-Processing Techniques by Qu et al. (2004) in Solar Phys. vol. 222, pp. 137-149.
  6. Automatic Solar Flare Detection Using MLP, RBF, and SVM by Qu et al. (2003) in Solar Phys. vol. 217, pp. 157-172 (a minimal SVM sketch follows this list).
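
Since nearly all of the papers above rely on Support Vector Machines, here is a minimal SVM classification sketch with scikit-learn; the two features and the flare/quiet labels are entirely synthetic stand-ins for the image features used in the papers.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# synthetic two-feature data: flare (1) vs. quiet (0) regions
n = 500
X = rng.normal(size=(n, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# an RBF-kernel SVM, with features standardized first
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")
```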

I’d like to add a survey paper on another type of learning method, beyond the Support Vector Machine (SVM) used in almost all the articles above. Luckily, this survey paper happened to address my concern about the “practice of background subtraction” in high energy astrophysics.

A Survey of Manifold-Based Learning Methods by Huo, Ni, and Smith
[Excerpt] What is manifold-based learning?
It is an emerging and promising approach to nonparametric dimension reduction. The article reviews principal component analysis (PCA), multidimensional scaling (MDS), generative topographic mapping (GTM), locally linear embedding (LLE), ISOMAP, Laplacian eigenmaps, Hessian eigenmaps, and local tangent space alignment (LTSA). Apart from these revisits and comparisons, the survey is useful for understanding the danger of background subtraction: homogeneity does not imply a constant background to be subtracted, and subtracting one often causes negative source observations.
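
For readers who want to try one of the reviewed methods, here is a minimal sketch using ISOMAP on scikit-learn’s toy S-curve data; LLE or Laplacian eigenmaps would slot into the same two lines.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# a 3-D "S"-shaped manifold embedded in R^3
X, t = make_s_curve(n_samples=1000, random_state=0)

# nonlinear dimension reduction down to the 2-D manifold coordinates
embedding = Isomap(n_neighbors=10, n_components=2)
X_2d = embedding.fit_transform(X)
print(X_2d.shape)   # (1000, 2)
```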

More collaboration among multiple disciplines is desired in this relatively new field. For me, it is one of the best data and information scientific fields of the 21st century, and any progress will be beneficial to humankind.

  1. I must acknowledge him for his kindness and patience. He was my Wikipedia for questions while I was studying the Sun.
[ArXiv] Cross Validation
http://hea-www.harvard.edu/AstroStat/slog/2009/arxiv-cross-validation/
Wed, 12 Aug 2009 23:03:43 +0000 hlee

Statistical resampling methods are rather unfamiliar among astronomers. Bootstrapping can be an exception, but I feel it is still underrepresented. Seeing a recent review paper on cross validation on [arXiv] that describes the basic notions in theoretical statistics, I couldn’t resist mentioning it here. Cross validation has been used in various statistical fields such as classification, density estimation, model selection, and regression, to name a few.

[arXiv:math.ST:0907.4728]
A survey of cross validation procedures for model selection by Sylvain Arlot

Nonetheless, I’ll not review the paper itself beyond a few quotes:

- CV is a popular strategy for model selection and algorithm selection.
- Compared to the resubstitution error, CV avoids overfitting because the training sample is independent of the validation sample.
- As noticed in the early ’30s by Larson (1931), training an algorithm and evaluating its statistical performance on the same data yields overoptimistic results.

There are books on statistical resampling methods covering more general topics, not limited to model selection. Instead, I decided to do a little search on how CV is used in astronomy. These are the ADS search results: more publications than I expected.

One can easily grasp that many adopted CV in the machine learning context, but the application of CV and bootstrapping is not limited to machine learning. As Arlot’s title says, CV is used for model selection. When it comes to model selection in high energy astrophysics, the standard procedure is not CV but reduced chi-square measures and eyeballing the fitted curve. Hopefully a renovated model selection procedure via CV, or another statistically robust strategy, will soon challenge the reduced chi-square and the eyeballing. On the other hand, I doubt it will come soon; remember, eyes are the best classifier, so it won’t be an easy task.
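
To make the contrast with reduced chi-square concrete, here is a minimal K-fold CV sketch for model selection (choosing a polynomial degree) on synthetic data; the candidate degrees and the noise level are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=200)[:, None]
y = np.sin(3 * x[:, 0]) + rng.normal(scale=0.3, size=200)

# choose the polynomial degree by 5-fold CV instead of eyeballing the fit
for degree in (1, 3, 5, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, x, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: CV mean squared error = {mse:.3f}")
```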

space weather
http://hea-www.harvard.edu/AstroStat/slog/2009/space-weather/
Thu, 21 May 2009 22:55:26 +0000 hlee

Among the billions of objects in our Galaxy outside the Earth, it is our Sun that draws the most attention from astronomers. These astronomers go by the name of solar physicists, and they enjoy the most abundant data, including 400 years of sunspot counts. Their joy originates not only from the fascinating, active, and unpredictable character of the Sun but also from its influence on our daily lives. Related to the latter, studying the conditions on the Sun is sometimes called space weather forecasting.

With my limited knowledge, I cannot lay out all the important aspects of solar physics, of climate changes due to solar activity (not limited to our lower atmosphere but covering the space between the Sun and the Earth), and of the most important space weather issues of recent years. I can only emphasize that, compared to earth climate/atmosphere studies or meteorology, the contribution from statisticians to space weather is almost nonexistent. I have frequently witnessed crude eyeballing, instead of statistics, in analyzing data and quantifying images in solar physics. Luckily, a few articles discussing statistics can be found, and my discussion is focused on these papers, while leaving room for solar physicists to chip in on how space weather is dealt with statistically when collaborating with statisticians.

By the way, I have no intention of degrading “eyeballing” in astronomers’ data analysis. The statistical methods under EDA (exploratory data analysis), whose counterpart is CDA (confirmatory data analysis, or statistical inference), are basically “eyeballing” with technical jargon and foundations from probability theory. EDA is important for doubting every step in astronomers’ chi-square methods; without those diagnostics and visualizations, choosing the right statistical strategies for real data sets is almost impossible. I used “crude” because, instead of using edge detection algorithms, edges are drawn by hand via eyeballing. Another disclaimer is that there are brilliant image processing/computer vision strategies developed by astronomers, which I am not going to present; I am focusing on the small areas of statistics related to space weather and its forecasting.

Statistical Assessment of Photospheric Magnetic Features in Imminent Solar Flare Predictions by Song et al. (2009), Solar Phys. v. 254, p. 101.

Their forte is “logistic regression,” a statistical model that is not often used in astronomy. It appears when modeling binary responses (or categorical responses like head or tail; agree, neutral, or disagree) with a bunch of predictors, i.e., classification with multiple features or variables (astronomers might like to replace these lexicons with “parameters”). The issue of variable selection is also discussed, e.g., L_{gnl} being the most powerful predictor. Their training set is carefully discussed from the solar physics perspective. Against their claim that they used “logistic regression” to predict solar flares for the first time, there was another paper a few years back discussing “logistic regression” to predict geomagnetic storms or coronal mass ejections; their statement can only stand if flares and CMEs are exclusive events.
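
For readers unfamiliar with the model, here is a minimal logistic regression sketch in the spirit of these papers; the two predictors standing in for photospheric magnetic features, and the flare labels, are synthetic, not the measures the authors actually use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

# synthetic predictors standing in for photospheric magnetic features
n = 1000
X = rng.normal(size=(n, 2))

# binary response (flare within some window: 1, quiet: 0) from a known logit
logit = 1.5 * X[:, 0] - 0.8 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

clf = LogisticRegression().fit(X, y)
print("coefficients:", clf.coef_)              # roughly recover (1.5, -0.8)
print("P(flare):", clf.predict_proba(X[:3])[:, 1])
```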

The Challenge of Predicting the Occurrence of Intense Storms by Srivastava (2006), J. Astrophys. Astr. v. 27, pp. 237-242.

The probability of storm occurrence is the response in the logistic regression model, whose predictors are CME-related variables, including the latitude and longitude of the CME’s origin, and interplanetary inputs like shock speed, ram pressure, and solar wind measures. Cross-validation was performed. A comment that the initial speed of a CME might be the most reliable predictor is given, but there is no extensive discussion of variable/model selection.

Personally speaking, both publications[1] could be more statistically rigorous in discussing the various challenges of logistic regression, from the statistical learning/classification perspective and from the model/variable selection aspect, to define better-behaved and statistically rigorous classifiers.

Often we plan our days according to the weather forecast (although we grumble that weather forecasts are not right, almost everyone relies on the numbers and predictions from the weather people). Although it may not be 100% reliable, those forecasts make our lives easier, and more reliable models are under development. On the other hand, forecasting space weather with the help of statistics is as yet unthinkable. However, scientists and engineers understand that reliable space weather models help in planning space missions and switching satellites into safe mode. At least I know that, with flare or CME forecasting models in place, fewer scientists and engineers would need to wake up in the middle of the night because of otherwise unforeseen storms from the Sun.

  1. I thought I had collected more papers under “statistics” and “space weather,” not just these two; a few more are probably buried somewhere. It’s hard to believe such a rich field is untouched by statisticians. I would very much appreciate your kindly forwarding relevant papers; I will gradually add them.
An excerpt from …
http://hea-www.harvard.edu/AstroStat/slog/2009/an-excerpt-from/
Thu, 26 Feb 2009 20:07:13 +0000 hlee

I’ve been complaining about how one can do machine learning on solar images without a training set (see my comment at the big picture). On the other hand, I’m also aware of challenges in astronomy: data (images) cannot be transformed freely and fed into standard machine learning algorithms, and tailoring data pipelining, cleaning, and processing to currently existing vision algorithms may not be achievable. The hope of automating the detection/identification of interesting features (e.g., flares and loops) and the forecasting of events on the surface of the Sun is, for now, only a dream. Even though the image data stream has reached tsunami levels, we may have to depend on human eyes to comb out interesting features on the Sun until a new paradigm arrives: automated feature identification algorithms based on a single image, i.e., without a training set. The good news is that human eyes have done a superb job!

From A Survey of the Statistical Theory of Shape by David G. Kendall, Statistical Science, Vol. 4, No. 2 (May 1989), pp. 87-99:
It is well known that no classical test for two-dimensional stochastic point processes can match the performance of the human eye and brain in detecting the presence of improbably large holes in the realized pattern of points. This fact has generated a great deal of research in the last few years, especially in connection with the large “voids” and long “strings” that the eye sees (or declares that it sees) in maps of the Shane and Wirtanen catalogue of positions of galaxies. Astronomers are interested in (i) whether these phenomena are sufficiently extreme to require explanation, and if so (ii) whether any of the various “model” universes now in vogue can be said to display them to just the same degree.

[Book] pattern recognition and machine learning
http://hea-www.harvard.edu/AstroStat/slog/2008/pml/
Tue, 16 Sep 2008 19:20:43 +0000 hlee

A nice book by Christopher Bishop.
While reading abstracts and papers from astro-ph, I saw many applications of algorithms from pattern recognition and machine learning (PRML). The frequency will only increase as large-scale survey projects multiply, so recommending a good textbook or reference in the field seems timely.

Survey and population studies generally produce large data sets. Any discussion of individual objects from such a survey indicates that those objects are outliers with respect to the objects in the catalog created from the survey; these outliers deserve strong spotlights, in contrast to the notion that outliers are useless. Beyond studies of outliers, survey and population studies generally involve machine learning and pattern recognition, or supervised and unsupervised learning, or classification and clustering, or statistical learning. Whatever jargon you choose, the book overviews the most popular machine learning methods extensively, with examples, nice illustrations, and concise math. Once you understand the characteristics of a catalog, such as its dimensions, sample size, independent and dependent variables, missing values, sampling (volume limited, magnitude limited, incompleteness), measurement errors, scatter plots, and so on, the book can offer proper approaches, in a statistical sense of summarizing data, for the second step of summarizing the large data set as a whole, based on your data analysis objective.

Click here to access the book website for various resources, including a few book chapters, retailer links, examples, and solutions. Here is one of the reviews you can check.

A lesson from reading arxiv/astro-ph during the past year is that astronomers must become interdisciplinary, particularly those working on surveys and creating catalogs. From the information retrieval viewpoint, some rudimentary education in pattern recognition and machine learning is a must, just as I personally think basic statistics and probability theory should be offered to young astronomers (like the astrostatistics summer school at Penn State). While attending graduate school, I saw non-stat majors taking statistics classes, but rarely students from astronomy or physics. To confirm this hypothesis, I took computational physics to learn how astronomers and physicists handle data with uncertainty. Although it was one of my favorite classes, the course was quite far from statistics (game theory was the most statistically relevant subject). Hence, I think not many astronomy departments offer practical statistics or machine learning courses, and therefore recommending good, modern textbooks on (statistical) data analysis can benefit self-teaching astronomers. I hope my reasoning is on the right track.

[ArXiv] 5th week, Apr. 2008
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-5th-week-apr-2008/
Mon, 05 May 2008 07:08:42 +0000 hlee

Since I learned Hubble’s tuning fork[1] for the first time, I have wanted to classify galaxies based on their features (colors and spectra), for which semi-supervised learning seems most suitable, instead of the labor-intensive classification by human eyes. Ironically, at that time I knew neither that there is a field of computer science called machine learning, nor that statistics does such studies. Upon switching to statistics, with the hope of understanding the statistical packages implemented in IRAF and IDL and of learning better the contents of Numerical Recipes and Bevington’s book, ignorance was not the enemy; the accessibility of data was.

I’m glad to see that this week presented a paper I had dreamed of many years ago, in addition to other interesting papers. Nowadays I realize more and more that astronomical machine learning is not as simple as what we see in the machine learning and statistical computation literature, which typically adopts data sets from repositories whose characteristics have been well known for many years (for example, the famous iris data; there are toy data sets and mock catalogs, no shortage of data sets with public characteristics). As the long list of authors indicates, machine learning on massive astronomical data sets was never meant to be a little girl’s dream. With a bit of my sentiment, I offer this week’s list:

  • [astro-ph:0804.4068] S. Pires et al.
    FASTLens (FAst STatistics for weak Lensing) : Fast method for Weak Lensing Statistics and map making
  • [astro-ph:0804.4142] M.Kowalski et al.
    Improved Cosmological Constraints from New, Old and Combined Supernova Datasets
  • [astro-ph:0804.4219] M. Bazarghan and R. Gupta
    Automated Classification of Sloan Digital Sky Survey (SDSS) Stellar Spectra using Artificial Neural Networks
  • [gr-qc:0804.4144] E. L. Robinson, J. D. Romano, A. Vecchio
    Search for a stochastic gravitational-wave signal in the second round of the Mock LISA Data challenges
  • [astro-ph:0804.4483] C. Lintott et al.
    Galaxy Zoo : Morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey
  • [astro-ph:0804.4692] M. J. Martinez Gonzalez et al.
    PCA detection and denoising of Zeeman signatures in stellar polarised spectra
  • [astro-ph:0805.0101] J. Ireland et al.
    Multiresolution analysis of active region magnetic structure and its correlation with the Mt. Wilson classification and flaring activity

A relevant slog post on machine learning for galaxy morphology can be found at svm and galaxy morphological classification.

<Added: 3rd week May 2008> [astro-ph:0805.2612] S. P. Bamford et al.
Galaxy Zoo: the independence of morphology and colour

  1. Wikipedia link: Hubble sequence
[ArXiv] 4th week, Apr. 2008
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-4th-week-apr-2008/
Sun, 27 Apr 2008 15:29:48 +0000 hlee

The last paper in the list discusses MCMC for time series analysis, applied to sunspot data. There are six additional papers about statistics and data analysis from the week.

  • [astro-ph:0804.2904] M. Cruz et al.
    The CMB cold spot: texture, cluster or void?

  • [astro-ph:0804.2917] Z. Zhu, M. Sereno
    Testing the DGP model with gravitational lensing statistics

  • [astro-ph:0804.3390] Valkenburg, Krauss, & Hamann
    Effects of Prior Assumptions on Bayesian Estimates of Inflation Parameters, and the expected Gravitational Waves Signal from Inflation

  • [astro-ph:0804.3413] N. Ball et al.
    Robust Machine Learning Applied to Astronomical Datasets III: Probabilistic Photometric Redshifts for Galaxies and Quasars in the SDSS and GALEX (Another related publication [astro-ph:0804.3417])

  • [astro-ph:0804.3471] M. Cirasuolo et al.
    A new measurement of the evolving near-infrared galaxy luminosity function out to z~4: a continuing challenge to theoretical models of galaxy formation

  • [astro-ph:0804.3475] A.D. Mackey et al.
    Multiple stellar populations in three rich Large Magellanic Cloud star clusters

  • [stat.ME:0804.3853] C. Röver, R. Meyer, N. Christensen
    Modelling coloured noise (MCMC & sunspot data)
Signal Processing and Bootstrap
http://hea-www.harvard.edu/AstroStat/slog/2008/signal-processing-and-bootstrap/
Wed, 30 Jan 2008 06:33:25 +0000 hlee

Astronomers have developed their ways of processing signals almost independently of, though sometimes collaboratively with, engineers, although the fundamental goal of signal processing is the same: extracting information. Doubtless, these two parallel roads of astronomers and engineers have pointed in opposite directions: one toward the sky and the other to the earth. Nevertheless, without an intensive argument, we could say that statistics has somewhat played the medium of signal processing for both scientists and engineers. This particular issue of IEEE Signal Processing Magazine may shed light for astronomers interested in signal processing and statistics outside the astronomical society.

IEEE Signal Processing Magazine Jul. 2007 Vol 24 Issue 4: Bootstrap methods in signal processing

This link shows the table of contents and provides links to articles; however, access to the papers requires an IEEE Xplore subscription via a library or an individual IEEE membership. Here, I’d like to attempt to introduce some of the articles and tutorials.

Special topic on bootstrap:
The guest editors (A.M. Zoubir & D.R. Iskander)[1] open the issue in their editorial, Bootstrap Methods in Signal Processing, by providing the rationale: the occasionally invalid Gaussian noise assumption and the consequent complex modeling. A practical alternative has been Monte Carlo simulation, but the cost of repeating experiments is problematic. The suggested alternative is the bootstrap, which provides tools for designing detectors for various signals subject to noise or interference from unknown distributions. The bootstrap is described as a computer-intensive tool for answering inferential questions, and this issue serves as a set of tutorials introducing this computationally intensive statistical method to the signal processing community.

The first tutorial is written by the two guest editors: Bootstrap Methods and Applications. It begins with a list of bootstrap methods and emphasizes their resilience, discusses how many bootstrap samples are needed to keep the simulation (Monte Carlo) error small relative to the statistical error, and covers sampling methods for dependent data, with real examples. The flowchart in Fig. 9 provides a summary guideline for how to use the bootstrap methods.
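
As a concrete companion to the tutorial, here is a minimal percentile-bootstrap sketch for a confidence interval on a mean; the data, the statistic, and the confidence level are all placeholders, not anything taken from the magazine issue.

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.exponential(scale=2.0, size=100)   # hypothetical non-Gaussian measurements

# resample with replacement and recompute the statistic many times
n_boot = 5000
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(n_boot)
])

# 95% percentile-bootstrap confidence interval
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {data.mean():.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```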

The title of the second tutorial is Jackknifing Multitaper Spectrum Estimates (D.J. Thomson). It introduces the jackknife and multitaper estimates of spectra, and applies the former to the latter with real data sets. The author adds the reasons for his preference for the jackknife over the bootstrap and discusses the underlying assumptions of resampling methods.
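
And, for comparison with the bootstrap sketch above, a minimal jackknife sketch that leaves out one observation at a time to estimate the standard error of the same statistic (same synthetic-data caveats as above):

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.exponential(scale=2.0, size=100)
n = data.size

# leave-one-out replicates of the mean
jack = np.array([np.delete(data, i).mean() for i in range(n)])

# jackknife standard error: sqrt((n-1)/n * sum((theta_i - theta_bar)^2))
se = np.sqrt((n - 1) / n * ((jack - jack.mean()) ** 2).sum())
print(f"jackknife SE of the mean = {se:.4f}")
```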

Instead of listing all articles from the special issue, a few astrostatistically notable articles are chosen:

  • Bootstrap-Inspired Techniques in Computational Intelligence (R. Polikar) explains the bootstrap for estimating errors; the algorithms of bagging, boosting, and AdaBoost; and other bootstrap-inspired techniques in ensemble systems, with a discussion of missing data.
  • Bootstrap for Empirical Multifractal Analysis (H. Wendt, P. Abry & S. Jaffard) explains block bootstrap methods for dependent data, bootstrap confidence limits, and bootstrap hypothesis testing, in addition to multifractal analysis. Due to my personal lack of familiarity with wavelet leaders, instead of paraphrasing, I intentionally replace the article’s conclusion with direct quotes:

    First, besides being mathematically well-grounded with respect to multifractal analysis, wavelet leaders exhibit significantly enhanced statistical performance compared to wavelet coefficients. … Second, bootstrap procedures provide practitioners with satisfactory confidence limits and hypothesis test p-values for multifractal parameters. Third, the computationally cheap percentile method achieves already excellent performance for both confidence limits and tests.

  • Wild Bootstrap Test (J. Franke & S. Halim) discusses residual-based nonparametric tests and the wild bootstrap for regression models, applicable to signal/image analysis. Their test checks for differences between two irregular signals/images.
  • Nonparametric Estimates of Biological Transducer Functions (D.H. Foster & K. Zychaluk): I like the part where they discuss the generalized linear model (GLM), which is useful for extending the techniques of model fitting/estimation in astronomy beyond the Gaussian and least squares. They also mention that the bootstrap is simpler for getting confidence intervals.
  • Bootstrap Particle Filtering (J.V. Candy): a very pleasant read on Bayesian signal processing and the particle filter. It overviews MCMC and state space models, and explains resampling as a remedy for the shortcomings of importance sampling in signal processing.
  • Compressive Sensing (R.G. Baraniuk):

    A lecture note presents a new method to capture and represent compressible signals at a rate significantly below the Nyquist rate. This method employs nonadaptive linear projections that preserve the structure of the signal;

I do wish this brief summary assists you in selecting a few interesting articles.

  1. They wrote a book on the bootstrap and its application in signal processing.
On-line Machine Learning Lectures and Notes
http://hea-www.harvard.edu/AstroStat/slog/2008/on-line-machine-learning-lectures/
Thu, 03 Jan 2008 18:44:14 +0000 hlee

I found this website a while ago but hadn’t checked it until now. The lectures are quite useful in their contents (the pages of the lecture notes are even flipped properly for you while the lecture is given). The increasing popularity of machine learning among astronomers will find more use for such lectures. If you have time to learn machine learning and other related subjects, please visit http://videolectures.net/. Links classified by subject are just a click away.

Mathematics:
Mathematics>Operations Research (lectures by Gene Golub, Professor at Stanford, and Lieven Vandenberghe, one of the authors of Convex Optimization – a link to the pdf file)
Mathematics>Statistics (including Peter Bickel, Professor at UC Berkeley).

Computer Science:
Computer Science>Bioinformatics
Computer Science>Data Mining
Computer Science>Data Visualisation
Computer Science>Image Analysis
Computer Science>Information Extraction
Computer Science>Information Retrieval
Computer Science>Machine Learning
Computer Science>Machine Learning>Bayesian Learning
Computer Science>Machine Learning>Clustering
Computer Science>Machine Learning>Neural Networks
Computer Science>Machine Learning>Pattern Recognition
Computer Science>Machine Learning>Principal Component Analysis
Computer Science>Machine Learning>Semi-supervised Learning
Computer Science>Machine Learning>Statistical Learning
Computer Science>Machine Learning>Unsupervised learning

Physics:
Physics (You’ll see Randall Smith)

[In the near future, some selected lectures with summary notes might be suggested; nevertheless, your recommendations are most welcome.]
