The AstroStat Slog » model

[MADS] logistic regression

hlee — Tue, 13 Oct 2009 20:15:08 +0000

Although some time has passed since my post on space weather, in which I noted that logistic regression is used for prediction, it still seems true that logistic regression is rarely used in astronomy. Where it has been used, it may have served a similar purpose not under the same statistical jargon but under Bayesian modeling procedures.

Maybe some astronomers want to check out this versatile statistical method (wiki:logistic regression) to see whether it suits their data, in order to model or predict observation rates, unobserved rates, undetected rates, detected rates, absorbed rates, and so on in terms of what is observed plus additional or external observations, knowledge, and theories. I wonder what it would look like if the following were fit using logistic regression: detection limits, Eddington bias, incompleteness, absorption, differential emission measures, opacity, etc., plus brute-force Monte Carlo simulations emulating likely data to be fit. The responses would then be the probability of observed vs. not observed as a function of redshift, magnitudes, counts, flux, wavelength/frequency, and other measurable or latent variables.

My simple reasoning, that astronomers observe only partially and will never have a complete sample, led me to assume that logistic regression would appear in the astronomical literature rather frequently. Against my bet, it was [MADS]. All statistical software has packages and modules for logistic regression; therefore, once you have a data set, application is very straightforward.

———————————[added]
Although logistic regression models are given in many good tutorials, literature, or websites, it might be useful to have a simple but intuitive form of logistic regression for sloggers.

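When you have binary responses, such as metal-poor star (Y=1) vs. metal-rich star (Y=2), and predictors such as colors, distance, parallax, precision, and other columns in catalogs (X is a matrix comprising these variables), the logistic regression model takes the standard form
log[ Pr(Y=1|X) / (1 - Pr(Y=1|X)) ] = β0 + Xβ,
where β0 is the intercept and β is the vector of coefficients.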
Just as astronomers fit a linear regression model to get the intercept and slope, the same approach yields the intercept and coefficients of a logistic regression model.
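
To make the analogy with linear regression concrete, here is a minimal sketch in Python, assuming only NumPy. It simulates a toy detection experiment (detected vs. not detected as a function of magnitude) and estimates the intercept and slope by Newton-Raphson. The function name fit_logistic, the magnitude range, and the "true" coefficients are illustrative assumptions, not values from any survey or catalog.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit log[p / (1 - p)] = b0 + X b by Newton-Raphson, where p = Pr(y=1 | X)."""
    A = np.column_stack([np.ones(len(X)), X])    # prepend the intercept column
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(A @ beta)))    # current fitted probabilities
        W = p * (1.0 - p)                        # Bernoulli variances (weights)
        grad = A.T @ (y - p)                     # gradient of the log-likelihood
        hess = A.T @ (A * W[:, None])            # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Toy Monte Carlo data (assumed values): detection probability drops with magnitude.
rng = np.random.default_rng(0)
mag = rng.uniform(18.0, 26.0, size=5000)
x = mag - 22.0                                   # center the predictor for stability
true_b0, true_b1 = 0.0, -1.5                     # hypothetical intercept and slope
p_true = 1.0 / (1.0 + np.exp(-(true_b0 + true_b1 * x)))
detected = (rng.random(mag.size) < p_true).astype(float)

beta = fit_logistic(x, detected)
mag_half = 22.0 - beta[0] / beta[1]              # magnitude at 50% detection probability
print("intercept, slope:", beta)
print("50% completeness magnitude:", mag_half)
```

With several predictors, X would simply be a two-dimensional array with one column per catalog quantity; the fitting step is unchanged.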

[MADS] multiscale modeling

hlee — Thu, 11 Dec 2008 19:46:05 +0000

A few scientists in our group work on estimating the intensities of gamma-ray observations from sky surveys. This work differs from typical image processing, which mostly concerns the point estimation of intensity at each pixel location and the size of the overall white-noise-type error. In image processing you will often notice two assumptions: orthogonality between errors and sources, and white noise. These assumptions are typical features of image processing utilities and modules. CHASC scientists, on the other hand, address more general and broad statistical inference problems in estimating the intensity map, such as the intensity uncertainty at each point and the scientifically informative display of the intensity map with its uncertainty, according to the Poisson count model and constraints from physics and the instrument. This is where the field of multiscale modeling comes in.
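
As a rough illustration of what "intensity with uncertainty under a Poisson count model" can mean, here is a minimal sketch in Python. It is not the CHASC group's method: it is a per-pixel conjugate Gamma-Poisson update under a flat prior (an assumption), followed by a simple 2x2 block sum to show how the relative uncertainty shrinks at a coarser scale. The function names and the toy image are hypothetical.

```python
import numpy as np

def gamma_poisson_posterior(counts, exposure=1.0, prior_shape=1.0, prior_rate=0.0):
    """Posterior mean and sd of a Poisson intensity, pixel by pixel.

    With a Gamma(prior_shape, prior_rate) prior, the posterior is
    Gamma(prior_shape + counts, prior_rate + exposure). The defaults
    correspond to a flat (improper) prior, which is an assumption here.
    """
    shape = prior_shape + np.asarray(counts, dtype=float)
    rate = prior_rate + exposure
    return shape / rate, np.sqrt(shape) / rate

def coarsen(counts):
    """Sum counts over non-overlapping 2x2 blocks (image sides must be even)."""
    h, w = counts.shape
    return counts.reshape(h // 2, 2, w // 2, 2).sum(axis=(1, 3))

# Toy image (assumed values): faint background with one bright pixel.
rng = np.random.default_rng(0)
true_intensity = np.full((4, 4), 0.5)
true_intensity[1, 2] = 20.0
counts = rng.poisson(true_intensity)

mean_fine, sd_fine = gamma_poisson_posterior(counts)
coarse = coarsen(counts)
# Each coarse cell covers 4 pixels, so its exposure is 4 in per-pixel units.
mean_coarse, sd_coarse = gamma_poisson_posterior(coarse, exposure=4.0)
print("relative uncertainty, fine scale:\n", sd_fine / mean_fine)
print("relative uncertainty, coarse scale:\n", sd_coarse / mean_coarse)
```

Even a pixel with zero counts gets a nonzero posterior uncertainty here, which a plain square-root-of-counts error bar would not provide.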

As the post title [MADS] indicates, no abstract has the keywords multiscale modeling. It seems that only the jargon is missing from ADS, since “multiscale modeling” is practiced in astronomy. One example is our group’s work. Those CHASC scientists take Bayesian modeling approaches, which, to my knowledge, makes them unique in the astronomical community. However, I expected constructing an intensity map through statistical inference (estimation) or “multiscale modeling” to be popular among astronomers in recent years. Well, nothing came up in my abstract keyword search.

Wikipedia also gives a very brief description of multiscale modeling and emphasizes that it is a fairly new interdisciplinary topic (wiki:multiscale_modeling). TomLoredo kindly informed me of some relevant references from ADS after my post [MADS] HMM. He mentioned that his search words were Markov Random Fields, which appear in stochastic geometry and spatial statistics in addition to many applications in computer science. Besides these publications, he also gave me a nice comment on analyzing astronomical data, which I’d rather postpone for another discussion.

Quantifying Doubt and Confidence in Image “Deconvolution” by Connors, Alanna; van Dyk, D.; Chiang, J.; CHASC
Blind Bayesian restoration of adaptive optics telescope images using generalized Gaussian Markov random field models by Jeffs, Brian D.; Christou, Julian C.
Segmenting Chromospheric Images with Markov Random Fields (paper in SCMA II) by Turmon, Michael J.; Pap, Judit M.
Bayesian deconvolution methods in astronomy by Molina, R.; Katsaggelos, A. K.; Mateos, J.
Compound Gauss-Markov random fields for astronomical image restoration by Molina, R.; Katsaggelos, A. K.; Mateos, J.; Abad, J.
Markov random field applications in image analysis by Jain, A. K.; Nadabar, S. G. (I bet “Jain” is the author of many celebrated papers in image processing and machine learning. I often find that well-known computer scientists are involved in astronomical research.)

The reason I was not able to find these papers is that they were not published in the 4 major astronomical publications + Solar Physics. I limited my search this way because I was overwhelmed by the number of results from an unrestricted search that includes arXiv. (I wonder if there is a way to do exclusive searches in ADS by excluding arxiv:comp, arxiv:phys, arxiv:math, etc.) Thank you, Tom, for providing me with these references.

Please check out the CHASC website for more study results related to “multiscale modeling” from our group.

[Added] Nice tutorials related to Markov Random Fields (MRF) recommended by an expert in the field and a friend (all are pdfs).

A Quote on Model

hlee — Wed, 08 Oct 2008 05:31:46 +0000

In order to understand a learning procedure statistically it is necessary to identify two important aspects: its structural model and its error model. The former is most important since it determines the function space of the approximator, thereby characterizing the class of functions or hypothesis that can be accurately approximated with it. The error model specifies the distribution of random departures of sampled data from the structural model.

From Additive logistic regression: a statistical view of boosting by J. Friedman, T. Hastie, and R. Tibshirani (2000) Ann. Stat. Vol. 28(2), pp. 337-407.

I believe structural models represent relations among parameters and variables, as in mixture models, generalized linear models, Bayesian hierarchical models, and so on. Error models are generally marginalized to describe data fluctuations from the given structural model. For astronomers, structural models are often derived from physics and only error models are built on statistics, which is where confusion arises when a statistician and an astronomer talk about models. Too often I have seen simple Gaussian error models adopted in astronomy without verification. Error models, in general, are not standalone but are associated with structural models.

We know that encyclopedias of statistical models exist to explain both structural and error models. A small additional effort toward a statistically comprehensible quantification of astronomical structural models and their associated errors would lead to a cornucopia of error models beyond the simple Gaussian error model. I’m not saying that adopting a Gaussian is improper. The multivariate normal assumption serves well in many statistical data analysis problems, and estimators under the normal distribution are efficient. What I would like to emphasize is that statistics has built useful models and strategies beyond the Gaussian error model, which properly account for non-Gaussian cases revealed by various exploratory data analysis tools.
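
A small numerical sketch may help show why the choice of error model matters. The Python below, assuming only NumPy, keeps one structural model (a constant source rate) and compares two error models on simulated low counts: the Poisson likelihood, and a Gaussian chi-square whose per-bin variance is set to the observed count (floored at 1, a common workaround and an assumption here). The rate, the sample size, and the function names are made up for illustration.

```python
import numpy as np

def poisson_mle_constant(counts):
    """Maximum-likelihood constant rate under a Poisson error model: the sample mean."""
    return counts.mean()

def chi2_data_weighted_constant(counts):
    """Constant rate minimizing chi-square with per-bin variance = max(count, 1).

    The minimizer of sum((n_i - mu)**2 / var_i) is the inverse-variance weighted mean.
    """
    var = np.maximum(counts, 1.0)
    w = 1.0 / var
    return np.sum(w * counts) / np.sum(w)

# Simulated low-count data (assumed true rate of 3 counts per bin).
rng = np.random.default_rng(1)
true_rate = 3.0
counts = rng.poisson(true_rate, size=10000).astype(float)

mu_poisson = poisson_mle_constant(counts)
mu_gauss = chi2_data_weighted_constant(counts)
print("Poisson error model estimate:", mu_poisson)
print("Gaussian (data-weighted) estimate:", mu_gauss)
```

Because bins with fewer counts receive larger weights, the data-weighted Gaussian estimate can never exceed the sample mean, and at low counts it falls noticeably below the simulated rate. The structural model is identical in both fits; only the error model differs.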

All models are wrong, but some are useful

hlee — Tue, 01 Jul 2008 03:12:23 +0000

All models are wrong, but some are useful. –George Box

One of the most frequently cited quotes appeared in an article titled The End of Theory: The Data Deluge Makes the Scientific Method Obsolete, which I liked very much because it cited the updated maxim by Peter Norvig, Google’s research director:

All models are wrong, and increasingly you can succeed without them.

The article addressed perspectives on the new petabyte era of data analysis, in which traditional modeling and testing are not likely to be feasible.

I’d like to thank the person who forwarded this article. However, I have no intention of advertising the company in the article through your clicks and reading. At the least, I’d like to urge that we need more innovative thinking than what we normally do with small data sets, as described by the author, Chris Anderson:

The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.

Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.

I cannot put it elegantly, but simply put, data analysis should be directed by listening to the data and letting the data talk to you, instead of framing models onto the data (particularly when the data set is large or humongous; good a priori knowledge might be an exception, but we never have enough of it, which is where disputes about errors come in).

Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.

[ArXiv] 1st week, Apr. 2008

hlee — Sun, 06 Apr 2008 15:10:15 +0000

I’m very curious how astronomers began to use Monte Carlo Markov Chain instead of Markov chain Monte Carlo. The more popular it becomes, the more frequently Monte Carlo Markov Chain appears. Anyway, this week I added non-astrostatistical papers to the list: a tutorial, the big bang, and biblical theology.

[astro-ph:0803.4089] R. Trotta
Bayes in the sky: Bayesian inference and model selection in cosmology (Bayesian cosmology tutorial).
[astro-ph:0804.0070] W. Cui et al.
An ideal mass assignment scheme for measuring the Power Spectrum with FFTs
[astro-ph:0804.0155] L. Wang et al.
Timeline analysis and wavelet multiscale analysis of the AKARI All-Sky Survey at 90 micron
[astro-ph:0804.0278] L. Colombo and E. Pierpaoli
Model independent approaches to reionization in the analysis of upcoming CMB data
[astro-ph:0804.0285] L. Vergani et al.
Dark Matter – Dark Energy coupling biasing parameter estimates from CMB data
[astro-ph:0804.0294] A. Romeo et al.
Discreteness Effects in Lambda Cold Dark Matter Simulations: A Wavelet-Statistical View
[astro-ph:0804.0373] F. Schmidt et al.
Weak Lensing Effects on the Galaxy Three-Point Correlation Function
[astro-ph:0804.0382] R. U. Abbasi et al.
Search for Correlations between HiRes Stereo Events and Active Galactic Nuclei
[astro-ph:0804.0543] M. Schmalzl et al.
The Initial Mass Function of the Stellar Association NGC 602 in the Small Magellanic Cloud with Hubble Space Telescope ACS Observations

gravitational microlensing tutorial? [astro-ph:0803.4324]
Recent Developments in Gravitational Microlensing by A. Gould

paper with a very interesting title: [astro-ph:0803.3604]
Was There A Big Bang? by R. K. Soberman and M. Dubin

not astrostatistics but an atypical statistical application, an interesting topic, and good discussions: [stat.AP:0804.0079]
Statistical analysis of an archeological find by A. Feuerverger
Discussants are S.M. Stigler, C. Fuchs, D.L. Bentley, S.M. Bird, H. Höfling, L. Wasserman, R. Ingermanson, J. Mortera, P. Vicard, J.B. Kadane.

language barrier

hlee — Wed, 13 Feb 2008 20:41:32 +0000

Last week, I was at a Tufts colloquium and happened to have a conversation with a computer scientist about density-based clustering. I understood density as probability density and was recalling a paper by Fraley and Raftery (Model-Based Clustering, Discriminant Analysis, and Density Estimation, JASA, 2002, 97(458), pp. 611-631) and other similar papers I had seen in engineering journals like IEEE Transactions. For a few moments I felt uncomfortable, and then she explained that density meant “how dense observations are.” Density-based clustering was meant as distance-based clustering, like k-means or minimum spanning trees, most likely nonparametric approaches.
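
The two readings of "density" can be put side by side in a short Python sketch, assuming only NumPy. The first function counts neighbors within a radius ("how dense observations are", a purely distance-based notion); the second evaluates the probability density of a Gaussian mixture, the quantity that model-based clustering works with. The data points, the radius, and the mixture parameters are invented for illustration and are not fitted to anything.

```python
import numpy as np

def neighbor_count_density(x, radius):
    """'How dense observations are': number of other points within `radius` of each point."""
    d = np.abs(x[:, None] - x[None, :])          # pairwise distances in 1-D
    return (d <= radius).sum(axis=1) - 1         # subtract 1 to exclude the point itself

def gaussian_mixture_pdf(x, weights, means, sds):
    """Probability density of a 1-D Gaussian mixture evaluated at each x."""
    x = np.asarray(x, dtype=float)[:, None]
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)
    comp = np.exp(-0.5 * ((x - means) / sds) ** 2) / (sds * np.sqrt(2.0 * np.pi))
    return comp @ weights                        # weighted sum over components

# Three clumped points and one isolated point (assumed toy data).
x = np.array([0.0, 0.1, 0.2, 5.0])
counts = neighbor_count_density(x, radius=0.5)

# A hypothetical two-component mixture, chosen by hand rather than fitted.
pdf = gaussian_mixture_pdf(x, weights=[0.75, 0.25], means=[0.1, 5.0], sds=[0.1, 1.0])
print("neighbor counts:", counts)
print("mixture pdf values:", pdf)
```

The neighbor count needs only a distance and a radius, while the mixture density needs a full probabilistic model with parameters, which is roughly where the two vocabularies part ways.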

Although the words are the same, the first impression and usage differ quite a bit from community to community (even among statisticians). One word I’m very reluctant to use with both astronomers and statisticians is model. I’m quite confused by the reactions from both sides. To clarify meanings, implications, or intentions, some clever adjectives must accompany these common words; however, once one gets used to the jargon, the adjectives feel redundant among fellow scientists and colleagues, whereas outsiders get lost and seek an explanation of the usage through related examples and background.

Beyond simple words like model and density, there is more jargon that requires interdisciplinary semantic experts. Yet patience in explaining and open-mindedness would readily help us get over language barriers in any interdisciplinary work.

[ Would you mind sharing your experience of any language barrier? ]

model vs model

vlk — Fri, 05 Oct 2007 17:38:23 +0000

As Alanna pointed out, astronomers and statisticians mean different things when they say “model”. To complicate matters, we have also started to use another term called “data model”.

First, there is the physical model, which could mean either our understanding of what processes operate on a source (the physics part, usually involving PDEs), or the mathematical function that describes the emission as a function of observables like location, time, or energy (the astronomy part, usually the shape of the spectrum, or the time evolution in a light curve, etc.).

The data model on the other hand describes the organization of the observation. It is this which tells us that there is a fundamental difference between an effective area and a response matrix, and conversely, that the point spread function and the line response function are the same beast. This kind of thing, which I suppose is a computer science oriented view of the contents of a file, is crucial for implementing and running something like the Virtual Observatory.
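
One way to read the distinction is in terms of array shapes and roles. The Python sketch below, assuming only NumPy and the standard library, treats an effective area as a one-dimensional multiplicative efficiency and a response matrix as a two-dimensional redistribution kernel; a point spread function or line response function could be held in the same kernel type. The class names, field names, and numbers are hypothetical and do not follow any observatory's file format or the Virtual Observatory data model.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EffectiveArea:
    """1-D multiplicative efficiency: collecting area (cm^2) in each energy bin."""
    area: np.ndarray                 # shape (n_energy,)

@dataclass
class RedistributionKernel:
    """2-D redistribution: matrix[i, j] = Pr(recorded in bin i | true bin j)."""
    matrix: np.ndarray               # shape (n_out, n_in)

def fold(flux, arf, rmf, exposure):
    """Predicted counts per output bin: scale by the area, then redistribute."""
    return rmf.matrix @ (flux * arf.area * exposure)

# Toy numbers (assumed): 3 energy bins, 3 detector channels.
flux = np.array([1.0e-3, 2.0e-3, 3.0e-3])        # photons / cm^2 / s per bin
arf = EffectiveArea(area=np.array([100.0, 200.0, 300.0]))
rmf = RedistributionKernel(matrix=np.array([
    [0.8, 0.1, 0.0],
    [0.2, 0.8, 0.2],
    [0.0, 0.1, 0.8],
]))                                              # each column sums to 1
predicted = fold(flux, arf, rmf, exposure=1000.0)
print("predicted counts per channel:", predicted)
```

The effective area only rescales each bin, whereas the kernel moves counts between bins, which is why the two deserve different types even when both arrive as arrays in a file.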