SoDDA is a soft-clustering scheme for classifying galaxies using the 4 emission-line ratios [NII]/Ha, [SII]/Ha, [OI]/Ha, and [OIII]/Hb. It fits several multivariate Gaussians to the 4-D distribution of the observed data to capture local structures; these Gaussians are then grouped to represent the complex multi-dimensional structure of the joint distribution of galaxies in the 4-D line-ratio space.
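As a rough illustration of the idea above, the following sketch computes class probabilities for one galaxy from a Gaussian mixture whose components are grouped into classes. It is not the code in classification.py: the helper names log_gauss and class_posteriors, the 3-component mixture, and the 2-class grouping are all made up for illustration. Only the sample line ratios are taken from the example data in this README.

```python
import numpy as np


def log_gauss(x, mean, cov):
    """Log-density of a multivariate Gaussian at the point x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)  # squared Mahalanobis distance
    return -0.5 * (x.size * np.log(2.0 * np.pi) + logdet + maha)


def class_posteriors(x, weights, means, covars, clusters, n_classes):
    """Posterior probability of each class for x under a grouped mixture."""
    # Log of (mixture weight * component density) for every component.
    log_joint = np.log(weights) + np.array(
        [log_gauss(x, m, c) for m, c in zip(means, covars)]
    )
    # Normalise in a numerically stable way to get component posteriors.
    resp = np.exp(log_joint - log_joint.max())
    resp /= resp.sum()
    # Add up the posteriors of the components allocated to each class.
    return np.bincount(clusters, weights=resp, minlength=n_classes)


# Hypothetical toy mixture: 3 components in 4-D, grouped into 2 classes.
weights = np.array([0.5, 0.3, 0.2])
means = np.array([
    [-0.5, -0.5, -1.6, -0.6],
    [-0.4, -0.6, -1.5, -0.5],
    [0.1, -0.1, -0.8, 0.6],
])
covars = np.array([0.05 * np.eye(4)] * 3)
clusters = np.array([0, 0, 1])  # component k belongs to class clusters[k]

# Log line ratios of the sample galaxy used elsewhere in this README.
x = np.array([-0.525441, -0.556073, -1.623533, -0.621178])
probs = class_posteriors(x, weights, means, covars, clusters, n_classes=2)
print(probs)
```

Because each galaxy receives a probability for every class rather than a single label, the clustering is "soft": the class probabilities sum to 1, and a hard label can be obtained afterwards by taking the most probable class.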
A guide to the contents:
The classification.py file contains the following sample functions:

1) compute_prob(x, weights, means, covars, clusters = clusters):

   x: 1x4 numpy array containing the ratios LOG(NII/H_ALPHA), LOG(SII/H_ALPHA), LOG(OI/H_ALPHA), LOG(OIII/H_BETA), in that order
   weights: the weights of the 20 subpopulations, contained in the supplementary material
   means: the means of the 20 subpopulations, contained in the supplementary material
   covars: the covariances of the 20 subpopulations, contained in the supplementary material
   clusters: the allocation of the 20 subpopulations to the activity classes
   RETURNS: the posterior probabilities of a galaxy belonging to each activity class: SFG, Seyferts, LINERs, and Composites.

   Example:

       import scipy.stats as stats
       import numpy as np
       import pandas as pd
       # Assumes classification.py is on the Python path
       from classification import compute_prob

       # Parameters of the 20 subpopulations (supplementary material)
       means = np.load("means.npy")
       weights = np.load("weights.npy")
       covars = np.load("covars.npy")

       # One galaxy: two identifier columns followed by the four log line ratios
       data = pd.DataFrame(np.array([299491051364706304, 6, -0.525441, -0.556073, -1.623533, -0.621178]).reshape(1,6),
                           columns = ["SPECOBJID", "Index", "LOG(NII/H_ALPHA)", "LOG(SII/H_ALPHA)", "LOG(OI/H_ALPHA)", "LOG(OIII/H_BETA)"])
       posterior_prob = compute_prob(data.iloc[0, 2:], weights, means, covars)

2) svm_classification_4d(x, svc_4d_coef, svc_4d_inter):

   x: 1x4 numpy array containing the ratios LOG(NII/H_ALPHA), LOG(SII/H_ALPHA), LOG(OI/H_ALPHA), LOG(OIII/H_BETA), in that order
   svc_4d_coef: the 4-dimensional SVM coefficients, contained in the supplementary material
   svc_4d_inter: the 4-dimensional SVM intercepts, contained in the supplementary material
   RETURNS: the classification of a galaxy according to the 4-dimensional SVM (0 for SFG, 1 for Seyferts, 2 for LINERs, 3 for Composites, 5 for undefined).

   Note on the undefined case: we trained the SVM with the scikit-learn library using the one-vs-rest (`ovr') decision function. This kind of decision function can leave specific regions of the 4-dimensional space that are not covered by any of the 4 classes. scikit-learn classifies such points based on their distance to the boundaries. The trained SVM model, included in the online version as svm_4d.sav, can be used for this purpose.

   Example:

       import numpy as np
       import pandas as pd
       from sklearn import svm
       import pickle
       # Assumes classification.py is on the Python path
       from classification import svm_classification_4d

       data = pd.DataFrame(np.array([299491051364706304, 6, -0.525441, -0.556073, -1.623533, -0.621178]).reshape(1,6),
                           columns = ["SPECOBJID", "Index", "LOG(NII/H_ALPHA)", "LOG(SII/H_ALPHA)", "LOG(OI/H_ALPHA)", "LOG(OIII/H_BETA)"])

       # Classify from the stored SVM coefficients and intercepts
       svc_4d_inter = np.load("svm_4d_intercept.npy")
       svc_4d_coef = np.load("svm_4d_coefs.npy")
       predicted_class = svm_classification_4d(data.iloc[0, 2:], svc_4d_coef, svc_4d_inter)
       print("class based on 4d SVM = " + str(predicted_class))

       # Use the trained classifier
       filename = 'svm_4d.sav'
       with open(filename, 'rb') as f:
           svc = pickle.load(f)
       predicted_class = svc.predict(data.iloc[0, 2:].values.reshape(1,4))
       print("class based on 4d trained SVM classifier = " + str(predicted_class))

3) svm_classification_3d(x, svc_3d_coef, svc_3d_inter):

   x: 1x3 numpy array containing the ratios LOG(NII/H_ALPHA), LOG(SII/H_ALPHA), LOG(OIII/H_BETA), in that order
   svc_3d_coef: the 3-dimensional SVM coefficients, contained in the supplementary material
   svc_3d_inter: the 3-dimensional SVM intercepts, contained in the supplementary material
   RETURNS: the classification of a galaxy according to the 3-dimensional SVM (0 for SFG, 1 for Seyferts, 2 for LINERs, 3 for Composites, 5 for undefined; see svm_classification_4d for an explanation of undefined).
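The undefined class (5) can be illustrated with a generic one-vs-rest linear rule. The sketch below is not the implementation of svm_classification_4d or svm_classification_3d: the function names ovr_classify and largest_score_class, the hyperplanes, and the array shapes (one coefficient row and one intercept per class) are assumptions made for illustration, and the numbers are not those stored in svm_4d_coefs.npy.

```python
import numpy as np

UNDEFINED = 5  # code this README uses for points claimed by no class


def ovr_classify(x, coef, intercept):
    """One-vs-rest linear rule: the claiming class, or UNDEFINED if none."""
    scores = coef @ x + intercept  # one signed score per class
    claimed = np.flatnonzero(scores > 0)
    if claimed.size == 0:
        return UNDEFINED
    # Assumption: if several classes claim the point, keep the largest score.
    return int(claimed[np.argmax(scores[claimed])])


def largest_score_class(x, coef, intercept):
    """Fallback that always returns a class: the one with the largest score.

    With unit-norm coefficient rows, the score equals the signed distance
    to that class's boundary.
    """
    return int(np.argmax(coef @ x + intercept))


# Hypothetical hyperplanes in 4-D: class k claims points with x[k] > 1.
coef = np.eye(4)
intercept = np.full(4, -1.0)

inside = np.array([2.0, 0.0, 0.0, 0.0])   # only class 0 has a positive score
outside = np.array([0.5, 0.2, 0.1, 0.0])  # every score is negative

print(ovr_classify(inside, coef, intercept))
print(ovr_classify(outside, coef, intercept))
print(largest_score_class(outside, coef, intercept))
```

The second point falls in a region that no class claims, so the strict rule returns 5, while the fallback still assigns the class whose boundary is most favourable. This mirrors the distinction drawn above between the coefficient-based functions and the trained classifier in svm_4d.sav.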
The entire set of programs, example datasets, and associated subroutines may be downloaded as a tar file: hea-www.harvard.edu/SoDDA/SoDDA.tar.gz