classification

How to rank the instances based on prediction probability in sklearn

与世无争的帅哥 submitted on 2019-12-08 14:00:40
I am using sklearn's support vector machine (SVC) as follows to get the prediction probability of the instances in my dataset, using 10-fold cross-validation:

    import numpy as np
    from sklearn import datasets
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_predict

    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # probability=True is needed for predict_proba
    clf = SVC(class_weight="balanced", probability=True)
    proba = cross_val_predict(clf, X, y, cv=10, method='predict_proba')
    clf.fit(X, y)  # cross_val_predict clones clf, so fit it here to expose classes_
    print(clf.classes_)
    print(proba[:, 1])
    print(np.argsort(proba[:, 1]))

My expected output for print(proba[:,1]) and print(np.argsort(proba[:,1])) is as follows, where the first one indicates the prediction probability of all instances for class …
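The question title asks how to rank instances by these probabilities. A minimal, self-contained sketch of the ranking step (the stand-in proba array is an assumption; in the question it comes from cross_val_predict, and since np.argsort sorts ascending, the order is reversed to put the most probable instance first):

    import numpy as np

    # stand-in probabilities; in the question, proba comes from cross_val_predict
    proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])

    # instance indices ordered by P(class 1), highest first
    ranking = np.argsort(proba[:, 1])[::-1]
    print(ranking)  # -> [1 2 0]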

How do I specify the positive class in an H2O random forest or other binary classifier?

心已入冬 submitted on 2019-12-08 13:16:47
Question: I am building a binary classification model in H2O with Python. My 'y' values are 'ok' and 'bad'. I need the metrics to be computed with ok = negative class = 0 and bad = positive class = 1. However, I do not see any way to set this in H2O. For example, here is the output of the predictions and the confusion matrix:

    Confusion matrix:
              bad    ok   Error   Rate
    bad      3859   631   0.1405  (631.0/4490.0)
    ok        477  1069   0.3085  (477.0/1546.0)
    Total    4336  1700   0.1836  (1108.0/6036.0)

    >>> predictions.head(10)
      predict   bad   ok
    0 …
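One common workaround, sketched below under an assumption worth verifying: H2O orders factor levels lexicographically and treats the second level as the positive class, which is why 'ok' ends up positive by default here. Recoding the labels so that the desired positive class sorts last flips the roles. The frame and column names are hypothetical:

    import h2o
    import pandas as pd

    h2o.init()

    # assumed pandas DataFrame `pdf` with a label column 'y' holding 'ok'/'bad';
    # prefix the labels so 'bad' sorts last and becomes the positive level
    pdf['y'] = pdf['y'].map({'ok': '0_ok', 'bad': '1_bad'})

    hf = h2o.H2OFrame(pdf)
    hf['y'] = hf['y'].asfactor()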

How do we enable OpenMP to use multiple cores — glinternet

萝らか妹 submitted on 2019-12-08 11:00:08
Question: I want to use glinternet, an R package that implements a feature-extraction methodology developed by Stanford professor Trevor Hastie and a PhD student. The function has an argument numCores. According to the user manual:

    numCores: Number of threads to run. For this to work, the package must be
    installed with OpenMP enabled. Default is 1 thread.

I don't know, though, how to enable OpenMP. I have Windows 8. Your advice will be appreciated.

Answer 1: Here is the answer from the glinternet package …
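The standard R route, sketched here as an assumption rather than a verified recipe for this particular package: reinstall glinternet from source with OpenMP compiler flags in place (on Windows this requires Rtools):

    # 1) Put OpenMP flags in ~/.R/Makevars.win (file contents, not R code):
    #        CFLAGS  += -fopenmp
    #        LDFLAGS += -fopenmp
    # 2) Reinstall the package from source:
    install.packages("glinternet", type = "source")

    # 3) Then request multiple threads via the documented argument, e.g.:
    # fit <- glinternet(X, Y, numLevels, numCores = 4)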

Classify using GMM with MATLAB

喜你入骨 submitted on 2019-12-08 10:42:54
Question: I want to perform classification of two classes using Gaussian Mixture Models with MATLAB. I do the training by creating two models with the function gmdistribution.fit:

    NComponents = 1;
    for class = 1:2
        model(class).obj = gmdistribution.fit(trainData(class).feature, NComponents, 'Regularize', .1);
    end

Then, given test data points, I want to know how to classify them. What I am doing now is to obtain the posterior probability for each point under each model:

    vectorClasses = zeros(1,2);
    for class = 1:2
        Pos…
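The decision rule the question is after (score each test point under every class model, then pick the class with the highest posterior) can be sketched with scikit-learn's GaussianMixture as a Python analog; this is a swapped-in library, not the asker's MATLAB setup:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_per_class(train_data):
        # one 1-component GMM per class, mirroring gmdistribution.fit with 'Regularize'
        return [GaussianMixture(n_components=1, reg_covar=0.1).fit(feats)
                for feats in train_data]

    def classify(models, X_test, priors):
        # score_samples gives the per-sample log-likelihood under each model;
        # add log-priors and take the argmax over classes
        log_post = np.column_stack([m.score_samples(X_test) + np.log(p)
                                    for m, p in zip(models, priors)])
        return log_post.argmax(axis=1)  # 0-based class index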

Label Propagation - Array is too big

笑着哭i submitted on 2019-12-08 08:32:21
Question: I am using label propagation in scikit-learn for semi-supervised classification. I have 17,000 data points with 7 dimensions. I am unable to use it on this data set; it throws a numpy big-array error. However, it works fine on a relatively small data set, say 200 points. Can anyone suggest a fix?

    label_prop_model.fit(np.array(data), labels)
      File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 58, in fit
        graph_matrix = self._build_graph()
      File "/usr…
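A common mitigation, sketched under the assumption that the error comes from the dense affinity matrix: the default rbf kernel materializes a 17,000 x 17,000 graph (roughly 2 GB of float64), whereas the knn kernel keeps the graph sparse:

    import numpy as np
    from sklearn.semi_supervised import LabelPropagation

    # kernel='knn' builds a sparse k-nearest-neighbour graph instead of the
    # dense rbf affinity matrix; n_neighbors=7 is a tunable assumption here
    label_prop_model = LabelPropagation(kernel='knn', n_neighbors=7)
    # label_prop_model.fit(np.array(data), labels)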

How to pickle individual steps in sklearn's Pipeline?

﹥>﹥吖頭↗ submitted on 2019-12-08 08:21:49
Question: I am using Pipeline from sklearn to classify text. In this example Pipeline, I have a TfidfVectorizer and some custom features wrapped with FeatureUnion, and a classifier as the Pipeline steps; I then fit the training data and do the prediction:

    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    X = ['I am a sentence', 'an example']
    Y = [1, 2]
    X_dev = ['another sentence']

    # classifier
    LinearSVC1 = …
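A minimal sketch of pickling the steps one at a time via named_steps (joblib is the usual serializer for sklearn objects; the step names below are assumptions about how the Pipeline was constructed):

    import joblib

    # assumed: `pipeline` was built with named steps, e.g.
    # Pipeline([('features', feature_union), ('clf', LinearSVC())]), and is fit
    for name, step in pipeline.named_steps.items():
        joblib.dump(step, name + '.joblib')

    # later, reload a single step on its own
    clf = joblib.load('clf.joblib')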

SKLearn: Getting distance of each point from decision boundary?

倾然丶 夕夏残阳落幕 submitted on 2019-12-08 08:07:31
Question: I am using SKLearn to run SVC on my data:

    from sklearn import svm
    svc = svm.SVC(kernel='linear', C=C).fit(X, y)

I want to know how I can get the distance of each data point in X from the decision boundary.

Answer 1: For a linear kernel, the decision boundary is y = w * x + b, and the distance from a point x to the decision boundary is y / ||w||:

    y = svc.decision_function(x)
    w_norm = np.linalg.norm(svc.coef_)
    dist = y / w_norm

For non-linear kernels, there is no way to get the absolute distance. But you can …
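A self-contained version of the answer's computation on toy data (binary labels assumed; for multi-class problems decision_function and coef_ carry one row per pairwise classifier and would need indexing):

    import numpy as np
    from sklearn import svm

    X = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.5]])
    y = np.array([0, 0, 1, 1])

    svc = svm.SVC(kernel='linear', C=1.0).fit(X, y)

    # signed functional margin w.x + b, scaled by ||w|| to give a distance
    dist = svc.decision_function(X) / np.linalg.norm(svc.coef_)
    print(dist)  # sign indicates which side of the boundary each point is on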

Pickling a trained classifier yields different results than a newly but identically trained classifier

喜夏-厌秋 submitted on 2019-12-08 07:29:04
Question: I'm trying to pickle a trained SVM classifier from the scikit-learn library so that I don't have to train it over and over again. But when I pass the test data to the classifier loaded from the pickle, I get unusually high values for accuracy, F-measure, etc. If the test data is passed directly to a classifier that has not been pickled, it gives much lower values. I don't understand why pickling and unpickling the classifier object changes its behavior. Can someone please help me out…
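A sketch of a sanity check that isolates the serialization step (clf and X_test are assumed to be the trained classifier and the prepared test matrix): a pickle round-trip alone should reproduce identical predictions, so if this prints True, the discrepancy lies in how the test data is prepared between the two runs:

    import pickle
    import numpy as np

    with open('clf.pkl', 'wb') as f:
        pickle.dump(clf, f)
    with open('clf.pkl', 'rb') as f:
        clf_loaded = pickle.load(f)

    # element-wise comparison of the two prediction vectors
    print(np.array_equal(clf.predict(X_test), clf_loaded.predict(X_test)))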

At what stage exactly does training take place in FlannBasedMatcher in OpenCV?

会有一股神秘感。 submitted on 2019-12-08 06:56:37
Question: The following code is in C++, and I am using OpenCV for my experiment. Suppose I am using a kd-tree (FlannBasedMatcher) in the following way:

    // These are inputs to the code snippet below.
    // They are filled with suitable values.
    Mat& queryDescriptors;
    vector<Training>& trainCollection;
    vector< vector<DMatch> >& matches;
    int knn;

    // setting flann parameters
    const Ptr<flann::IndexParams>& indexParams = new flann::KDTreeIndexParams(4);
    const Ptr<flann::SearchParams>& searchParams = new flann::SearchParams…
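As a sketch in OpenCV's Python bindings (the C++ flow is analogous, and this reading of when the index is built is an assumption worth checking against the docs): add() only stores descriptors, train() builds the FLANN kd-tree, and knnMatch() triggers train() implicitly if it has not run yet:

    import numpy as np
    import cv2

    index_params = dict(algorithm=1, trees=4)   # 1 = FLANN_INDEX_KDTREE
    search_params = dict(checks=32)
    matcher = cv2.FlannBasedMatcher(index_params, search_params)

    train_desc = np.random.rand(100, 64).astype(np.float32)
    query_desc = np.random.rand(10, 64).astype(np.float32)

    matcher.add([train_desc])   # just stores the descriptors
    matcher.train()             # the kd-tree index is actually built here
    matches = matcher.knnMatch(query_desc, k=2)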

Designing a classification problem for weather data

梦想与她 submitted on 2019-12-08 05:53:25
Question: In a normal two-class or multi-class classification problem, we can use any well-known machine learning algorithm, such as Naive Bayes or SVM, to train and test the model. My problem is that I have been given weather data where the label variable is in the format "20 % rain, 80 % dry" or "30% cloudy, 70% rain", etc. How should I approach this problem? Will I need to convert the problem into regression somehow? In that case, if there are three labels (rain, dry, cloudy) in the data, what may be the right approach…
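One workable framing of the regression route the question raises, sketched with assumed data: parse each label string into a vector of class proportions and fit a multi-output regressor, renormalizing predictions so they sum to 1:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.multioutput import MultiOutputRegressor

    # assumed: each row of Y holds [p_rain, p_dry, p_cloudy] parsed from the
    # labels, e.g. "20 % rain, 80 % dry" -> [0.2, 0.8, 0.0]
    X = np.random.rand(100, 5)
    Y = np.random.dirichlet([1., 1., 1.], size=100)

    model = MultiOutputRegressor(RandomForestRegressor()).fit(X, Y)
    pred = model.predict(X[:3])
    pred = pred / pred.sum(axis=1, keepdims=True)   # back to proportions
    print(pred)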