classification

How to rank the instances based on prediction probability in sklearn

与世无争的帅哥 submitted on 2019-12-08 14:00:40
I am using sklearn's support vector machine (SVC) as follows to get the prediction probability of the instances in my dataset, using 10-fold cross-validation:

    import numpy as np
    from sklearn import datasets
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_predict

    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # probability=True is needed for predict_proba
    clf = SVC(class_weight="balanced", probability=True)
    proba = cross_val_predict(clf, X, y, cv=10, method='predict_proba')
    clf.fit(X, y)  # cross_val_predict clones clf, so fit it here to expose classes_
    print(clf.classes_)
    print(proba[:, 1])
    print(np.argsort(proba[:, 1]))

My expected output for print(proba[:,1]) and print(np.argsort(proba[:,1])) is as follows, where the first one indicates the prediction probability of all instances for class …
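The question title asks how to rank instances by these probabilities. A minimal, self-contained sketch of the ranking step (the stand-in proba array is an assumption; in the question it comes from cross_val_predict, and since np.argsort sorts ascending, the order is reversed to put the most probable instance first):

    import numpy as np

    # stand-in probabilities; in the question, proba comes from cross_val_predict
    proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])

    # instance indices ordered by P(class 1), highest first
    ranking = np.argsort(proba[:, 1])[::-1]
    print(ranking)  # -> [1 2 0]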

How do I specify the positive class in an H2O random forest or other binary classifier?

心已入冬 submitted on 2019-12-08 13:16:47
Question: I am building a binary classification model in H2O with Python. My 'y' values are 'ok' and 'bad'. I need the metrics to be computed with ok = negative class = 0 and bad = positive class = 1. However, I do not see any way to set this in H2O. For example, here is the output of the predictions and the confusion matrix:

    Confusion matrix:
              bad    ok   Error   Rate
    bad      3859   631   0.1405  (631.0/4490.0)
    ok        477  1069   0.3085  (477.0/1546.0)
    Total    4336  1700   0.1836  (1108.0/6036.0)

    >>> predictions.head(10)
      predict   bad   ok
    0 …
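One common workaround, sketched below under an assumption worth verifying: H2O orders factor levels lexicographically and treats the second level as the positive class, which is why 'ok' ends up positive by default here. Recoding the labels so that the desired positive class sorts last flips the roles. The frame and column names are hypothetical:

    import h2o
    import pandas as pd

    h2o.init()

    # assumed pandas DataFrame `pdf` with a label column 'y' holding 'ok'/'bad';
    # prefix the labels so 'bad' sorts last and becomes the positive level
    pdf['y'] = pdf['y'].map({'ok': '0_ok', 'bad': '1_bad'})

    hf = h2o.H2OFrame(pdf)
    hf['y'] = hf['y'].asfactor()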

How do we enable OpenMP to use multiple cores — glinternet

萝らか妹 submitted on 2019-12-08 11:00:08
Question: I want to use glinternet, an R package that implements a feature-extraction methodology developed by Stanford professor Trevor Hastie and a PhD student. The function has an argument numCores. According to the user manual:

    numCores: Number of threads to run. For this to work, the package must be
    installed with OpenMP enabled. Default is 1 thread.

I don't know, though, how to enable OpenMP. I have Windows 8. Your advice will be appreciated.

Answer 1: Here is the answer from the glinternet package …
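The standard R route, sketched here as an assumption rather than a verified recipe for this particular package: reinstall glinternet from source with OpenMP compiler flags in place (on Windows this requires Rtools):

    # 1) Put OpenMP flags in ~/.R/Makevars.win (file contents, not R code):
    #        CFLAGS  += -fopenmp
    #        LDFLAGS += -fopenmp
    # 2) Reinstall the package from source:
    install.packages("glinternet", type = "source")

    # 3) Then request multiple threads via the documented argument, e.g.:
    # fit <- glinternet(X, Y, numLevels, numCores = 4)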

Classify using GMM with MATLAB

喜你入骨 submitted on 2019-12-08 10:42:54
Question: I want to perform classification of two classes using Gaussian Mixture Models with MATLAB. I do the training by creating two models with the function gmdistribution.fit:

    NComponents = 1;
    for class = 1:2
        model(class).obj = gmdistribution.fit(trainData(class).feature, NComponents, 'Regularize', .1);
    end

Then, given test data points, I want to know how to classify them. What I am doing now is to obtain the posterior probability for each point under each model:

    vectorClasses = zeros(1,2);
    for class = 1:2
        Pos…
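The decision rule the question is after (score each test point under every class model, then pick the class with the highest posterior) can be sketched with scikit-learn's GaussianMixture as a Python analog; this is a swapped-in library, not the asker's MATLAB setup:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_per_class(train_data):
        # one 1-component GMM per class, mirroring gmdistribution.fit with 'Regularize'
        return [GaussianMixture(n_components=1, reg_covar=0.1).fit(feats)
                for feats in train_data]

    def classify(models, X_test, priors):
        # score_samples gives the per-sample log-likelihood under each model;
        # add log-priors and take the argmax over classes
        log_post = np.column_stack([m.score_samples(X_test) + np.log(p)
                                    for m, p in zip(models, priors)])
        return log_post.argmax(axis=1)  # 0-based class index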

Label Propagation - Array is too big

笑着哭i submitted on 2019-12-08 08:32:21
Question: I am using label propagation in scikit-learn for semi-supervised classification. I have 17,000 data points with 7 dimensions. I am unable to use it on this data set; it throws a numpy big-array error. However, it works fine on a relatively small data set, say 200 points. Can anyone suggest a fix?

    label_prop_model.fit(np.array(data), labels)
      File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 58, in fit
        graph_matrix = self._build_graph()
      File "/usr…
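A common mitigation, sketched under the assumption that the error comes from the dense affinity matrix: the default rbf kernel materializes a 17,000 x 17,000 graph (roughly 2 GB of float64), whereas the knn kernel keeps the graph sparse:

    import numpy as np
    from sklearn.semi_supervised import LabelPropagation

    # kernel='knn' builds a sparse k-nearest-neighbour graph instead of the
    # dense rbf affinity matrix; n_neighbors=7 is a tunable assumption here
    label_prop_model = LabelPropagation(kernel='knn', n_neighbors=7)
    # label_prop_model.fit(np.array(data), labels)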

How to pickle individual steps in sklearn's Pipeline?

﹥>﹥吖頭↗ submitted on 2019-12-08 08:21:49
Question: I am using Pipeline from sklearn to classify text. In this example Pipeline, I have a TfidfVectorizer and some custom features wrapped with FeatureUnion, and a classifier as the Pipeline steps; I then fit the training data and do the prediction:

    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    X = ['I am a sentence', 'an example']
    Y = [1, 2]
    X_dev = ['another sentence']

    # classifier
    LinearSVC1 = …
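A minimal sketch of pickling the steps one at a time via named_steps (joblib is the usual serializer for sklearn objects; the step names below are assumptions about how the Pipeline was constructed):

    import joblib

    # assumed: `pipeline` was built with named steps, e.g.
    # Pipeline([('features', feature_union), ('clf', LinearSVC())]), and is fit
    for name, step in pipeline.named_steps.items():
        joblib.dump(step, name + '.joblib')

    # later, reload a single step on its own
    clf = joblib.load('clf.joblib')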

SKLearn: Getting distance of each point from decision boundary?

倾然丶 夕夏残阳落幕 submitted on 2019-12-08 08:07:31
Question: I am using SKLearn to run SVC on my data:

    from sklearn import svm
    svc = svm.SVC(kernel='linear', C=C).fit(X, y)

I want to know how I can get the distance of each data point in X from the decision boundary.

Answer 1: For a linear kernel, the decision boundary is y = w * x + b, and the distance from a point x to the decision boundary is y / ||w||:

    y = svc.decision_function(x)
    w_norm = np.linalg.norm(svc.coef_)
    dist = y / w_norm

For non-linear kernels, there is no way to get the absolute distance. But you can …
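A self-contained version of the answer's computation on toy data (binary labels assumed; for multi-class problems decision_function and coef_ carry one row per pairwise classifier and would need indexing):

    import numpy as np
    from sklearn import svm

    X = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.5]])
    y = np.array([0, 0, 1, 1])

    svc = svm.SVC(kernel='linear', C=1.0).fit(X, y)

    # signed functional margin w.x + b, scaled by ||w|| to give a distance
    dist = svc.decision_function(X) / np.linalg.norm(svc.coef_)
    print(dist)  # sign indicates which side of the boundary each point is on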

Pickling a trained classifier yields different results than a newly but identically trained classifier

喜夏-厌秋 submitted on 2019-12-08 07:29:04
Question: I'm trying to pickle a trained SVM classifier from the scikit-learn library so that I don't have to train it over and over again. But when I pass the test data to the classifier loaded from the pickle, I get unusually high values for accuracy, F-measure, etc. If the test data is passed directly to a classifier that has not been pickled, it gives much lower values. I don't understand why pickling and unpickling the classifier object changes its behavior. Can someone please help me out…
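A sketch of a sanity check that isolates the serialization step (clf and X_test are assumed to be the trained classifier and the prepared test matrix): a pickle round-trip alone should reproduce identical predictions, so if this prints True, the discrepancy lies in how the test data is prepared between the two runs:

    import pickle
    import numpy as np

    with open('clf.pkl', 'wb') as f:
        pickle.dump(clf, f)
    with open('clf.pkl', 'rb') as f:
        clf_loaded = pickle.load(f)

    # element-wise comparison of the two prediction vectors
    print(np.array_equal(clf.predict(X_test), clf_loaded.predict(X_test)))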

At what stage exactly does training take place in FlannBasedMatcher in OpenCV?

会有一股神秘感。 submitted on 2019-12-08 06:56:37
Question: The following code is in C++, and I am using OpenCV for my experiment. Suppose I am using a kd-tree (FlannBasedMatcher) in the following way:

    // These are inputs to the code snippet below.
    // They are filled with suitable values.
    Mat& queryDescriptors;
    vector<Training>& trainCollection;
    vector< vector<DMatch> >& matches;
    int knn;

    // setting flann parameters
    const Ptr<flann::IndexParams>& indexParams = new flann::KDTreeIndexParams(4);
    const Ptr<flann::SearchParams>& searchParams = new flann::SearchParams…
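As a sketch in OpenCV's Python bindings (the C++ flow is analogous, and this reading of when the index is built is an assumption worth checking against the docs): add() only stores descriptors, train() builds the FLANN kd-tree, and knnMatch() triggers train() implicitly if it has not run yet:

    import numpy as np
    import cv2

    index_params = dict(algorithm=1, trees=4)   # 1 = FLANN_INDEX_KDTREE
    search_params = dict(checks=32)
    matcher = cv2.FlannBasedMatcher(index_params, search_params)

    train_desc = np.random.rand(100, 64).astype(np.float32)
    query_desc = np.random.rand(10, 64).astype(np.float32)

    matcher.add([train_desc])   # just stores the descriptors
    matcher.train()             # the kd-tree index is actually built here
    matches = matcher.knnMatch(query_desc, k=2)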

Designing a classification problem for weather data

梦想与她 submitted on 2019-12-08 05:53:25
Question: In a normal two-class or multi-class classification problem, we can use any well-known machine learning algorithm, such as Naive Bayes or SVM, to train and test the model. My problem is that I have been given weather data where the label variable is in the format "20 % rain, 80 % dry" or "30% cloudy, 70% rain", etc. How should I approach this problem? Will I need to convert the problem into regression somehow? In that case, if there are three labels (rain, dry, cloudy) in the data, what may be the right approach…
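One workable framing of the regression route the question raises, sketched with assumed data: parse each label string into a vector of class proportions and fit a multi-output regressor, renormalizing predictions so they sum to 1:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.multioutput import MultiOutputRegressor

    # assumed: each row of Y holds [p_rain, p_dry, p_cloudy] parsed from the
    # labels, e.g. "20 % rain, 80 % dry" -> [0.2, 0.8, 0.0]
    X = np.random.rand(100, 5)
    Y = np.random.dirichlet([1., 1., 1.], size=100)

    model = MultiOutputRegressor(RandomForestRegressor()).fit(X, Y)
    pred = model.predict(X[:3])
    pred = pred / pred.sum(axis=1, keepdims=True)   # back to proportions
    print(pred)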