classification

Learning decision trees on huge datasets

Submitted by 被刻印的时光 ゝ on 2019-12-10 18:36:41
Question: I'm trying to build a binary classification decision tree in MATLAB from huge datasets (i.e. ones that cannot be stored in memory). Essentially, what I'm doing is: collect all the data; try out n decision functions on the data; pick the decision function that best separates the classes within the data; split the original dataset into 2; recurse on the splits. The data has k attributes and a classification, so it is stored as a matrix with a huge number of rows and k+1 columns. The decision …
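The key to making this work out of memory is that scoring a candidate decision function only needs per-class counts on each side of the split, not the rows themselves. A minimal Python sketch of that idea (names like stream-of-chunks are illustrative, not any MATLAB API): stream the matrix in chunks, accumulate class counts for the left/right partitions, and compute the weighted Gini impurity from the counts alone.

```python
# Evaluate one candidate split (attr <= threshold) over a dataset too large
# for memory: stream it in chunks, keep only two small class-count dicts.

def gini(counts):
    """Gini impurity of a class-count dict."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_impurity(chunks, attr, threshold):
    """Weighted Gini impurity of splitting on `attr <= threshold`.

    `chunks` yields lists of (features, label) rows, so only one chunk
    plus the two count dicts are ever held in memory.
    """
    left, right = {}, {}
    for chunk in chunks:
        for features, label in chunk:
            side = left if features[attr] <= threshold else right
            side[label] = side.get(label, 0) + 1
    n_l, n_r = sum(left.values()), sum(right.values())
    n = n_l + n_r
    return (n_l / n) * gini(left) + (n_r / n) * gini(right)
```

Score each of the n decision functions this way (one streaming pass each, or all in one pass with n count-dict pairs), pick the argmin, then rewrite the rows into two files and recurse on each file.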

How to train Naive Bayes Classifier for n-gram (movie_reviews)

Submitted by 怎甘沉沦 on 2019-12-10 17:54:24
Question: Below is the code for training a Naive Bayes classifier on the movie_reviews dataset with a unigram model. I want to train it and analyze its performance with bigram and trigram models as well. How can we do that?

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def create_word_features(words):
    useful_words = [word for word in words if word not in stopwords.words("english")] …
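The unigram feature builder only needs one change for n-grams: the feature keys become contiguous word tuples instead of single words. nltk.util.ngrams provides the sliding window, but plain zip() shows the idea without needing the corpus; create_ngram_features below mirrors the create_word_features helper from the question.

```python
# Build n-gram features for nltk's NaiveBayesClassifier.

def ngrams(words, n):
    """All contiguous n-word tuples; n=2 gives bigrams, n=3 trigrams."""
    return list(zip(*(words[i:] for i in range(n))))

def create_ngram_features(words, n=2):
    """Boolean feature dict keyed by n-gram, the shape NaiveBayesClassifier expects."""
    return {gram: True for gram in ngrams(words, n)}
```

Then build the training set exactly as in the unigram case, e.g. `(create_ngram_features(movie_reviews.words(f), 2), "neg")` per file. One caveat: filtering stopwords *before* extracting n-grams changes which bigrams exist ("not good" loses "not"), so for n > 1 it is often better to skip the stopword filter.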

Random Forest not working in opencv python (cv2)

Submitted by 孤街醉人 on 2019-12-10 14:54:25
Question: I can't seem to pass the parameters correctly to train a Random Forest classifier in OpenCV from Python. I wrote an implementation in C++ which worked correctly, but I do not get the same results in Python. I found some sample code here: http://fossies.org/linux/misc/opencv-2.4.7.tar.gz:a/opencv-2.4.7/samples/python2/letter_recog.py which seems to indicate that you should pass the parameters in a dict. Here is the code I am using:

rtree_params = dict(max_depth=11, min_sample_count=5, use…

“Error in table(pred = prediction, true = W[, 8]) : all arguments must have the same length”

Submitted by 久未见 on 2019-12-10 14:50:14
Question: This is my data:

  Anon_Student_Id  Problem_Hierarchy            Problem_Name  Problem_View  Number_Of_Steps  Sum_Of_Steps_Duration  Sum_Of_Hints  result
1 80nlN05JQ6       Unit ES_01, Section ES_01-6  EG21           8            3                28                     0             1
2 80nlN05JQ6       Unit ES_01, Section ES_01-6  EG21           9            3                37                     0             0
3 80nlN05JQ6       Unit ES_01, Section ES_01-6  EG21          10            3                50                     0             0
4 80nlN05JQ6       Unit ES_01, Section ES_01-6  EG22           1            3                78                     0             0
5 80nlN05JQ6       Unit ES_01, Section ES_01-6  EG22           2            3                41                     0             1
6 80nlN05JQ6       Unit ES_01, Section ES_01-6  EG22           3            3                92                     0             0

I'm trying to …
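R's `table(pred = prediction, true = W[, 8])` cross-tabulates two vectors and requires one predicted label per true label, so this error almost always means `prediction` was computed on a different subset of rows than `W[, 8]` (e.g. predicting on the training rows but tabulating against the full data, or vice versa). A minimal Python cross-tabulation making that length precondition explicit:

```python
# Confusion matrix as a Counter over (predicted, true) pairs, with the same
# same-length precondition that R's table() enforces.
from collections import Counter

def confusion(pred, true):
    if len(pred) != len(true):
        raise ValueError("all arguments must have the same length: "
                         f"{len(pred)} predictions vs {len(true)} labels")
    return Counter(zip(pred, true))
```

The fix on the R side is to predict on exactly the rows you tabulate against, e.g. `table(pred = predict(model, W_test), true = W_test[, 8])`.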

Model is empty, SVM in e1071 package

Submitted by 情到浓时终转凉″ on 2019-12-10 14:32:00
Question: I have a matrix of N examples x 765 features, and a vector of N labels, one for each of those examples. I am trying to use an SVM to classify them and make predictions. It worked in one instance, when I split the whole data into training and validation with this manual half-split:

indicator <- 1:(length(idx)/2)
training <- idx[indicator]
test <- idx[-indicator]

However, if I try to randomize the halves out of each class in the loop by using this:

indicator <- sample(idx, trunc…
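e1071's "Model is empty" typically appears when the randomized split leaves a class entirely absent from the training half (or the labels lose their factor typing after subsetting). A stratified split, which samples half of the indices *within each class*, guarantees both halves keep every class. A pure-Python illustration of the idea (the svm() call itself stays in R/e1071):

```python
# Stratified half-split: sample 50% of indices per class, so neither half
# can end up with a single class.
import random

def stratified_half_split(labels, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        half = len(idx) // 2
        train += idx[:half]
        test += idx[half:]
    return sorted(train), sorted(test)
```

In R the equivalent is sampling within each class (e.g. `unlist(tapply(idx, labels, function(i) sample(i, length(i) %/% 2)))`) rather than `sample(idx, ...)` over the pooled indices.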

sklearn classifier get ValueError: bad input shape

Submitted by 眉间皱痕 on 2019-12-10 12:31:53
Question: I have a CSV whose structure is CAT1,CAT2,TITLE,URL,CONTENT; CAT1, CAT2, TITLE and CONTENT are in Chinese. I want to train LinearSVC or MultinomialNB with X (TITLE) and targets (CAT1, CAT2), and both raise this error. Below is my code (PS: I wrote it following this scikit-learn text analytics example):

import numpy as np
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

label_list = []

def label_map_target(label):
    ''' …
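"bad input shape" is what scikit-learn raises when y is 2-D, which is what passing CAT1 and CAT2 together produces: LinearSVC and MultinomialNB expect a 1-D y. The usual fixes are to train one model per category column, or to join the two labels into a single class string. A sketch of the latter on toy English rows (the data here is illustrative, not from the question's CSV):

```python
# Flatten a two-column target into one 1-D label vector, then train a
# standard text pipeline on it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

titles = ["cheap flights", "cheap hotels", "world cup", "league final"]
cat1 = ["travel", "travel", "sport", "sport"]
cat2 = ["air", "hotel", "football", "football"]

y = [f"{a}/{b}" for a, b in zip(cat1, cat2)]  # 1-D labels, e.g. "travel/air"

clf = Pipeline([("tfidf", TfidfVectorizer()), ("svc", LinearSVC())])
clf.fit(titles, y)
pred = clf.predict(["cheap flights"])
```

If CAT1/CAT2 should be predicted independently, wrap the pipeline per target column instead of joining the labels.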

Difference between random forest implementation

Submitted by ぃ、小莉子 on 2019-12-10 12:21:20
Question: Is there a performance difference between the Random Forest implementation in H2O and standard Random Forest libraries? Has anybody performed an analysis of these two implementations?

Answer 1: Here is an open benchmark you can start with: https://github.com/szilard/benchm-ml

Answer 2: I suppose you are looking for this: http://www.wise.io/tech/benchmarking-random-forest-part-1

Source: https://stackoverflow.com/questions/45190787/difference-between-random-forest-implementation

Extract feature vector from 2d image in numpy

Submitted by 时光怂恿深爱的人放手 on 2019-12-10 11:55:15
Question: I have a series of 2D images of two types, either a star or a pentagon, and my aim is to classify all of them. I have 30 star images and 30 pentagon images. Before I apply the KNN classification algorithm, I need to extract a feature vector from every image. The feature vectors must all be the same size, but the 2D images vary in size. I have read in my images and get back a 2D array with zeros and …
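KNN needs equal-length vectors, so the standard move is to map every binary image onto a common canvas and flatten it. Interpolated resizing (e.g. skimage.transform.resize) is the usual choice; the zero-pad/crop version below needs only NumPy and shows the shape bookkeeping:

```python
# Pad or crop each 2-D 0/1 image to a fixed shape, then flatten to 1-D so
# all feature vectors have identical length.
import numpy as np

def to_feature_vector(img, shape=(32, 32)):
    """Pad/crop a 2-D array to `shape`, then flatten to 1-D."""
    canvas = np.zeros(shape, dtype=img.dtype)
    h = min(img.shape[0], shape[0])
    w = min(img.shape[1], shape[1])
    canvas[:h, :w] = img[:h, :w]
    return canvas.ravel()
```

Stacking the 60 resulting vectors gives a (60, 32*32) matrix ready for KNN. Interpolated resizing usually classifies better than padding, since it normalizes shape scale rather than just vector length.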

Features selection with sequentialfs with libsvm

Submitted by 。_饼干妹妹 on 2019-12-10 11:53:01
Question: I want to use the MATLAB toolbox to do feature selection. There is a good function there called sequentialfs that does the job well. However, I could not integrate it with the LibSVM functions to perform feature selection. It works fine with knnclassify; can somebody help me, please? Here is the code for knnclassify:

fun1 = @(XT,yT,Xt,yt)...
    (sum((yt ~= knnclassify(Xt,XT,yT,5))));
[fs,history] = sequentialfs(fun1,data,label,'cv',c,'options',opts,'direction','forward');

Answer 1: You'll need to wrap the …
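The criterion handle is the only LibSVM-specific part: sequentialfs just needs a function that takes train/test splits and returns an error count, so wrapping svmtrain/svmpredict there (in place of knnclassify) is the whole integration. To make the mechanism concrete, here is a generic pure-Python sketch of forward selection around any such criterion callback:

```python
# Greedy sequential forward selection: repeatedly add the feature whose
# inclusion minimises criterion(subset); stop when no addition improves it.

def forward_select(n_features, criterion, max_features=None):
    selected, best_err = [], float("inf")
    remaining = set(range(n_features))
    while remaining and len(selected) != max_features:
        # Score every candidate extension of the current subset.
        err, feat = min((criterion(selected + [f]), f) for f in remaining)
        if err >= best_err:
            break  # no candidate improves the criterion
        best_err, selected = err, selected + [feat]
        remaining.discard(feat)
    return selected, best_err
```

In MATLAB terms, `criterion` corresponds to fun1, which for LibSVM would call `svmtrain(yT, XT(:,subset), ...)` and count `yt ~= svmpredict(...)`.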

Preprocess large datafile with categorical and continuous features

Submitted by 五迷三道 on 2019-12-10 11:17:31
Question: First, thanks for reading me, and thanks a lot if you can give any clue to help me solve this. As I'm new to scikit-learn, don't hesitate to provide any advice that can help me improve the process and make it more professional. My goal is to classify data between two categories, and I would like to find a solution that gives me the most precise result. At the moment, I'm still looking for the most suitable algorithm and data preprocessing. In my data I have 24 values: 13 are nominal, 6 …
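The standard scikit-learn shape for mixed nominal/continuous input is a ColumnTransformer that one-hot encodes the categorical columns and standardises the numeric ones, after which any classifier can consume the result. A minimal sketch (column indices and data here are illustrative; the question has 13 nominal and several continuous columns):

```python
# Preprocess mixed-type columns: one-hot for nominal, standardisation for
# continuous, in a single fit/transform step.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = np.array([["red", 1.0], ["blue", 2.0], ["red", 3.0]], dtype=object)

pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), [0]),  # nominal columns
    ("num", StandardScaler(), [1]),                        # continuous columns
])
Xt = pre.fit_transform(X)
```

Wrapping `pre` and a classifier in a Pipeline keeps the preprocessing inside cross-validation, which matters when comparing algorithms for precision.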