classification

Map predictions back to IDs - Python Scikit Learn DecisionTreeClassifier

最后都变了 - submitted on 2019-12-06 12:36:09
I have a dataset that has a unique identifier and other features. It looks like this:

ID       LenA  TypeA  LenB  TypeB  Diff  Score  Response
123-456  51    M      101   L      50    0.2    0
234-567  46    S      49    S      3     0.9    1
345-678  87    M      70    M      17    0.7    0

I split it up into training and test data. I am trying to classify the test data into two classes with a classifier trained on the training data. I want to keep the identifier in the training and testing datasets so I can map the predictions back to the IDs. Is there a way to mark the identifier column as an ID or non-predictor, as you can in Azure ML Studio or SAS? I am using the…
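One common pattern is to hold the identifier out of the feature matrix and rejoin it after prediction. A minimal pandas/scikit-learn sketch, assuming the data lives in a CSV with the column names shown above (the filename and split parameters are placeholders):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("data.csv")  # placeholder filename

ids = df["ID"]                                           # keep the identifier aside
y = df["Response"]
X = pd.get_dummies(df.drop(columns=["ID", "Response"]))  # encode TypeA/TypeB

# Splitting ids alongside X and y keeps the rows aligned across all three.
X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(
    X, y, ids, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)

# Map predictions back to the identifiers.
results = pd.DataFrame({"ID": ids_test.values, "prediction": clf.predict(X_test)})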

Retrieve final hidden activation layer output from sklearn's MLPClassifier

不羁的心 - submitted on 2019-12-06 12:02:59
I would like to run some tests on the final hidden activation layer outputs of a neural network, using sklearn's MLPClassifier after fitting the data. For example, if I create a classifier, assuming data X_train with labels y_train and two hidden layers of sizes (300, 100): clf = MLPClassifier(hidden_layer_sizes=(300,100)) clf.fit(X_train,y_train) I would like to be able to call a function somehow to retrieve the final hidden activation layer vector of length 100 for use in additional tests. Assuming a test set X_test, y_test, normal prediction would be: preds = clf.predict(X_test) But I would like to…
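MLPClassifier does not expose hidden activations through its public API, but the fitted weights do: coefs_ and intercepts_ hold one weight matrix and bias vector per layer transition. A sketch that re-runs the forward pass by hand, assuming the default activation='relu' (substitute np.tanh etc. if the classifier was configured differently):

import numpy as np

def final_hidden_activations(clf, X):
    a = np.asarray(X, dtype=float)
    # coefs_[i] / intercepts_[i] map layer i to layer i+1; the last pair
    # produces the output layer, so stop one transition short of it.
    for W, b in zip(clf.coefs_[:-1], clf.intercepts_[:-1]):
        a = np.maximum(a @ W + b, 0)  # ReLU
    return a  # shape (n_samples, 100) for hidden_layer_sizes=(300, 100)

hidden = final_hidden_activations(clf, X_test)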

Bizarre Behavior of randomForest Package When Dropping One Prediction Class

a 夏天 - submitted on 2019-12-06 11:41:59
Question: I am running a random forest model that produces results that make absolutely no sense to me from a statistical perspective, so I'm convinced that something must be going wrong code-wise in the randomForest package. The predicted (left-hand-side) variable is, in at least this iteration of the model, a party ID with 3 possible outcomes: Democrat, Independent, Republican. I run the model and get results, fine. At this point I'm not especially concerned with the results per se, but rather with what…

What are the steps needed to use the Mahout Naive Bayes Classifier Algorithm?

Deadly - submitted on 2019-12-06 11:23:37
Question: I am trying to use the Naive Bayes classifier to detect fraudulent transactions. I have a sample of around 5000 records in an Excel sheet; this is the data I will use to train the classifier. I also have test data of around 1000 records on which I will run the trained classifier. My problem is that I don't know how to train the classifier. Do I need to transform my training data into some specific format before passing it to the training step? How will the training step know which column is my target…
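Whatever the implementation, the trainer needs two things: a numeric feature matrix and an explicitly designated label column (Mahout in particular expects the data converted to sequence files and vectors via its command-line tooling before trainnb runs). As an illustration of the shape the data must take, here is a scikit-learn sketch rather than Mahout's API, with hypothetical file and column names:

import pandas as pd
from sklearn.naive_bayes import GaussianNB

train = pd.read_csv("train.csv")  # assumed CSV export of the Excel sheet
y_train = train["is_fraud"]       # the target column must be named explicitly
X_train = pd.get_dummies(train.drop(columns=["is_fraud"]))  # numeric features only

clf = GaussianNB().fit(X_train, y_train)

test = pd.get_dummies(pd.read_csv("test.csv"))
test = test.reindex(columns=X_train.columns, fill_value=0)  # align feature columns
preds = clf.predict(test)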

Does stemming harm precision in text classification?

泄露秘密 - submitted on 2019-12-06 07:32:53
Question: I have read that stemming harms precision but improves recall in text classification. How does that happen? When you stem, you increase the number of matches between the query and the sample documents, right? Answer 1: It's the usual trade-off: raising recall means generalising, and by doing so you lose precision. Stemming merges words together. On the one hand, words which ought to be merged together (such as "adhere" and "adhesion") may remain distinct after stemming; on the other,…
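Both failure modes are easy to see directly. A sketch using NLTK's Porter stemmer (the classic example pairs; the actual precision impact depends on the corpus):

from nltk.stem import PorterStemmer

stem = PorterStemmer().stem
# Under-stemming: related words stay distinct, so the hoped-for recall gain is missed.
print(stem("adhere"), stem("adhesion"))      # 'adher' vs 'adhes' -- still distinct
# Over-stemming: unrelated words are conflated, which is what hurts precision.
print(stem("universe"), stem("university"))  # both 'univers' -- wrongly merged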

Preprocess large datafile with categorical and continuous features

强颜欢笑 - submitted on 2019-12-06 07:28:54
First, thanks for reading this, and thanks a lot if you can give any clue that helps me solve it. As I'm new to scikit-learn, don't hesitate to provide any advice that can help me improve the process and make it more professional. My goal is to classify data into two categories, and I would like to find the solution that gives the most precise results. At the moment, I'm still looking for the most suitable algorithm and data preprocessing. In my data I have 24 values: 13 are nominal, 6 are binarized, and the others are continuous. Here is an example of a line: "RENAULT";"CLIO III";"CLIO III…
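For this mix of column types, scikit-learn's ColumnTransformer is the usual tool: one-hot encode the nominal columns, pass the binarized ones through, and standardise the continuous ones. A minimal sketch; the column indices and the final estimator are placeholders for the actual 13/6/5 split and whatever algorithm is chosen:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

nominal = list(range(0, 13))      # placeholder indices: the 13 nominal columns
binary = list(range(13, 19))      # the 6 already-binarized columns
continuous = list(range(19, 24))  # the 5 continuous columns

pre = ColumnTransformer([
    ("nom", OneHotEncoder(handle_unknown="ignore"), nominal),
    ("bin", "passthrough", binary),
    ("num", StandardScaler(), continuous),
])

model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
# model.fit(X_train, y_train); model.predict(X_test)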

Precision and recall in fastText?

故事扮演 - submitted on 2019-12-06 06:39:20
Question: I implemented fastText for text classification, following https://github.com/facebookresearch/fastText/blob/master/tutorials/supervised-learning.md I was wondering what precision@1 or P@5 means. I did a binary classification, but when I tested with different numbers I didn't understand the results:

haos-mbp:fastText hao$ ./fasttext test trainmodel.bin train.valid 2
N 312
P@2 0.5
R@2 1
Number of examples: 312
haos-mbp:fastText hao$ ./fasttext test trainmodel.bin train.valid 1
N 312
P@1 0.712
R@1 0.712
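The definitions explain those numbers: P@k counts correct labels among the k predictions requested per example, and R@k counts how many of the true labels were retrieved. With one true label per example, asking for k=2 predictions means at most one of the two can be right, so P@2 is capped at 0.5 while R@2 reaches 1; at k=1 precision and recall coincide. A sketch of the computation (not fastText's internal code):

def p_r_at_k(true_labels, predicted_topk, k):
    # hits = examples whose single true label appears in their top-k predictions
    hits = sum(t in preds[:k] for t, preds in zip(true_labels, predicted_topk))
    precision = hits / (k * len(true_labels))  # correct / total predictions made
    recall = hits / len(true_labels)           # correct / total true labels
    return precision, recall

# Binary problem, k=2: every true label is among the 2 predictions,
# so recall = 1.0 and precision = 312 / (2 * 312) = 0.5.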

OpenCV SVM train_auto Insufficient Memory

梦想的初衷 - submitted on 2019-12-06 06:15:51
This is my first post here, so I hope I'm asking my question properly :-) I want to do "elephant detection" by classifying color samples (I was inspired by this paper). This is the pipeline of my "solution" up to the training of the classifier: loading a set of 4 training images (all containing an elephant) and splitting each into two images, one containing the environment surrounding the elephant (the "background") and one containing the elephant (the "foreground"); mean shift segmentation of the backgrounds and foregrounds; RGB -> Luv color space conversion and pixel values…
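For reference, the described steps map onto OpenCV's Python bindings roughly as follows. A sketch only: the original question uses the C++ API, trainAuto is exposed to Python only in newer 3.x/4.x releases, the filenames and mean-shift parameters are placeholders, and subsampling stands in for whatever the actual memory fix turns out to be:

import cv2
import numpy as np

def luv_pixels(path, sp=21, sr=51):
    # Mean shift segmentation, then BGR -> Luv; returns one Luv pixel per row.
    seg = cv2.pyrMeanShiftFiltering(cv2.imread(path), sp, sr)
    return cv2.cvtColor(seg, cv2.COLOR_BGR2Luv).reshape(-1, 3).astype(np.float32)

fg = luv_pixels("foreground.png")  # placeholder filenames
bg = luv_pixels("background.png")
samples = np.vstack([fg, bg])
labels = np.vstack([np.ones((len(fg), 1), np.int32),    # 1 = elephant
                    np.zeros((len(bg), 1), np.int32)])  # 0 = environment

# trainAuto cross-validates C/gamma over a grid; with millions of pixel
# samples that multiplies memory use, so subsample the pixels first.
idx = np.random.choice(len(samples), size=min(len(samples), 20000), replace=False)
svm = cv2.ml.SVM_create()
svm.setKernel(cv2.ml.SVM_RBF)
svm.trainAuto(samples[idx], cv2.ml.ROW_SAMPLE, labels[idx])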

Large scale naïve Bayes classifier with top-k output

你说的曾经没有我的故事 - submitted on 2019-12-06 05:50:41
I need a library for large-scale naïve Bayes, with millions of training examples and 100k+ binary features. It must be an online version (updatable after training). I also need top-k output, that is, multiple classifications for a single instance. Accuracy is not very important. The purpose is an automatic text categorization application. Any suggestions for a good library are much appreciated. EDIT: The library should preferably be in Java. If a learning algorithm other than naïve Bayes is also acceptable, then check out Vowpal Wabbit (C++), which has the reputation of being one of the best…
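Whichever library ends up being used, the two requirements are straightforward to express; here is a scikit-learn sketch (Python rather than the requested Java, shown only to illustrate online updates plus top-k output; stream_of_batches is a hypothetical minibatch generator):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

classes = np.arange(100)  # placeholder: the full set of category labels
clf = MultinomialNB()

for X_batch, y_batch in stream_of_batches():  # hypothetical minibatch source
    clf.partial_fit(X_batch, y_batch, classes=classes)  # online, updatable

def top_k(clf, X, k=5):
    proba = clf.predict_proba(X)
    # argsort ascending, keep the last k columns, reverse to best-first order
    return clf.classes_[np.argsort(proba, axis=1)[:, -k:][:, ::-1]]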

How to balance unbalanced classification 1:1 with SMOTE in R

落爺英雄遲暮 - submitted on 2019-12-06 05:46:17
Question: I am doing binary classification and my current target class is composed of: Bad: 3126, Good: 25038. I want the number of Bad (minority) examples to equal the number of Good examples (1:1), so Bad needs to grow by roughly 8x (an extra 21912 SMOTEd instances) without increasing the majority (Good). The code I am trying does not keep the number of Good examples constant. Code I have tried, example 1: library(DMwR) smoted_data <- SMOTE(targetclass~., data, perc.over=700, perc.under=0, k=5, learner…
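For context on the arithmetic: in DMwR's parameterisation, perc.over=700 generates 7 synthetic cases per minority example (7 x 3126 = 21882, just short of exact 1:1), while perc.under sets how many majority cases are kept relative to the synthetic ones, so perc.under=0 discards the majority entirely rather than leaving it untouched. Python's imbalanced-learn expresses the stated goal directly; a swapped-in alternative, not a DMwR fix:

from imblearn.over_sampling import SMOTE

# sampling_strategy=1.0: oversample the minority until the classes are 1:1,
# leaving the majority untouched (Good stays 25038; Bad grows to 25038).
sm = SMOTE(sampling_strategy=1.0, k_neighbors=5, random_state=0)
X_res, y_res = sm.fit_resample(X, y)  # X, y assumed to hold features / targetclass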