classification

How to create a dendrogram with colored branches?

这一生的挚爱 submitted on 2019-12-11 07:59:13
Question: This question was migrated from Cross Validated because it can be answered on Stack Overflow. I would like to create a dendrogram in R which has colored branches, like the one shown below. So far I have used the following commands to create a standard dendrogram: d <- dist(as.matrix(data[,29])) # find distance matrix hc <- hclust(d) # apply hierarchical clustering plot(hc, labels=data[,1], main="", xlab="") # plot the dendrogram How should I modify this code to obtain the desired
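The question is about R, but the idea translates directly: color branches below a chosen merge height differently per cluster. A minimal sketch in Python with scipy (the 1-D feature and the two-cluster data are hypothetical stand-ins for data[,29]); scipy's dendrogram does this via its color_threshold parameter:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical 1-D feature, standing in for data[,29] in the question
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 20), rng.normal(8, 1, 20)]).reshape(-1, 1)

Z = linkage(X, method="complete")  # hierarchical clustering

# Links merging below color_threshold are colored per cluster;
# links above it share a single "trunk" color.
d = dendrogram(Z, color_threshold=0.5 * Z[-1, 2], no_plot=True)
```

With no_plot=True the returned dict exposes 'color_list' (one color per link), which is what a plotted call would draw; dropping no_plot=True renders the colored dendrogram with matplotlib.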

IBM Watson Visual Recognition: {"code":400,"error":"Cannot execute learning task. : no classifier name given"}

北慕城南 submitted on 2019-12-11 07:32:16
Question: When I try to train a classifier with two positive classes using my API key (each class contains around 1200 images) in Watson Visual Recognition, it returns "no classifier name given" - even though I have already provided one. This is the code: $ curl -X POST -F "blank_positive_examples=@C:\Users\rahansen\Desktop\Altmuligt\training\no_ocd\no_ocd.zip" -F "OCD_positive_examples=@C:\Users\rahansen\Desktop\Altmuligt\training\ocd\ocd.zip" -F "name=disease" "https://gateway-a.watsonplatform

How to optimize SciKit one-class training time?

廉价感情. submitted on 2019-12-11 07:02:53
Question: Essentially my question is the same as SciKit One-class SVM classifier training time increases exponentially with size of training data, but no one has figured out the problem. It seems to run fine for somewhere in the tens of thousands of samples, but hundreds of thousands take very long. I want to run it on tens of millions, but I don't want to wait a day and a half (maybe even more) for nothing to come of it. Is there a faster way to do it, or should I use something else? Answer 1: I'm very junior in this

Same form of dataset has 2 different shapes

会有一股神秘感。 submitted on 2019-12-11 06:09:47
Question: I am quite new to Machine Learning and am just grasping the techniques. As such, I am trying to train a model on the following classifiers, using a dataset that has 4 features and the target feature/class (a truth value of 1 or 0). Classifiers: SGD Classifier, Random Forest Classifier, Linear Support Vector Classifier, Gaussian Process Classifier. I am training the model on the following dataset [part of the dataset is shown below]. Training set: train_sop_truth.csv Subject,Predicate,Object
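A minimal sketch of fitting the four listed classifiers in scikit-learn (synthetic stand-in data, since only a fragment of train_sop_truth.csv is shown):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in: 4 features, binary 0/1 target, as in the question
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

classifiers = {
    "sgd": SGDClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "linear_svc": LinearSVC(random_state=0),
    "gaussian_process": GaussianProcessClassifier(random_state=0),
}
# All four share the same fit/score API, so one loop handles them all
scores = {name: clf.fit(X, y).score(X, y) for name, clf in classifiers.items()}
```

In practice each classifier would be evaluated on a held-out split rather than the training set.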

How to check if pandas dataframe rows have certain values in various columns, scalability

て烟熏妆下的殇ゞ submitted on 2019-12-11 06:05:50
Question: I have implemented the CN2 classification algorithm; it induces rules to classify the data of the form: IF Attribute1 = a AND Attribute4 = b THEN class = class 1. My current implementation loops through a pandas DataFrame containing the training data using the iterrows() function and returns True or False for each row depending on whether it satisfies the rule; however, I am aware this is a highly inefficient solution. I would like to vectorise the code; my current attempt is like so: DataFrame = df age
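The iterrows() loop can be replaced by one boolean mask built column-by-column. A sketch (the DataFrame contents and the rule dict are hypothetical, since the question's data is truncated):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 40, 40, 31],
    "income": ["low", "high", "high", "low"],
    "cls": [0, 1, 1, 0],
})

# A CN2-style rule as column -> required value, e.g.
# IF age = 40 AND income = high THEN class = 1
rule = {"age": 40, "income": "high"}

# Vectorized: AND together one elementwise comparison per condition,
# instead of testing each row inside iterrows()
mask = pd.Series(True, index=df.index)
for col, val in rule.items():
    mask &= df[col] == val

covered = df[mask]  # rows satisfying every condition of the rule
```

Each comparison runs over the whole column at C speed, so cost scales with the number of conditions rather than the number of rows times conditions in Python-level loops.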

Weka decimal precision

匆匆过客 submitted on 2019-12-11 05:57:52
Question: After getting very excited by what seemed like excellent results from using the MLP within the Weka GUI on my pricing data, I've coded up a bit of Java that uses an MLP with the same parameters. Here is where the fun starts: the results are completely different, and I've now found that this appears to be due to rounding differences. The GUI rounds to 3 dp, my Java code rounds to 5 dp. I've looked through the manuals but I can't seem to find an option to force the GUI to use 5 dp precision on

NLTK naive Bayes classifier memory issue

偶尔善良 submitted on 2019-12-11 05:47:21
Question: My first post here! I have problems using the nltk NaiveBayesClassifier. I have a training set of 7000 items. Each training item has a description of 2 or 3 words and a code. I would like to use the code as the label of the class and each word of the description as a feature. An example: "My name is Obama", 001 ... Training set = {[feature['My']=True, feature['name']=True, feature['is']=True, feature['Obama']=True], 001} Unfortunately, using this approach, the training procedure NaiveBayesClassifier
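NLTK's NaiveBayesClassifier keeps one Python dict of features per training item in memory, which gets expensive. One alternative (a substitution on my part, not from the thread) is scikit-learn's multinomial naive Bayes over sparse hashed vectors, which never materializes per-item dicts or a vocabulary. A sketch with a tiny stand-in for the 7000 (description, code) pairs:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny stand-in for the (description, code) training items
texts = ["My name is Obama", "red sport car", "blue family car"]
labels = ["001", "002", "002"]

# HashingVectorizer is stateless and emits sparse vectors: no vocabulary
# and no per-item feature dicts are held in memory.
# alternate_sign=False keeps counts non-negative, as MultinomialNB needs.
vec = HashingVectorizer(n_features=2**16, alternate_sign=False)
X = vec.transform(texts)

clf = MultinomialNB().fit(X, labels)
pred = clf.predict(vec.transform(["name is Obama"]))[0]
```

For datasets too large even for a single fit, MultinomialNB also supports partial_fit for batch-wise training.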

SGD Classifier partial fit learning with different dimensional input data

醉酒当歌 submitted on 2019-12-11 05:45:10
Question: I am trying to perform SGD classification on one-hot encoded data. I did a fit on my training examples and want to perform partial_fit on fewer data at a later time. I understand the error being thrown is because of the dimension change between the fit data and the partial_fit data. I also understand I need to perform a data transform on my hot_new_df, but I am unsure how. In[29] -- is where I am doing fit(). In[32] -- is where I am doing partial_fit(). I have just presented a hypothetical example here...
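The usual fix is to fit the encoder once and reuse that same fitted instance to transform every later batch, so all batches share one column layout. A sketch with hypothetical categorical data, using DictVectorizer (OneHotEncoder with handle_unknown="ignore" works the same way):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical categorical records standing in for the question's data
train = [{"color": "red", "size": "S"}, {"color": "blue", "size": "L"}]
y_train = [0, 1]

# Fit the vectorizer ONCE on the initial data...
vec = DictVectorizer()
X = vec.fit_transform(train)

clf = SGDClassifier(random_state=0)
clf.partial_fit(X, y_train, classes=[0, 1])

# ...then for later batches call transform (NOT fit_transform), which
# keeps the dimensions identical; unseen categories are simply dropped.
later = [{"color": "red", "size": "L"}]
X_new = vec.transform(later)
clf.partial_fit(X_new, [0])
```

Calling fit_transform again on the smaller batch is exactly what produces the dimension-mismatch error, because the new batch defines a different (smaller) set of one-hot columns.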

plot one of 500 trees in randomForest package

自古美人都是妖i submitted on 2019-12-11 04:54:58
Question: How can I plot the trees in the output of the randomForest() function from the R package of the same name? For example, I use the iris data and want to plot the first tree of the 500 output trees. My code is: model <- randomForest(Species~., data=iris, ntree=500) Answer 1: You can use the getTree() function in the randomForest package (official guide: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf) On the iris dataset: require(randomForest) data(iris) ## we have a look at the k-th tree in the forest k <- 10 getTree
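The answer's getTree() is R-specific; as a side-by-side sketch, the analogous move in Python's scikit-learn is indexing the fitted forest's estimators_ list and rendering the chosen tree:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

iris = load_iris()
model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(iris.data, iris.target)

# Pull out the k-th tree of the forest (0-based) and render its rules
k = 9
tree = model.estimators_[k]
rules = export_text(tree, feature_names=list(iris.feature_names))
```

sklearn.tree.plot_tree(tree) would draw the same tree graphically with matplotlib instead of printing text rules.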

ROC curve and libsvm

流过昼夜 submitted on 2019-12-11 04:30:17
Question: Given a ROC curve drawn with plotroc.m (see here): Theoretical question: how do I select the best threshold to use? Programming question: how do I induce the libsvm classifier to work with the selected (best) threshold? Answer 1: A ROC curve is a plot generated by plotting the fraction of true positives on the y-axis versus the fraction of false positives on the x-axis. So the coordinates of any point (x,y) on the ROC curve indicate the FPR and TPR values at a particular threshold. As shown in the figure, we find the point (x,y) on
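One common selection criterion (Youden's J, the point farthest above the diagonal; the answer's figure-based argument is equivalent) can be sketched in Python with scikit-learn. The labels and decision scores below are hypothetical stand-ins for libsvm's decision values:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and decision scores (stand-ins for libsvm output)
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Youden's J statistic: maximize TPR - FPR over all candidate thresholds
j = tpr - fpr
best = thresholds[np.argmax(j)]

# To "induce" the classifier to use it, threshold the raw decision
# values yourself instead of relying on the default cut at 0.
y_pred = (scores >= best).astype(int)
```

Other criteria (e.g. the point closest to the top-left corner, or a cost-weighted trade-off) plug into the same loop: compute a score per threshold and take the argmax.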