Classification

How to cluster users based on tags

Submitted by 穿精又带淫゛_ on 2019-12-11 04:11:51
Question: I'd like to cluster users based on the categories or tags of the shows they watch. What's the easiest/best algorithm for this? Assuming I have around 20,000 tags and several million watch events to use as signals, is there an algorithm I could implement using, say, Pig/Hadoop/Mortar, or perhaps on Neo4j? In terms of data I have users, the programs they've watched, and the tags each program has (usually around 10 tags per program). At the end I would expect k clusters (maybe a…
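One common approach to the question above is to build a sparse user-by-tag count matrix from the watch events and run k-means over it. The sketch below shows the idea on synthetic data (all IDs and sizes are made up; at the asker's scale this would need a distributed implementation such as Pig/Hadoop, but the shape of the computation is the same):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import MiniBatchKMeans

# Hypothetical watch events as (user_id, tag_id) pairs, derived from
# "user watched program -> program has tags". All values are synthetic.
rng = np.random.default_rng(0)
n_users, n_tags, n_events = 200, 50, 5000
users = rng.integers(0, n_users, n_events)
tags = rng.integers(0, n_tags, n_events)

# Sparse user x tag count matrix: entry (u, t) = how often user u
# watched something tagged t.
counts = csr_matrix((np.ones(n_events), (users, tags)),
                    shape=(n_users, n_tags))

# Cluster users into k groups by their tag profiles.
k = 5
km = MiniBatchKMeans(n_clusters=k, n_init=10, random_state=0)
labels = km.fit_predict(counts)
print(labels.shape)  # one cluster label per user
```

In practice the raw counts are often TF-IDF weighted or L2-normalized first, so that heavy watchers don't dominate the distance computation.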

MATLAB - Classification output

Submitted by 烈酒焚心 on 2019-12-11 03:52:02
Question: My program uses K-means clustering with the number of clusters set by the user. Here k = 4, but I would like to run the clustered data through MATLAB's naive Bayes classifier afterwards. Is there a way to split the clusters up and feed them into different naive Bayes classifiers in MATLAB?

Naive Bayes:

    class = classify(test, training, target_class, 'diaglinear');

K-means:

    %% generate sample data
    K = 4;
    numObservarations = 5000;
    dimensions = 42;
    %% cluster
    opts = statset('MaxIter', 500, …
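The question is about MATLAB, but the "split by cluster, then train one classifier per cluster" idea is language-neutral. A Python sketch of the same pattern, on synthetic data with the question's dimensions (the target labels here are invented purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 42))   # mirrors numObservarations x dimensions
y = rng.integers(0, 2, 5000)      # hypothetical class labels

K = 4
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)

# Split the rows by cluster label and train one naive Bayes
# classifier per cluster.
models = {}
for c in range(K):
    mask = labels == c
    models[c] = GaussianNB().fit(X[mask], y[mask])

print(sorted(models))  # one fitted classifier per cluster: [0, 1, 2, 3]
```

In MATLAB the equivalent step is logical indexing on the vector of cluster indices returned by kmeans, e.g. selecting the rows where the index equals c before calling classify.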

Closure axiom for instances so that the reasoner can correctly classify instances in an ontology

Submitted by 懵懂的女人 on 2019-12-11 02:39:38
Question: I am a geographer and a newcomer to the field of ontology, trying to make sense of the two. I have created a simple ontology, as follows:

    Thing
        Feature
            Lane
            Segment (equivalent to Arc)
        Geometry
            Arc (equivalent to Segment)
            Node
                Dangling_Node
                Intersection_node

You can find the .owl file here, instantiated with a very simple spatial road dataset (fig1). The ontology is consistent both without and with instances, but when I run the reasoner, the Dangling_node instances (nodes that are…

Combination of SMOTE and undersampling in Weka

Submitted by 北城余情 on 2019-12-11 01:59:44
Question: According to the paper by Chawla et al. (2002), the best performance when balancing data comes from combining undersampling with SMOTE. I've tried to combine undersampling and SMOTE on my dataset, but I am a bit confused about the attribute for undersampling. In Weka there is Resample to decrease the majority class, and Resample has an attribute:

    biasToUniformClass -- Whether to use bias towards a uniform class.
    A value of 0 leaves the class distribution as-is; a value of 1 ensures the…
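The Weka attribute question aside, the Chawla et al. recipe itself is easy to illustrate. Below is a toy numpy sketch of the two halves (this is not Weka's implementation, and the class sizes and interpolation are simplified for illustration): SMOTE-style oversampling interpolates new minority points between existing ones, then random undersampling shrinks the majority class:

```python
import numpy as np

rng = np.random.default_rng(0)
X_maj = rng.normal(0, 1, size=(900, 2))   # majority class
X_min = rng.normal(3, 1, size=(100, 2))   # minority class

# SMOTE-style oversampling: each synthetic point lies on the line
# between a random minority point and one of its k nearest minority
# neighbours.
def smote_like(X, n_new, k=5, rng=rng):
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]        # k nearest neighbours, excluding self
        j = rng.choice(nn)
        lam = rng.random()
        synth.append(X[i] + lam * (X[j] - X[i]))
    return np.array(synth)

X_min_up = np.vstack([X_min, smote_like(X_min, 200)])   # 100 -> 300 samples

# Random undersampling: keep only 2x the new minority size from the
# majority class.
keep = rng.choice(len(X_maj), size=2 * len(X_min_up), replace=False)
X_maj_down = X_maj[keep]

print(len(X_min_up), len(X_maj_down))  # 300 600
```

For real projects, the imbalanced-learn library provides tested SMOTE and undersampling implementations that can be chained in a pipeline.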

How to disable the console output in libsvm (java)

Submitted by 心已入冬 on 2019-12-11 01:59:19
Question: I am using libsvm in Java and am experiencing issues similar to those described here for Python. I get a lot of console output during training and prediction and would like to disable it. Sadly, due to a "Service Temporarily Unavailable" error I can't access the website where this might be described (here). I couldn't find a Java-specific way to disable these warnings (if I overlooked something, I apologize). The output always looks quite similar to this:

    optimization finished, #iter = 10000000
    nu = 0…

ValueError: setting an array element with a sequence in scikit-learn (sklearn) using GaussianNB

Submitted by 99封情书 on 2019-12-11 01:47:33
Question: I am trying to build a sklearn image classifier, but I am unable to fit the data:

    x_train = np.array(im_matrix)
    y_train = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    clf = GaussianNB()
    clf.fit(x_train, y_train)

At clf.fit(x_train, y_train) I get the following error:

    ValueError: setting an array element with a sequence.

im_matrix is an array holding image matrices:

    for file in files:
        path = os.path.join(root, file)
        im_matrix.append(mpimg.imread(path))

The shape of x_train is (10, 1); the shape of y…
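That error, together with an x_train shape of (10, 1), usually means the images have different shapes, so numpy could only build an object array rather than a 2-D numeric array. scikit-learn expects a (n_samples, n_features) matrix, so each image needs to be brought to a common shape and flattened. A sketch using random arrays in place of mpimg.imread (the 8x8 size is arbitrary):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Stand-ins for mpimg.imread(path): ten images, all with the SAME shape.
# Real images of differing sizes must be resized/cropped first.
im_matrix = [rng.random((8, 8)) for _ in range(10)]

# Flatten each image so x_train becomes a 2-D (n_samples, n_features)
# array; np.stack fails loudly if the shapes still differ.
x_train = np.stack([im.ravel() for im in im_matrix])
y_train = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

clf = GaussianNB().fit(x_train, y_train)
print(x_train.shape)  # (10, 64)
```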

How to set intercept_scaling in scikit-learn LogisticRegression

Submitted by 十年热恋 on 2019-12-11 01:19:33
Question: I am using scikit-learn's LogisticRegression object for regularized binary classification. I've read the documentation on intercept_scaling, but I don't understand how to choose this value intelligently. The datasets look like this:

- 10-20 features, 300-500 replicates
- Highly non-Gaussian; in fact most observations are zeros
- The output classes are not necessarily equally likely; in some cases they are almost 50/50, in other cases more like 90/10

Typically C=0.001 gives good cross…
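For context: intercept_scaling only takes effect with the liblinear solver, which appends a synthetic constant feature equal to intercept_scaling to the data, so the intercept gets regularized along with the weights; larger values reduce that regularization of the intercept. A sketch on synthetic data (the dataset and the values tried are arbitrary, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced synthetic data, loosely mimicking the 90/10 case described.
X, y = make_classification(n_samples=400, n_features=15,
                           weights=[0.9, 0.1], random_state=0)

# liblinear appends a constant feature equal to intercept_scaling, so
# the fitted intercept shrinks less as intercept_scaling grows.
for scale in (1.0, 10.0, 100.0):
    clf = LogisticRegression(C=0.001, solver="liblinear",
                             intercept_scaling=scale).fit(X, y)
    print(scale, clf.intercept_)
```

With strong regularization (small C), watching how clf.intercept_ changes across values like these is one practical way to see whether intercept_scaling matters for a given dataset.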

Weka - differences between Explorer and Experimenter outcomes

Submitted by 泪湿孤枕 on 2019-12-11 01:04:18
Question: I just wondered why the percentage correctly classified differs between the Explorer and Experimenter aspects of Weka. I have checked that I am using 10-fold cross-validation and that all other parameters match! Does anyone have any ideas? Thanks

Answer 1: I have the solution, as provided by Mark Hall when I emailed him on the Weka mailing list. Here is the difference between Explorer and Experimenter: the Experimenter operates differently from the Explorer. The Explorer sums evaluation metrics over the folds…

Is it necessary to run random forest and cross-validation at the same time?

Submitted by 萝らか妹 on 2019-12-10 22:46:41
Question: Random forest is a robust algorithm. It trains several small trees and provides OOB accuracy. However, is it necessary to also run cross-validation with random forest?

Answer 1: OOB error is an unbiased estimate of the error for random forests, so that's great. But what are you using the cross-validation for? If you are comparing the RF against some other algorithm that doesn't use bagging in the same way, you want a low-variance way to compare them. You have to use…
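The two estimates the answer contrasts are easy to compute side by side in scikit-learn. A sketch on synthetic data (dataset and forest size are arbitrary): the OOB score comes essentially for free from the bootstrap samples of a single fit, while cross-validation refits the forest once per fold but yields an estimate directly comparable to non-bagging algorithms evaluated the same way:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# OOB accuracy: each tree is scored on the samples left out of its
# bootstrap sample, so no separate hold-out or refitting is needed.
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
print(f"OOB accuracy:       {rf.oob_score_:.3f}")

# 5-fold cross-validation: refits the forest five times, but the
# protocol is identical to what you'd use for any other classifier.
cv = cross_val_score(RandomForestClassifier(n_estimators=200,
                                            random_state=0), X, y, cv=5)
print(f"5-fold CV accuracy: {cv.mean():.3f}")
```

On most datasets the two numbers land close together, which is the practical content of "OOB error is an unbiased estimate".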

Discrete and Continuous Classifier on Sparse Data

Submitted by 和自甴很熟 on 2019-12-10 19:08:26
Question: I'm trying to classify examples that contain both discrete and continuous features. The examples also represent sparse data, so even though the system may have been trained on 100 features, a given example may only have 12. What would be the best classifier algorithm to accomplish this? I've been looking at Bayes, Maxent, decision trees, and KNN, but I'm not sure any fits the bill exactly. The biggest sticking point I've found is that most implementations don't support sparse data sets and…
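One concrete way to handle this combination in scikit-learn (offered as an illustration, not the definitive answer to the question): one-hot encode the discrete features, keep the continuous ones as-is, stack both into one scipy sparse matrix, and feed it to a linear model that accepts sparse input, where absent features are simply zeros. All data below is synthetic:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 300
X_disc = rng.integers(0, 5, size=(n, 3))   # discrete (categorical) features
X_cont = rng.normal(size=(n, 4))           # continuous features
y = (X_cont[:, 0] + X_disc[:, 0] > 2).astype(int)

# One-hot encode the discrete columns (sparse output), keep the
# continuous block as-is, and stack both into one sparse matrix.
# Features an example doesn't have are just zero entries.
enc = OneHotEncoder()
X = hstack([enc.fit_transform(X_disc), csr_matrix(X_cont)]).tocsr()

# LogisticRegression (like LinearSVC and the naive Bayes variants)
# accepts scipy.sparse matrices directly.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(X.shape, clf.score(X, y))
```

Tree ensembles and KNN can also be made to work, but linear models and naive Bayes are the classifiers that exploit sparse storage most directly.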