data-mining

Principal Component Analysis on Weka

ⅰ亾dé卋堺 submitted on 2019-12-04 10:01:15
I have just computed PCA on a training set, and Weka returned the new attributes along with the way in which they were selected and computed. Now I want to build a model using these data and then use the model on a test set. Do you know if there is a way to automatically modify the test set according to the new attributes? Do you need the principal components for analysis, or just to feed into the classifier? If not, just use the Meta->FilteredClassifier classifier. Set the filter to PrincipalComponents and the classifier to whatever classifier you want to use. Train it on the un…
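
The same setup can be reproduced through the Weka Java API. Below is a minimal sketch, not the asker's code: the file names are hypothetical, J48 stands in for "whatever classifier you want to use", and Weka is assumed to be on the classpath.

    import weka.classifiers.meta.FilteredClassifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.unsupervised.attribute.PrincipalComponents;

    public class PcaClassifier {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("train.arff"); // hypothetical file names
            Instances test = DataSource.read("test.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // FilteredClassifier learns the PCA transformation on the training
            // data and applies the same transformation to every test instance,
            // so the test set never has to be modified by hand.
            FilteredClassifier fc = new FilteredClassifier();
            fc.setFilter(new PrincipalComponents());
            fc.setClassifier(new J48());
            fc.buildClassifier(train); // train on the untransformed data

            for (int i = 0; i < test.numInstances(); i++) {
                System.out.println(fc.classifyInstance(test.instance(i)));
            }
        }
    }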

Writing rules generated by Apriori

谁都会走 submitted on 2019-12-04 09:12:20
Question: I'm working with some large transaction data. I've been using read.transactions and apriori (parts of the arules package) to mine for frequent item pairings. My problem is this: when rules are generated, I can easily view them in the R console using inspect(). Right now I'm manually copying the results into a text file, then saving it and opening it in Excel. I'd like to just save the generated rules using write.csv or something similar, but when I try, I receive an error that the data cannot…

What is the difference between a Confusion Matrix and Contingency Table?

自作多情 submitted on 2019-12-04 09:11:13
Question: I'm writing a piece of code to evaluate my clustering algorithm, and I find that every kind of evaluation method needs the basic data from an m×n matrix like A = {a_ij}, where a_ij is the number of data points that are members of class c_i and elements of cluster k_j. But there appear to be two matrices of this type in Introduction to Data Mining (Pang-Ning Tan et al.): one is the confusion matrix, the other is the contingency table. I do not fully understand the difference between the two. Which…
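
To make the distinction concrete, here is a small illustrative sketch (not from the book) that builds the m×n matrix described above from made-up integer class and cluster labels. When rows and columns range over the same label set (true class vs. predicted class), the square special case is usually called a confusion matrix; crossing classes with clusters, as here, gives a general contingency table.

    public class ContingencyDemo {
        // a[i][j] = number of points with class label i placed in cluster j.
        static int[][] contingency(int[] classes, int[] clusters, int m, int n) {
            int[][] a = new int[m][n];
            for (int p = 0; p < classes.length; p++) {
                a[classes[p]][clusters[p]]++;
            }
            return a;
        }

        public static void main(String[] args) {
            int[] classes  = {0, 0, 1, 1, 2, 2}; // made-up labels
            int[] clusters = {0, 0, 1, 0, 1, 1};
            for (int[] row : contingency(classes, clusters, 3, 2)) {
                for (int v : row) System.out.print(v + " ");
                System.out.println();
            }
        }
    }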

Better text-document clustering than tf/idf and cosine similarity?

血红的双手。 submitted on 2019-12-04 07:53:15
Question: I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorithm with tf/idf and cosine similarity, but I found that the results are quite bad. The main disadvantage of tf/idf is that it clusters documents that are keyword-similar, so it is only good for identifying nearly identical documents. For example, consider the following sentences: 1- The website Stackoverflow is a nice place…
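
The keyword-overlap problem is easy to demonstrate. The sketch below (illustrative only, using plain term frequencies rather than tf/idf weights) computes cosine similarity for two sentences that are about the same topic but share no words; the score is zero, so a tf/idf clusterer has no reason to put them together.

    import java.util.HashMap;
    import java.util.Map;

    public class CosineDemo {
        // Term-frequency vector for a whitespace-tokenized sentence.
        static Map<String, Integer> tf(String text) {
            Map<String, Integer> counts = new HashMap<>();
            for (String token : text.toLowerCase().split("\\s+")) {
                counts.merge(token, 1, Integer::sum);
            }
            return counts;
        }

        // Cosine similarity between two sparse term-frequency vectors.
        static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
            double dot = 0, na = 0, nb = 0;
            for (Map.Entry<String, Integer> e : a.entrySet()) {
                Integer v = b.get(e.getKey());
                if (v != null) dot += e.getValue() * v;
                na += e.getValue() * e.getValue();
            }
            for (int v : b.values()) nb += (double) v * v;
            return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        public static void main(String[] args) {
            // Same topic, no shared keywords: prints 0.0
            System.out.println(cosine(tf("stackoverflow is a nice website"),
                                      tf("programmers ask questions online")));
        }
    }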

How to analyse a sparse adjacency matrix?

﹥>﹥吖頭↗ submitted on 2019-12-04 05:42:45
I am researching sparse adjacency matrices, where most cells are zeros with some ones here and there. Each relationship between two cells has a polynomial description that can be very long, and analysing them manually is time-consuming. My instructor suggests a purely algebraic method in terms of Gröbner bases, but before proceeding I would like to know, from a purely computer-science and programming perspective, how to analyse sparse adjacency matrices. Do there exist data-mining tools to analyse them? hhh: Multivariate polynomial computation and Gröbner bases are an active research area.

Information Gain Calculation for a text file?

◇◆丶佛笑我妖孽 submitted on 2019-12-04 05:27:36
Question: I'm working on "text categorization using information gain, PCA and a genetic algorithm", but after performing preprocessing (stemming, stop-word removal, TF-IDF) on the documents, I'm confused about how to move ahead with the information-gain part. My output file contains each word and its TF-IDF value, like: together (word) - 0.235 (tfidf value), come (word) - 0.2548 (tfidf value). When using Weka for information gain ("InfoGainAttributeEval.java"), it requires an .arff file as input. Is there any way to…
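
For the information-gain step itself, once the documents are in ARFF form (one attribute per word plus a class attribute), the same evaluator can be driven from the Weka Java API. A minimal sketch, with a hypothetical file name and the class attribute assumed to be last:

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class InfoGainDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("documents.arff"); // hypothetical
            data.setClassIndex(data.numAttributes() - 1);

            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new InfoGainAttributeEval());
            selector.setSearch(new Ranker()); // rank every attribute by information gain
            selector.SelectAttributes(data);

            // Print the attribute (word) names in ranked order.
            for (int index : selector.selectedAttributes()) {
                System.out.println(data.attribute(index).name());
            }
        }
    }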

Beginner question on investigating samples in Weka

本秂侑毒 submitted on 2019-12-03 22:33:08
I've just used Weka to train my SVM classifier under the "Classify" tab. Now I want to further investigate which data samples are misclassified; I need to study their pattern, but I don't know where to find this in Weka. Could anyone give me some help, please? Thanks in advance. You can enable the corresponding option ("Output predictions" under "More options..."), and you will get instance predictions like the following:

    === Predictions on test split ===

    inst#   actual      predicted   error   prediction
        1   2:Iris-ver  2:Iris-ver          0.667
      ...
       16   3:Iris-vir  2:Iris-ver  +       0.667

EDIT: As I explained in the comments, you can use the StratifiedRemoveFolds filter to manually split…
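
Outside the Explorer GUI, the misclassified instances can also be collected programmatically. A minimal sketch, assuming hypothetical train/test ARFF files with the class attribute last (SMO is Weka's SVM implementation):

    import weka.classifiers.Classifier;
    import weka.classifiers.functions.SMO;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Misclassified {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("train.arff"); // hypothetical
            Instances test = DataSource.read("test.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            Classifier svm = new SMO();
            svm.buildClassifier(train);

            for (int i = 0; i < test.numInstances(); i++) {
                Instance inst = test.instance(i);
                if (svm.classifyInstance(inst) != inst.classValue()) {
                    // Print the whole instance so its pattern can be studied.
                    System.out.println("inst# " + (i + 1) + ": " + inst);
                }
            }
        }
    }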

Sentence to Word Table with R

南楼画角 submitted on 2019-12-03 21:41:08
I have some sentences, and from the sentences I want to separate the words to get one row vector per sentence. But the words are repeated (recycled) to match the longest sentence's row vector, which I do not want. No matter how long a sentence is, I want its row vector to contain each of its words only one time.

    sentence <- c("case sweden",
                  "meeting minutes ht board meeting st march now also attachment added agenda today s board meeting",
                  "draft meeting minutes board meeting final meeting minutes ht board meeting rd april")
    sentence <- cbind(sentence)
    word_table <- do.call(rbind, strsplit(as.character(sentence), " "))

Implementation of k-means clustering algorithm

情到浓时终转凉″ submitted on 2019-12-03 21:24:55
In my program I'm taking k=2 for the k-means algorithm, i.e. I want only 2 clusters. I have implemented it in a very simple and straightforward way, but I'm still unable to understand why my program gets into an infinite loop. Can anyone please guide me to where I'm making a mistake? For simplicity, I have taken the input in the program code itself. Here is my code:

    import java.io.*;
    import java.lang.*;

    class Kmean {
        public static void main(String args[]) {
            int N = 9;
            int arr[] = {2, 4, 10, 12, 3, 20, 30, 11, 25}; // initial data
            int i, m1, m2, a, b, n = 0;
            boolean flag = true;
            float sum1 = 0, sum2 = 0;
            a = arr[0]; b = arr[1];
            m1 = a; …
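
The full loop is cut off above, but a common cause of an infinite loop in this style of code is a convergence test that never becomes true (for example, the flag is never updated, or the sums are not reset on each pass). For comparison, a minimal working sketch of two-mean clustering on the same data, with an explicit convergence check:

    public class KmeanSketch {
        public static void main(String[] args) {
            int[] arr = {2, 4, 10, 12, 3, 20, 30, 11, 25};
            float m1 = arr[0], m2 = arr[1]; // initial means, as in the question

            boolean changed = true;
            while (changed) {
                float sum1 = 0, sum2 = 0; // reset accumulators every pass
                int n1 = 0, n2 = 0;
                // Assignment step: each point joins the nearer mean.
                for (int x : arr) {
                    if (Math.abs(x - m1) <= Math.abs(x - m2)) { sum1 += x; n1++; }
                    else { sum2 += x; n2++; }
                }
                // Update step: recompute both means.
                float new1 = n1 > 0 ? sum1 / n1 : m1;
                float new2 = n2 > 0 ? sum2 / n2 : m2;
                // Stop when the means no longer move; without a test like
                // this the while loop runs forever.
                changed = (new1 != m1) || (new2 != m2);
                m1 = new1;
                m2 = new2;
            }
            System.out.println("means: " + m1 + ", " + m2);
        }
    }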

Can RapidMiner extract XPaths from a list of URLs instead of first saving the HTML pages?

99封情书 submitted on 2019-12-03 21:15:58
I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure whether the program can help me with my specific needs. I want the program to scrape XPath matches from a URL list I've generated with another program (it has more options than the 'Crawl Web' operator in RapidMiner). I've seen the following tutorial from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I try to scrape have thousands of pages, and I don't want to store them all on my PC. And the web crawler simply lacks…