data-mining | 易学教程

How to use Weka to predict results

Question: I'm new to Weka and confused by the tool. I have a data set of fruit prices and related attributes, and I'm trying to predict the price of a specific fruit from it. Since I'm new to Weka, I couldn't figure out how to do this. Please help me, or point me to a tutorial on how to make predictions, and tell me which method or algorithm is best suited to this task. Thank you.

Answer 1: If you want to know more about how to save a trained classifier and load the same

Cosine distance as vector distance function for k-means

I have a graph of N vertices where each vertex represents a place. I also have vectors, one per user, each with N coefficients, where a coefficient's value is the duration in seconds spent at the corresponding place, or 0 if that place was not visited. For example, on a graph of five vertices, the vector v1 = {100, 50, 0, 30, 0} would mean that we spent 100 seconds at vertex 1, 50 seconds at vertex 2 and 30 seconds at vertex 4 (vertices 3 and 5 were not visited, thus the 0s). I want to run k-means clustering and I've chosen cosine_distance = 1 - cosine_similarity as the distance metric, where the formula for cosine
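
As a rough illustration of the metric named in the question, the sketch below computes cosine_distance = 1 - cosine_similarity for duration vectors of this kind. Only v1 comes from the question; v2 and v3 are hypothetical users invented for the example, and rejecting all-zero vectors is an assumption about how such users should be handled.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a user who visited no place has no direction,
        # so the distance is treated as undefined rather than guessed.
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_u * norm_v)

v1 = [100, 50, 0, 30, 0]   # the vector from the question
v2 = [200, 100, 0, 60, 0]  # hypothetical: same proportions, twice the time
v3 = [0, 0, 40, 0, 80]     # hypothetical: only vertices 3 and 5 visited

print(cosine_distance(v1, v2))  # approximately 0: same direction
print(cosine_distance(v1, v3))  # 1.0: no place in common
```

Because the metric depends only on direction, users who split their time across places in the same proportions end up at distance (approximately) zero regardless of their total time.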

How to determine the number of topics for LDA?

Question: I am new to LDA and I want to use it in my work, but I have run into a problem. To get the best performance, I want to estimate the best number of topics. After reading "Finding Scientific Topics", I know that I can first calculate log P(w|z) and then use the harmonic mean of a series of P(w|z) values to estimate P(w|T). My question is: what does "a series of" mean?

Answer 1: Unfortunately, there is no hard science yielding the correct answer to your question. To the best of my knowledge,
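
The harmonic-mean step described in the question can be sketched as follows. This assumes that the "series" is a set of log P(w|z) values, one per sample of z drawn from the sampler; that is one reading of the question, not something the excerpt confirms. The numbers in `samples` are hypothetical. The computation stays in log space because the raw likelihoods are far too small to represent as floats.

```python
import math

def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of likelihoods supplied as log values.

    HM = M / sum(1 / p_m)   =>   log HM = log M - logsumexp(-log p_m)
    """
    if not log_likelihoods:
        raise ValueError("need at least one sample")
    neg = [-ll for ll in log_likelihoods]
    peak = max(neg)
    # log-sum-exp trick: factor out the largest term to avoid overflow
    lse = peak + math.log(sum(math.exp(x - peak) for x in neg))
    return math.log(len(neg)) - lse

# Hypothetical log P(w|z) values from four samples of z
samples = [-1050.2, -1048.7, -1051.9, -1049.4]
print(log_harmonic_mean(samples))
```

Under this reading, one would repeat the estimate for each candidate number of topics T and compare the resulting values.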

Datamining open source software alternatives [closed]

Question (closed 7 years ago): I am evaluating data mining packages. I have found these two so far: RapidMiner and Weka. Do you have any experience to share with these two

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

I have a data table ("norm") containing normalized values that are numeric, as far as I can see. When I execute

    k <- kmeans(norm,center=3)

I receive the following error:

    Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

Can you help me? Thank you!

kmeans cannot handle data that has NA values. The mean and variance are then no longer well defined, and you no longer know which center is closest.

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
This error also occurs when non-numeric values are present in the table. all of

What is the difference between association rule mining and frequent itemset mining?

Question: I am new to data mining and confused about association rules and frequent itemset mining. To me they seem to be the same thing, but I would like the views of experts on this forum. What is the difference between association rule mining and frequent itemset mining? Thanks.

Answer 1: An association rule is something like "A,B → C", meaning that C tends to occur when A and B occur. An itemset is just a collection such as "A,B,C", and it is frequent if its items tend to co-occur. The usual way to look for
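
To make the distinction concrete, here is a minimal sketch on a hypothetical set of five transactions. The data and the 0.4 support threshold are invented for illustration, and the enumeration is brute force, not what a real miner such as Apriori would do. It first lists the frequent itemsets, then measures the confidence of the rule "A,B → C" from their supports.

```python
from itertools import combinations

# Hypothetical toy transactions, for illustration only
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "D"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(min_support):
    """Brute-force enumeration of all itemsets meeting the threshold."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate)
            if s >= min_support:
                result[candidate] = s
    return result

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(frequent_itemsets(0.4))          # itemsets: collections that co-occur
print(confidence({"A", "B"}, {"C"}))   # rule: how often C follows A and B
```

The itemset ("A", "B", "C") says only that the three items appear together in 40% of transactions; the rule adds a direction, stating that C appears in two thirds of the transactions that contain both A and B.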

What data mining tools do you use? [closed]

Besides the two well-known open source tools RapidMiner and Weka, are there any other good tools (either open source or commercial) that you can recommend for data mining? Thanks in advance!

My money is on R; see e.g. the Machine Learning task view.

How about the open source Orange data mining toolkit? http://www.ailab.si/orange/

You can look at my project, Data Mining SDK.

According to the KDnuggets Poll 2011, RapidMiner once more is the most widely used data mining solution world-wide: http://www.kdnuggets.com/2011/05/tools-used-analytics-data-mining.html

If it is commercial software

Is Triangle inequality necessary for kmeans?

I wonder if the triangle inequality is necessary for the distance measure used in k-means.

k-means is designed for Euclidean distance, which happens to satisfy the triangle inequality. Using other distance functions is risky, as the algorithm may stop converging. The reason, however, is not the triangle inequality but that the mean might not minimize the distance function. (The arithmetic mean minimizes the sum of squares, not arbitrary distances!) There are faster methods for k-means that exploit the triangle inequality to avoid recomputations. But if you stick to classic MacQueen or Lloyd k-means, then you do not
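
The point about the mean can be checked numerically. The sketch below uses a hypothetical one-dimensional cluster and compares the mean and the median as candidate centers under two objectives: the sum of squared distances and the sum of absolute distances.

```python
from statistics import mean, median

points = [0.0, 0.0, 10.0]  # hypothetical 1-D cluster

def sum_squared(center):
    """Sum of squared distances from every point to the center."""
    return sum((p - center) ** 2 for p in points)

def sum_absolute(center):
    """Sum of absolute distances from every point to the center."""
    return sum(abs(p - center) for p in points)

m, med = mean(points), median(points)

print(sum_squared(m), sum_squared(med))    # the mean gives the smaller value
print(sum_absolute(m), sum_absolute(med))  # the median gives the smaller value
```

With absolute distance, moving the center to the mean makes the objective worse than the median does, so a k-means update step that always takes the mean is no longer guaranteed to decrease the objective.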

TFIDF calculating confusion

I found the following code on the internet for calculating TFIDF: https://github.com/timtrueman/tf-idf/blob/master/tf-idf.py

I added "1+" in the function def idf(word, documentList) so I won't get a division-by-zero error:

    return math.log(len(documentList) / (1 + float(numDocsContaining(word,documentList))))

But I am confused about two things:

1. I get negative values in some cases. Is this correct?
2. I am confused by lines 62, 63 and 64. Code:

    documentNumber = 0
    for word in documentList[documentNumber].split(None):
        words[word] = tfidf(word,documentList[documentNumber],documentList)

Should TFIDF be
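
The negative values follow directly from the added "1+": for a word that occurs in every document, the ratio inside the logarithm is N / (1 + N), which is below 1. A minimal sketch, assuming a whitespace tokenizer and a hypothetical three-document corpus (`num_docs_containing` here is a stand-in written for this example, not the helper from the linked script):

```python
import math

def num_docs_containing(word, document_list):
    """Count documents whose whitespace-separated tokens include the word."""
    return sum(1 for doc in document_list if word in doc.split())

def idf_plus_one(word, document_list):
    """The variant from the question: 1 is added to the denominator."""
    n_docs = len(document_list)
    return math.log(n_docs / (1 + num_docs_containing(word, document_list)))

docs = ["the cat sat", "the dog ran", "the bird flew"]  # hypothetical corpus

print(idf_plus_one("the", docs))   # log(3/4): negative, word is in every document
print(idf_plus_one("cat", docs))   # log(3/2): positive
print(idf_plus_one("fish", docs))  # log(3/1): word absent, no division by zero
```

So a negative score is the expected result of this particular smoothing for very common words, not a sign of a coding mistake.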

How can HMMs be used for handwriting recognition?

The problem is a bit different from traditional handwriting recognition. I have a dataset consisting of thousands of the following: for one drawn character, I have several sequential (x, y) coordinates where the pen was pressed down. So, this is a sequential (temporal) problem. I want to be able to classify handwritten characters based on this data, and would love to implement HMMs for learning purposes. But is this the right approach? How can they be used to do this?

I think HMM can be used in both problems mentioned by @jens. I'm working on online handwriting too, and HMM is used in many
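
One common way to set this up, sketched here under assumptions the excerpt does not confirm, is to turn each pen trace into a sequence of discrete direction symbols, keep one HMM per character class, and pick the class whose model gives the trace the highest likelihood under the forward algorithm. The two classes, the four-direction quantization and all probabilities below are hypothetical and hand-picked; in practice the parameters would be learned from data (for example with Baum-Welch).

```python
import math

def directions(points, bins=4):
    """Quantize successive pen movements into `bins` direction symbols."""
    step = 2 * math.pi / bins
    symbols = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        angle = math.atan2(y1 - y0, x1 - x0) % (2 * math.pi)
        symbols.append(int(round(angle / step)) % bins)
    return symbols

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | HMM) for a non-empty sequence."""
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_p += math.log(scale)          # accumulate log of scaling factors
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_p

# Hypothetical two-state models; symbol 0 = rightward, symbol 1 = upward
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
models = {
    "horizontal": (start, trans, [[0.85, 0.05, 0.05, 0.05], [0.7, 0.1, 0.1, 0.1]]),
    "vertical": (start, trans, [[0.05, 0.85, 0.05, 0.05], [0.1, 0.7, 0.1, 0.1]]),
}

def classify(points):
    """Return the class whose HMM assigns the trace the highest likelihood."""
    symbols = directions(points)
    return max(models, key=lambda name: log_likelihood(symbols, *models[name]))

print(classify([(0, 0), (1, 0.1), (2, 0.2), (3, 0.3)]))
print(classify([(0, 0), (0.1, 1), (0.2, 2), (0.3, 3)]))
```

Direction symbols are only one possible feature; continuous features with Gaussian emissions are another option, at the cost of a more involved model.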