data-mining

Choosing a classification algorithm to classify a mix of nominal and numeric data?

北城以北 submitted on 2019-12-03 03:51:42

I have a dataset of about 100,000 records on customers' buying patterns. The dataset contains: Age (a continuous value from 2 to 120, though I also plan to bin it into age ranges); Gender (0 or 1); Address (one of six types, which I can also represent as the numbers 1 to 6); and Preferred shop (one of 7 shops), which is my class label. So my problem is to classify customers and predict their preferred shop from age, gender, and location. I have tried naive Bayes and decision trees, but their classification accuracy is a little low. I am thinking also…
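Since both classifiers expect consistent numeric input, one low-risk preprocessing step is to one-hot encode the nominal fields while leaving age numeric. A minimal pure-Python sketch; the field layout and sample record are invented for illustration:

```python
# Sketch: one-hot encode the nominal fields (gender, address) while keeping
# age numeric, so a tree or Bayes classifier can consume a mixed record.
ADDRESS_TYPES = [1, 2, 3, 4, 5, 6]  # the six address categories

def encode(record):
    """record = (age, gender, address) -> flat numeric feature vector."""
    age, gender, address = record
    onehot = [1.0 if address == a else 0.0 for a in ADDRESS_TYPES]
    return [float(age), float(gender)] + onehot

print(encode((34, 1, 3)))  # [34.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

This keeps the address categories unordered instead of implying that address 6 is "larger" than address 1, which can otherwise mislead distance- or threshold-based learners.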

Numeric example of the Expectation Maximization Algorithm [duplicate]

喜你入骨 submitted on 2019-12-03 03:27:08

This question already has answers here: What is an intuitive explanation of the Expectation Maximization technique? [closed] (8 answers). Could anyone provide a simple numeric example of the EM algorithm, as I am not sure about the formulas given? A really simple one with 4 or 5 Cartesian coordinates would do perfectly. What about this: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Expectation_Maximization_(EM)#A_simple_example I had also written a simple example in R a year ago; unfortunately, I am unable to locate it. I'll try again to find it later. EDIT: Here it…
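To complement the wikibooks link, here is a deliberately tiny EM run, in Python rather than R: five 1-D points, a mixture of two unit-variance, equal-weight Gaussians, fitting only the two means. All values are invented for illustration:

```python
import math

# Toy EM for a mixture of two 1-D Gaussians (unit variance, equal weights),
# fitting only the two means -- the "4 or 5 points" example the question asks for.
data = [0.0, 0.5, 1.0, 9.0, 10.0]
mu = [0.0, 5.0]  # initial guesses

def pdf(x, m):
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

for _ in range(50):
    # E-step: responsibility of component 0 for each point
    r0 = [pdf(x, mu[0]) / (pdf(x, mu[0]) + pdf(x, mu[1])) for x in data]
    # M-step: responsibility-weighted mean updates
    mu[0] = sum(r * x for r, x in zip(r0, data)) / sum(r0)
    mu[1] = sum((1 - r) * x for r, x in zip(r0, data)) / sum(1 - r for r in r0)

print(mu)  # converges to roughly [0.5, 9.5]
```

The E-step assigns each point a soft membership in each component; the M-step re-estimates each mean as the membership-weighted average, and the two alternate until the means stop moving.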

What is the difference between a Confusion Matrix and Contingency Table?

懵懂的女人 submitted on 2019-12-03 03:17:16

I'm writing a piece of code to evaluate my clustering algorithm, and I find that every evaluation method needs the basic data from an m*n matrix A = {a_ij}, where a_ij is the number of data points that are members of class c_i and elements of cluster k_j. But there appear to be two matrices of this type in Introduction to Data Mining (Pang-Ning Tan et al.): one is the Confusion Matrix, the other is the Contingency Table. I do not fully understand the difference between the two. Which best describes the matrix I want to use? SpeedBirdNine: Wikipedia's definition: In the field of…
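Whichever name applies, the m*n count matrix the question describes can be built directly from paired (class, cluster) labels. A pure-Python sketch with invented labels:

```python
# Build the m*n matrix A = {a_ij}: a_ij counts points in class c_i that
# landed in cluster k_j. Class and cluster labels here are invented.
classes  = ['a', 'a', 'a', 'b', 'b', 'b']
clusters = [ 0,   0,   1,   1,   1,   1 ]

class_ids   = sorted(set(classes))    # rows: classes
cluster_ids = sorted(set(clusters))   # columns: clusters
table = [[0] * len(cluster_ids) for _ in class_ids]
for c, k in zip(classes, clusters):
    table[class_ids.index(c)][cluster_ids.index(k)] += 1

print(table)  # [[2, 1], [0, 3]]
```

From this one matrix you can derive most external cluster-validity measures (purity, Rand index, etc.), so the construction is the same regardless of which of the two names your textbook uses for it.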

Hierarchical clustering of 1 million objects

心不动则不痛 submitted on 2019-12-03 02:23:55

Question: Can anyone point me to a hierarchical clustering tool (preferably in Python) that can cluster ~1 million objects? I have tried hcluster and also Orange. hcluster had trouble with 18k objects. Orange was able to cluster 18k objects in seconds, but failed with 100k objects (it saturated memory and eventually crashed). I am running on a 64-bit Xeon CPU (2.53 GHz) with 8 GB of RAM + 3 GB of swap on Ubuntu 11.10. Answer 1: To beat O(n^2), you'll have to first reduce your 1M points (documents) to e.g. 1000 piles…
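The answer's pre-reduction idea can be sketched in miniature: first compress the data to a small number of representative centers with k-means, then run the expensive hierarchical step only on those centers, so the O(n^2) part sees k << N items. A pure-Python 1-D toy; the data, pile count, and cluster layout are invented:

```python
import random
random.seed(0)

# Sketch of the answer's idea: pre-cluster N points into a small number of
# "piles" with k-means, then run hierarchical clustering on the pile centers
# only, so the quadratic step sees k << N points.
points = [random.gauss(c, 0.3) for c in (0, 5, 10) for _ in range(200)]

def kmeans_1d(xs, k, iters=20):
    centers = random.sample(xs, k)
    for _ in range(iters):
        piles = [[] for _ in range(k)]
        for x in xs:
            piles[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        # keep the old center if a pile ends up empty
        centers = [sum(p) / len(p) if p else centers[i]
                   for i, p in enumerate(piles)]
    return sorted(centers)

centers = kmeans_1d(points, 12)  # 600 points -> 12 representatives
# the hierarchical step would now run on 12 items instead of 600
print(len(centers))
```

In practice the same shape works at scale: a mini-batch k-means pass to a few thousand centroids, then any standard agglomerative routine on the centroids alone.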

how to determine the number of topics for LDA?

走远了吗. submitted on 2019-12-03 02:17:22

I am new to LDA and want to use it in my work. However, some problems have come up. In order to get the best performance, I want to estimate the best number of topics. After reading "Finding Scientific Topics", I know that I can first calculate log P(w|z) and then use the harmonic mean of a series of P(w|z) values to estimate P(w|T). My question is: what does "a series of" mean? Unfortunately, there is no hard science yielding the correct answer to your question. To the best of my knowledge, the hierarchical Dirichlet process (HDP) is quite possibly the best way to arrive at the optimal number of…
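"A series of" refers to the log P(w|z) values recorded over successive retained Gibbs samples z^(s); the paper's estimator takes the harmonic mean of those sampled likelihoods as an estimate of P(w|T). A sketch of that computation, done stably in log space; the log-likelihood values below are invented for illustration:

```python
import math

# Harmonic-mean estimator sketch: one log-likelihood log P(w|z^(s)) per
# retained Gibbs sample; P(w|T) is estimated by their harmonic mean.
# These log-likelihood values are invented for illustration.
log_liks = [-1050.0, -1048.5, -1052.3, -1049.1, -1051.7]

def harmonic_mean_log(lls):
    """log of the harmonic mean of exp(lls), computed stably in log space."""
    # log HM = log S - logsumexp(-lls), with the max factored out for stability
    m = max(-l for l in lls)
    return math.log(len(lls)) - (m + math.log(sum(math.exp(-l - m) for l in lls)))

print(harmonic_mean_log(log_liks))
```

Repeating this for several candidate topic counts T and comparing the resulting estimates is the model-selection procedure the paper describes, though the harmonic-mean estimator is known to be high-variance, which is one reason alternatives like HDP are often suggested.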

Writing rules generated by Apriori

让人想犯罪 __ submitted on 2019-12-03 02:06:32

I'm working with some large transaction data. I've been using read.transactions and apriori (parts of the arules package) to mine for frequent item pairings. My problem is this: when rules are generated, I can easily view them in the R console using inspect(). Right now I'm manually copying the results into a text file, then saving it and opening it in Excel. I'd like to just save the generated rules using write.csv or something similar, but when I try, I receive an error that the data cannot be coerced into a data.frame. Does anyone have experience doing this successfully in R? I know I'm…
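In R itself, the commonly suggested route is to coerce the rules object first, e.g. write.csv(as(rules, "data.frame"), "rules.csv"). As a language-neutral illustration of the same export step, here is a pure-Python toy that counts frequent item pairs and writes them as CSV; the transactions are invented:

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Toy: mine item-pair support counts and export them as CSV rows,
# mirroring the "rules object -> flat table -> write.csv" round trip.
transactions = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread", "eggs"}]

pairs = Counter()
for t in transactions:
    for a, b in combinations(sorted(t), 2):
        pairs[(a, b)] += 1

buf = io.StringIO()  # stand-in for a file opened with open("rules.csv", "w")
w = csv.writer(buf)
w.writerow(["item_a", "item_b", "support_count"])
for (a, b), n in sorted(pairs.items()):
    w.writerow([a, b, n])
print(buf.getvalue())
```

The key point in either language is the same: the mined rules must be flattened into a plain tabular structure before a CSV writer will accept them.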

Datamining open source software alternatives [closed]

和自甴很熟 submitted on 2019-12-03 01:55:39

I am evaluating data mining packages. I have found these two so far: RapidMiner and Weka. Do you have any experience to share with these two products, or any other product to recommend? Thanks. According to the yearly KDnuggets polls of 2007, 2008, and 2009, RapidMiner is the most widely used open-source data mining solution among data mining experts worldwide (KDnuggets Data Mining Tool Poll 2009). RapidMiner is open source and 100% Java; it is much more flexible and offers significantly more functionality than Weka and KNIME. Regarding SVM implementations: Weka comes with one such…

How can I find the center of a cluster of data points?

我是研究僧i submitted on 2019-12-03 01:42:56

Question: Let's say I plotted the position of a helicopter every day for the past year and came up with the following map. Any human looking at this would be able to tell me that this helicopter is based out of Chicago. How can I find the same result in code? I'm looking for something like this: $geoCodeArray = array([GET=http://pastebin.com/grVsbgL9]); function findHome($geoCodeArray) { // magic return $geoCode; } Ultimately generating something like this: UPDATE: Sample dataset. Here's a map with a…
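One candidate for the "magic" is the geometric median of the daily positions: unlike the plain mean, it is not dragged far off base by occasional long trips. A sketch using Weiszfeld's iteration, in Python rather than PHP, on invented (x, y) points:

```python
# Sketch: estimate the "home base" as the geometric median of daily
# positions via Weiszfeld's iteration. Points are invented; four cluster
# near the origin and one is a far-away trip.
points = [(0.1, 0.2), (0.0, -0.1), (-0.2, 0.0), (0.05, 0.1), (8.0, 9.0)]

def geometric_median(pts, iters=100):
    # start from the centroid, then iterate the distance-weighted average
    x = sum(p[0] for p in pts) / len(pts)
    y = sum(p[1] for p in pts) / len(pts)
    for _ in range(iters):
        num_x = num_y = denom = 0.0
        for px, py in pts:
            d = ((px - x) ** 2 + (py - y) ** 2) ** 0.5 or 1e-12
            num_x += px / d
            num_y += py / d
            denom += 1.0 / d
        x, y = num_x / denom, num_y / denom
    return x, y

cx, cy = geometric_median(points)
print(cx, cy)  # stays near the dense cluster, not dragged toward (8, 9)
```

For real latitude/longitude data over a small region this flat-plane approximation is usually acceptable; over large distances the distances should be computed with a great-circle formula instead.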

Decision tree vs. Naive Bayes classifier [closed]

淺唱寂寞╮ submitted on 2019-12-03 01:34:24

Question: Closed. This question is off-topic and is not currently accepting answers (closed 4 years ago). I am doing some research on different data mining techniques and came across something I could not figure out. If anyone has any idea, that would be great. In which cases is it better to use a decision tree, and in which a naive Bayes classifier? Why use one of them in certain cases? And the other in…

Outlier detection in data mining [closed]

别等时光非礼了梦想. submitted on 2019-12-03 01:13:54

Question: Closed. This question is off-topic and is not currently accepting answers (closed 6 years ago). I have a few sets of questions regarding outlier detection: Can we find outliers using k-means, and is this a good approach? Is there any clustering algorithm that does not require any input from the user? Can we use a support vector machine or any other supervised learning algorithm for outlier detection? What are…
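For the first bullet: a common k-means-based approach scores each point by its distance to the nearest centroid and flags unusually distant points as outliers. A pure-Python 1-D toy; the data, centroids, and threshold are all invented, and the centroids are fixed as if k-means (k=2) had already converged:

```python
# Sketch: k-means-based outlier scoring. Each point's score is its distance
# to the nearest centroid; scores above a threshold flag outliers.
data = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 20.0]
centroids = [1.05, 5.03]  # pretend k-means with k=2 already converged here

def outliers(xs, cs, threshold=3.0):
    scores = [min(abs(x - c) for c in cs) for x in xs]
    return [x for x, s in zip(xs, scores) if s > threshold]

print(outliers(data, centroids))  # [20.0]
```

This works reasonably when outliers are far from all dense regions, but note its main weakness: the outliers themselves participate in fitting the centroids, so heavy contamination can pull a centroid toward the outliers and hide them.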