data-mining

Plot the cluster member in r

对着背影说爱祢 posted on 2019-12-03 20:57:54
I use the DTW package in R, and I have finally finished hierarchical clustering. Now I want to plot each time-series cluster separately, like the picture below.

    sc <- read.table("D:/handling data/confirm.csv", header=T, sep="," )
    rownames(sc) <- sc$STDR_YM_CD
    sc$STDR_YM_CD <- NULL
    col_n <- colnames(sc)
    hc <- hclust(dist(sc), method="average")
    plot(hc, main="")

How can I do it? My data is at http://blogattach.naver.com/e772fb415a6c6ddafd1370417f96e494346a9725/20170207_141_blogfile/khm2963_1486442387926_THgZRt_csv/confirm.csv?type=attachment

You can try this:

    sc <- read.table("confirm.csv", header=T, sep="," )

What is Java Data Mining, JDM?

我们两清 posted on 2019-12-03 17:11:19
Question: I am looking at JDM. Is it simply an API for interacting with other tools that do the actual data mining? Or is it a set of packages that contain the actual data mining algorithms?

Answer 1: Ah, the wonders of the interweb: Java Data Mining (JDM) is a standard Java API for developing data mining applications and tools. JDM defines an object model and Java API for data mining objects and processes. JDM enables applications to integrate data mining technology for developing predictive analytics

Python, web log data mining for frequent patterns

為{幸葍}努か posted on 2019-12-03 16:51:05
I need to develop a tool for web log data mining. Given many sequences of URLs, each requested within a particular user session (retrieved from web-application logs), I need to figure out the usage patterns and the groups (clusters) of users of the website. I am new to data mining and have been searching Google a lot. I found some useful information; for example, searching for "Frequent Pattern Mining in Web Log Data" seems to point to almost exactly similar studies. So my questions are: Are there any Python-based tools that do what I need, or at least something similar? Can the Orange toolkit be of any help? Can reading the book
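As a starting point for the usage patterns described in this question, here is a minimal sketch that counts contiguous URL sequences shared across sessions. The function name `frequent_sequences`, its parameters, and the sample sessions are all hypothetical; it is a simple n-gram count, not a full sequential-pattern miner and not something the Orange toolkit is known to provide in this form.

```python
from collections import Counter


def frequent_sequences(sessions, length=2, min_support=2):
    """Count contiguous URL sequences of a given length across sessions.

    Support is the number of sessions that contain the sequence at least once.
    """
    support = Counter()
    for urls in sessions:
        # Using a set means each session counts a sequence only once.
        seen = {tuple(urls[i:i + length]) for i in range(len(urls) - length + 1)}
        support.update(seen)
    return {seq: n for seq, n in support.items() if n >= min_support}


# Hypothetical sessions; real ones would be parsed from the application logs.
sessions = [
    ["/home", "/search", "/item", "/cart"],
    ["/home", "/search", "/item"],
    ["/home", "/about"],
]
patterns = frequent_sequences(sessions, length=2, min_support=2)
for seq, n in sorted(patterns.items()):
    print(" -> ".join(seq), n)
```

Sessions could then be grouped by which frequent sequences they contain, which is one possible route toward the user clusters the question asks about.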

Computing F-measure for clustering

点点圈 posted on 2019-12-03 16:28:29
Can anyone help me calculate the F-measure collectively? I know how to calculate recall and precision, but I don't know how to calculate a single F-measure value for a given algorithm. As an example, suppose my algorithm creates m clusters, but I know there are n clusters for the same data (as created by another benchmark algorithm). I found one PDF, but it is not useful since the collective value I got is greater than 1. The PDF reference is "F Measure explained". Specifically, I have read a research paper in which the author compares two algorithms on the basis of F-measure; they got collectively
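One common way to get a single value that stays within [0, 1] is to take, for each benchmark class, the best F-score over all produced clusters and then average those scores weighted by class size. The sketch below assumes that definition; the excerpt does not say which definition the referenced PDF uses, and the name `clustering_f_measure` is hypothetical.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_labels):
    """Class-size-weighted average of the best per-class F-score."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_labels, cluster_labels))

    total = 0.0
    for cls, cls_size in class_sizes.items():
        best = 0.0
        for clu, clu_size in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / clu_size  # share of the cluster that is this class
            recall = n_ij / cls_size     # share of the class found in this cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        # Weights sum to 1 and each best score is at most 1.
        total += (cls_size / n) * best
    return total


# n = 2 benchmark classes, m = 3 produced clusters.
truth = ["a", "a", "a", "b", "b", "b"]
found = [0, 0, 1, 1, 2, 2]
score = clustering_f_measure(truth, found)
print(round(score, 3))
```

Because the weights sum to 1 and every per-class score is at most 1, the result cannot exceed 1; a collective value above 1 usually means the per-class scores were summed without weighting.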

How do I create a new data table in Orange?

生来就可爱ヽ(ⅴ<●) posted on 2019-12-03 15:51:39
I am using Orange (in Python) for some data mining tasks, more specifically for clustering. Although I have gone through the tutorial and read most of the documentation, I still have a problem. All the examples in the docs and tutorials assume that I already have a tab-delimited table with data in it. However, nothing explains how to create a new table from scratch. For example, I want to create a table of word frequencies across different documents. Maybe I am missing something, so if anyone has any insight it would be appreciated. Thanks, George. EDIT: This is how I create my table
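One workaround that avoids depending on any particular Orange API is to build the word-frequency table in plain Python and write it out as tab-delimited text, the format the tutorials start from. This is only a sketch under that assumption: the helper names are hypothetical, and the exact header rows Orange expects (for example type or flag rows) are not shown here and should be checked against the documentation for the installed version.

```python
import csv
import io
from collections import Counter


def word_frequency_table(documents):
    """Return (header, rows) with one row per document and one column per word."""
    counts = {name: Counter(text.lower().split()) for name, text in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    header = ["document"] + vocabulary
    rows = [[name] + [counts[name][word] for word in vocabulary] for name in documents]
    return header, rows


def to_tab_delimited(header, rows):
    """Serialize the table as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical documents, for illustration only.
documents = {"doc1": "data mining with orange", "doc2": "orange data data"}
header, rows = word_frequency_table(documents)
table_text = to_tab_delimited(header, rows)
print(table_text)
```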

Supermarket dataset for Apriori algorithm

守給你的承諾、 posted on 2019-12-03 15:36:52
'I have to develop a software which is meant for Business Analyst of “Future Stores” Supermarket, the software performs the Association Rule Mining on given transitional data of supermarket sales transactions and prepares Discounting policy by preparing Combo. The software makes use of the data mining algorithms namely Apriori Algorithm. The Association Rules will be displayed in User friendly manner for generation of discounting policy based on positive association rules.'

Where can I get a supermarket dataset to check the Apriori algorithm that I have coded?

To get a market dataset,
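Independently of where a real supermarket dataset comes from, a tiny hand-checkable set of transactions can be used to verify an Apriori implementation first. The sketch below is a compact reference version under that assumption; the five baskets are made up for illustration and the function name `apriori` is hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    current = {c: n for c, n in count(items).items() if n >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c: n for c, n in count(candidates).items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Made-up baskets, small enough to count by hand.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support=3)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum support of 3, four single items and four pairs are frequent, while {bread, milk, diapers} appears in only two baskets and is rejected.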

Numeric example of the Expectation Maximization Algorithm [duplicate]

孤人 posted on 2019-12-03 14:37:50
Question: Could anyone provide a simple numeric example of the EM algorithm, as I am not sure about the formulas given? A really simple one with 4 or 5 Cartesian coordinates would do perfectly.

Answer 1: What about this: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Expectation_Maximization_(EM)#A_simple_example I had also written a
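In the spirit of the small numeric example requested, here is a minimal sketch of EM for a mixture of two Gaussians. To keep the arithmetic easy to follow it assumes one-dimensional points rather than Cartesian pairs; the five data values, the starting parameters, and the function names are illustrative choices, not taken from the linked example.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(xs, means, variances, weights, iterations=20):
    """Run EM for a 1-D Gaussian mixture; returns the fitted parameters
    and the log-likelihood measured at the start of each iteration."""
    means, variances, weights = list(means), list(variances), list(weights)
    log_likelihoods = []
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp, ll = [], 0.0
        for x in xs:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            ll += math.log(total)
            resp.append([pk / total for pk in p])
        log_likelihoods.append(ll)
        # M-step: re-estimate parameters from the weighted points.
        for k in range(len(means)):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor guards against a collapsed component
            weights[k] = nk / len(xs)
    return means, variances, weights, log_likelihoods


xs = [1.0, 2.0, 3.0, 8.0, 9.0]
means, variances, weights, lls = em_two_gaussians(
    xs, means=[2.0, 7.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
print(means, weights)
```

On this data the two means settle near 2.0 and 8.5 with weights near 0.6 and 0.4, which can be checked by hand as the averages of {1, 2, 3} and {8, 9}.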

Cluster quality measures

不想你离开。 posted on 2019-12-03 14:36:28
Does Matlab provide any facility for evaluating clustering methods (cluster compactness, cluster separation, and so on)? Or is there a toolbox for it?

Not in Matlab, but ELKI (Java) provides a dozen or so cluster quality measures for evaluation.

Matlab provides the Silhouette index, and there is a toolbox, CVAP: Cluster Validity Analysis Platform for Matlab, which includes the following validity indexes: Davies-Bouldin, Calinski-Harabasz, Dunn index, R-squared index, Hubert-Levin (C-index), Krzanowski-Lai index, Hartigan index, Root-mean-square standard deviation (RMSSTD) index, Semi-partial R-squared (SPR)
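To make the idea of compactness versus separation concrete, here is a minimal sketch of the Silhouette index in plain Python. It is not the Matlab, ELKI, or CVAP implementation; the function name and the four sample points are hypothetical, and Euclidean distance is assumed.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster (compactness).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster (separation).
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters that mix the two groups
print(round(good, 3), round(bad, 3))
```

Values close to 1 indicate compact, well-separated clusters, while negative values suggest that points sit closer to another cluster than to their own.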

Choosing classification algorithm to classify mix of nominal and numeric data?

孤街醉人 posted on 2019-12-03 13:56:31
Question: I have a dataset of about 100,000 records on the buying patterns of customers. The data set contains:

Age (a continuous value from 2 to 120), although I also plan to categorize it into age ranges
Gender (either 0 or 1)
Address (only six types, which I can also represent using the numbers 1 to 6)
Preference shop (one of only 7 shops), which is my class attribute

So my problem is to classify customers and predict their preference shop based on their age, gender, and location. I have tried to use

R: unclear behaviour of tuneRF function (randomForest package)

大兔子大兔子 posted on 2019-12-03 13:06:09
Question: I am unsure about the meaning of the stepFactor parameter of the tuneRF function, which is used to tune the mtry parameter that is later passed to the randomForest function. The documentation of tuneRF says that stepFactor is the magnitude by which the chosen mtry gets deflated or inflated. Obviously, since mtry is the number of variables chosen randomly, it has to be an integer; however, I have seen many examples on the net using stepFactor=1.5 . At first I thought that R uses by default next mtry
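To see how a non-integer stepFactor can still yield integer mtry values, the sketch below simulates a left-then-right search. It rests on an assumption that should be verified against the tuneRF source of the installed randomForest version: that the search divides by stepFactor and rounds up when moving left, and multiplies by stepFactor and rounds down when moving right, stopping when the value no longer changes. It also ignores the stopping rule based on out-of-bag error improvement. The function name `mtry_candidates` is hypothetical.

```python
import math


def mtry_candidates(mtry_start, step_factor, n_features):
    """List the mtry values visited when stepping left, then right.

    Assumed rounding (check against the tuneRF source): ceiling when
    deflating, floor when inflating, clamped to [1, n_features].
    """
    def walk(step):
        values, current = [], mtry_start
        while True:
            nxt = step(current)
            if nxt == current:  # no change, so the search cannot move further
                break
            values.append(nxt)
            current = nxt
        return values

    left = walk(lambda m: max(1, math.ceil(m / step_factor)))
    right = walk(lambda m: min(n_features, math.floor(m * step_factor)))
    return left, right


left, right = mtry_candidates(mtry_start=3, step_factor=1.5, n_features=10)
print(left, right)
```

Under these assumptions, a start value of 1 with stepFactor=1.5 never moves at all, because both roundings return 1 again.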