data-mining | 易学教程

How to plot/visualize a C50 decision tree in R?

Question: I am using the C50 decision tree algorithm. I am able to build the tree and get the summaries, but cannot figure out how to plot or visualize the tree. My C50 model is called credit_model. In other decision tree packages, I usually use something like plot(credit_model); in rpart it is rpart.plot(credit_model). What is the equivalent for plotting a C50 model? Answer 1: Right now, there is none built in. I've been working on an adapter for the partykit package (e.g. as.party) but have not gotten very far. (Max) Answer 2: You can use the following routine to directly convert the decision tree into GraphViz dot

Supermarket dataset for Apriori algorithm

Question: 'I have to develop a software which is meant for Business Analyst of “Future Stores” Supermarket, the software performs the Association Rule Mining on given transitional data of supermarket sales transactions and prepares Discounting policy by preparing Combo. The software makes use of the data mining algorithms namely Apriori Algorithm. The Association Rules will be displayed in User friendly manner for generation of discounting policy based on positive association rules.' From where can I

Cluster quality measures

Question: Does Matlab provide any facility for evaluating clustering methods (cluster compactness, cluster separation, and so on)? Or is there a toolbox for it? Answer 1: Not in Matlab, but ELKI (Java) provides a dozen or so cluster quality measures for evaluation. Answer 2: Matlab provides the Silhouette index, and there is a toolbox, CVAP: Cluster Validity Analysis Platform for Matlab, which includes the following validity indexes: Davies-Bouldin, Calinski-Harabasz, Dunn index, R-squared index, Hubert-Levin (C-index)

Calculate similarity between list of words

Question: I want to calculate the similarity between two lists of words. For example, ['email','user','this','email','address','customer'] is similar to this list: ['email','mail','address','netmail']. I want that pair to get a higher similarity percentage than it would with another list, for example ['address','ip','network'], even though address appears in that list too. Answer: Since you haven't really been able to show a clear expected output, here is my best shot: list_A = ['email','user','this','email','address','customer'] list_B = ['email','mail','address','netmail'] For the above two lists, we will find the cosine similarity between
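The answer above is cut off before its code, so the following is only a minimal sketch of cosine similarity over word counts, not the answerer's routine. The function name `cosine_similarity` and the third list `list_c` (taken from the question) are choices made for this illustration. Each list is treated as a bag of words, so 'email' and 'mail' count as unrelated terms.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine similarity between two word lists, using raw word counts."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    # Dot product over the words that occur in both lists.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty list has no direction to compare
    return dot / (norm_a * norm_b)


list_a = ['email', 'user', 'this', 'email', 'address', 'customer']
list_b = ['email', 'mail', 'address', 'netmail']
list_c = ['address', 'ip', 'network']

sim_ab = cosine_similarity(list_a, list_b)
sim_ac = cosine_similarity(list_a, list_c)
print(f"A vs B: {sim_ab:.3f}")
print(f"A vs C: {sim_ac:.3f}")
```

With these inputs the first pair scores higher than the second, which is the ordering the question asks for. Capturing that 'mail' and 'netmail' are close to 'email' would need stemming or word embeddings, which this sketch does not attempt.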

Newbie: where to start given a problem to predict future success or not

We have a production web-based product that allows users to make predictions about the future value (or demand) of goods. The historical data contains about 100k examples, and each example has about 5 parameters. Consider a class of data called a prediction: prediction { id: int predictor: int predictionDate: date predictedProductId: int predictedDirection: byte (0 for decrease, 1 for increase) valueAtPrediciton: float } and a paired result class that measures the result of the prediction: predictionResult { id: int valueTenDaysAfterPrediction: float valueTwentyDaysAfterPrediction: float

What is Big Data & What classifies as Big data? [closed]

Closed as opinion-based 3 years ago. I have gone through a lot of articles but I don't seem to get a perfectly clear answer on what exactly big data is. On one page I saw "any data which is bigger for your usage, is big data i.e. 100 MB is considered big data for your mailbox but not your hard disc". Whereas another article said "big data to be usually more than 1 TB with different volume / variety / velocity

Data Sets For Data Mining Tasks [closed]

I am relatively new to the field of data mining. I am currently implementing some data preprocessing algorithms such as PCA and min-max normalization. Our professor said we could download data sets available on the web. To start with, I want a simple data set with a relatively small number of attributes for my algorithm, and would then switch to more complex data sets. Can anyone provide a link to simple data sets that you have used in your own data mining algorithms? For example, something pertaining to students' marks, age, height etc., or employee data of a company. Any assistance would

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

Question: I have a data table ("norm") containing normalized values that are numeric, at least as far as I can see, of the following form: When I execute k <- kmeans(norm,center=3) I receive the following error: Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1) Can you help me? Thank you! Answer 1: kmeans cannot handle data that has NA values. The mean and variance are then no longer well defined, and you no longer know which center is closest. Answer 2: Error in do_one(nmeth) : NA/NaN/Inf

How to score a linear model using PMML file and Augustus on Python

I am new to Python, PMML and Augustus, so this is something of a newbie question. I have a PMML file that I want to score with after every new iteration of data. I have to use Python with Augustus only to complete this exercise. I have read various articles, some of them worth mentioning because they are good ( http://augustusdocs.appspot.com/docs/v06/model_abstraction/augustus_and_pmml.html , http://augustus.googlecode.com/svn-history/r191/trunk/augustus/modellib/regression/producer/Producer.py ). I have read the Augustus documentation relevant to scoring to understand how it works, but I am unable to solve this

Calculate ordering of dendrogram leaves

I have five points and I need to create a dendrogram from them. The function 'dendrogram' can be used to find the ordering of these points, as shown below. However, I do not want to use dendrogram because it is slow and results in an error for a large number of points (I asked about this in "Python alternate way to find dendrogram"). Can someone point me to how to convert the 'linkage' output (Z) to the "dendrogram(Z)['ivl']" value?
>>> from hcluster import pdist, linkage, dendrogram
>>> import numpy
>>> from numpy.random import rand
>>> x = rand(5,3)
>>> Y = pdist(x)
>>> Z = linkage(Y)
>>> Z
array([[ 1.
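One possible way to get the leaf ordering without drawing anything is to walk the linkage matrix directly. The sketch below assumes Z follows the SciPy-style linkage layout (row i merges the clusters in columns 0 and 1 into a new cluster with index n + i, where n is the number of points); the function name `leaf_order` and the hand-written example matrix are made up for this illustration. It returns integer leaf indices in left-to-right order, which is expected to correspond to the default `dendrogram(Z)['ivl']` labels, though that has not been checked against hcluster here. SciPy users may find that `scipy.cluster.hierarchy.leaves_list` already covers this.

```python
def leaf_order(Z):
    """Left-to-right leaf indices of the tree encoded by linkage matrix Z.

    Assumes the SciPy-style layout: row i is [left, right, distance, count]
    and describes cluster n + i, where n = len(Z) + 1 is the number of points.
    """
    n = len(Z) + 1
    order = []
    stack = [2 * n - 2]  # index of the root cluster (the last merge)
    while stack:
        node = stack.pop()
        if node < n:
            order.append(node)  # original observation
        else:
            left, right = Z[node - n][0], Z[node - n][1]
            # Push right first so the left subtree is visited first.
            stack.append(int(right))
            stack.append(int(left))
    return order


# Hand-written linkage for five points (hypothetical values, not the
# truncated output from the question).
Z_example = [
    [0, 1, 0.1, 2],  # cluster 5 = {0, 1}
    [2, 3, 0.2, 2],  # cluster 6 = {2, 3}
    [4, 5, 0.5, 3],  # cluster 7 = {4, 0, 1}
    [6, 7, 0.9, 5],  # cluster 8 = root
]
order_example = leaf_order(Z_example)
print(order_example)
```

The traversal uses an explicit stack rather than recursion, so deep, chain-like trees from a large number of points do not hit Python's recursion limit.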