data-mining

How to plot/visualize a C50 decision tree in R?

Posted by 和自甴很熟 on 2019-12-05 05:12:47
Question: I am using the C50 decision tree algorithm. I can build the tree and get the summaries, but I cannot figure out how to plot or visualize the tree. My C50 model is called credit_model. In other decision tree packages, I usually use something like plot(credit_model); in rpart it is rpart.plot(credit_model). What is the equivalent for plotting a C50 model?
Answer 1: Right now, there is none built in. I've been working on an adapter for the partykit package (e.g. as.party) but have not gotten very far. Max
Answer 2: You can use the following routine to directly convert the decision tree into GraphViz dot

Supermarket dataset for Apriori algorithm

Posted by 白昼怎懂夜的黑 on 2019-12-05 01:30:52
Question: 'I have to develop software for the business analyst of the “Future Stores” supermarket. The software performs association rule mining on the given transactional data of supermarket sales and prepares a discounting policy by building combos. It uses the Apriori data mining algorithm. The association rules will be displayed in a user-friendly manner so that a discounting policy can be generated from the positive association rules.' From where can I

Cluster quality measures

Posted by 守給你的承諾、 on 2019-12-04 23:17:49
Question: Does Matlab provide any facility for evaluating clustering methods (cluster compactness, cluster separation, ...)? Or is there a toolbox for it?
Answer 1: Not in Matlab, but ELKI (Java) provides a dozen or so cluster quality measures for evaluation.
Answer 2: Matlab provides the Silhouette index, and there is a toolbox, CVAP (Cluster Validity Analysis Platform) for Matlab, which includes the following validity indexes: Davies-Bouldin, Calinski-Harabasz, Dunn index, R-squared index, Hubert-Levin (C-index)
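
To make the Silhouette index mentioned in Answer 2 concrete, here is a minimal sketch of its usual definition, written in Python rather than Matlab. The function name `silhouette_scores` and the toy points are made up for illustration; this is not the Matlab or CVAP implementation.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_scores(points, labels):
    """Per-point silhouette value s = (b - a) / max(a, b).

    a: mean distance to the other points of the same cluster (compactness)
    b: smallest mean distance to the points of any other cluster (separation)
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # dist(p, p) is 0, so summing over the whole cluster is safe.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical toy data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_scores(points, [0, 0, 1, 1])
bad = silhouette_scores(points, [0, 1, 0, 1])
print(sum(good) / len(good), sum(bad) / len(bad))
```

Values near 1 indicate compact, well-separated clusters; values near or below 0 indicate points that sit closer to another cluster than to their own.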

Calculate similarity between list of words

Posted by 不羁岁月 on 2019-12-04 22:05:20
Question: I want to calculate the similarity between two lists of words. For example, ['email','user','this','email','address','customer'] is similar to this list: ['email','mail','address','netmail']. I want it to get a higher similarity percentage than another list, for example ['address','ip','network'], even though 'address' appears in that list.
Answer 1: Since you haven't really been able to demonstrate a crystal-clear expected output, here is my best shot:
list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']
In the above two lists, we will find the cosine similarity between
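
The answer above is cut off before its code, so the following is only a sketch of one common way to compute cosine similarity between word lists, not a reconstruction of the answerer's routine. It treats each list as a bag of words with raw counts; the function name `cosine_similarity` and the name `list_C` for the third list are assumptions.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(words_a, words_b):
    """Cosine similarity between two word lists, using raw word counts."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    # Only words present in both lists contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty list has no direction to compare
    return dot / (norm_a * norm_b)


list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
list_C = ['address', 'ip', 'network']

print(cosine_similarity(list_A, list_B))
print(cosine_similarity(list_A, list_C))
```

With exact token matching, list_B scores higher against list_A than list_C does, because it shares both 'email' (twice in list_A) and 'address'. Near-synonyms such as 'mail' and 'netmail' do not count as matches here; capturing those would need stemming or word embeddings.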

Newbie: where to start given a problem to predict future success or not

Posted by ぃ、小莉子 on 2019-12-04 21:36:29
We have a production web-based product that allows users to make predictions about the future value (or demand) of goods. The historical data contains about 100k examples, and each example has about 5 parameters. Consider a class of data called a prediction:
prediction {
    id: int
    predictor: int
    predictionDate: date
    predictedProductId: int
    predictedDirection: byte (0 for decrease, 1 for increase)
    valueAtPrediciton: float
}
and a paired result class that measures the result of the prediction:
predictionResult {
    id: int
    valueTenDaysAfterPrediction: float
    valueTwentyDaysAfterPrediction: float

What is Big Data & What classifies as Big data? [closed]

Posted by 霸气de小男生 on 2019-12-04 21:26:15
I have gone through a lot of articles, but I don't seem to get a perfectly clear answer on what exactly big data is. On one page I saw "any data which is bigger for your usage, is big data i.e. 100 MB is considered big data for your mailbox but not your hard disc". Whereas another article said "big data to be usually more than 1 TB with different volume / variety / velocity

Data Sets For Data Mining Tasks [closed]

Posted by 巧了我就是萌 on 2019-12-04 19:45:44
I am relatively new to the field of data mining. I am currently implementing some data preprocessing algorithms such as PCA and min-max normalization. Our professor said we could download data sets available on the web. To start with, though, I want a simple data set with a relatively small number of attributes for my algorithms, and would then move on to more complex data sets. Can anyone provide a link to simple data sets that you have used in your own data mining work, e.g. something pertaining to students' marks, age, height, etc., or employee data of a company? Any assistance would
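
For trying out min-max normalization before downloading anything, a tiny hand-made table is enough. The sketch below rescales each column to [0, 1]; the function name `min_max_normalize` and the student records are invented for illustration and do not come from any real data set.

```python
def min_max_normalize(column):
    """Rescale a sequence of numbers linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: nothing to rescale
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical toy data: (age in years, height in cm, exam mark).
students = [
    (18, 160.0, 55),
    (21, 175.0, 80),
    (19, 168.0, 67),
    (24, 182.0, 91),
]

# Normalize column by column, then put the rows back together.
columns = list(zip(*students))
normalized_rows = list(zip(*(min_max_normalize(col) for col in columns)))

for row in normalized_rows:
    print(row)
```

Each attribute is scaled independently, so the youngest student gets 0.0 for age and the oldest gets 1.0, regardless of the units of the other columns.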

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

Posted by 烈酒焚心 on 2019-12-04 17:24:37
Question: I have a data table ("norm") containing numeric (at least as far as I can see) normalized values of the following form: When I execute k <- kmeans(norm,center=3) I receive the following error: Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1) Can you help me? Thank you!
Answer 1: kmeans cannot handle data that has NA values. The mean and variance are then no longer well defined, and you can no longer tell which center is closest.
Answer 2: Error in do_one(nmeth) : NA/NaN/Inf
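
Answer 1 attributes the error to missing values. A minimal sketch of the check that answer implies, written in Python rather than R: split the rows of a table into those made up entirely of finite numbers and those containing a missing, NaN or infinite entry, before any clustering is attempted. The helper name `split_finite_rows` and the sample values in `norm` are made up for illustration.

```python
import math


def split_finite_rows(rows):
    """Return (clean_rows, bad_row_indices) for a table given as a list of rows.

    A row is clean only if every value is a real number that is neither
    NaN nor +/-Inf; None stands in for a missing value here.
    """
    clean, bad_indices = [], []
    for i, row in enumerate(rows):
        if all(isinstance(v, (int, float)) and math.isfinite(v) for v in row):
            clean.append(row)
        else:
            bad_indices.append(i)
    return clean, bad_indices


# Hypothetical normalized table with three problematic rows.
norm = [
    [0.1, 0.5],
    [float("nan"), 0.2],
    [0.9, float("inf")],
    [0.4, 0.4],
    [None, 0.3],
]

clean, bad = split_finite_rows(norm)
print("rows to inspect:", bad)
print("rows safe to cluster:", clean)
```

Listing the offending row indices first, rather than silently dropping them, makes it easier to see whether the problem comes from the raw data or from the normalization step (for example a division by zero).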

How to score a linear model using PMML file and Augustus on Python

Posted by 白昼怎懂夜的黑 on 2019-12-04 15:16:54
I am new to Python, PMML and Augustus, so this is a bit of a newbie question. I have a PMML file that I want to use for scoring after every new iteration of data. I have to use Python with Augustus only to complete this exercise. I have read various articles; some are worth mentioning because they are good ( http://augustusdocs.appspot.com/docs/v06/model_abstraction/augustus_and_pmml.html , http://augustus.googlecode.com/svn-history/r191/trunk/augustus/modellib/regression/producer/Producer.py ). I have read the Augustus documentation relevant to scoring to understand how it works, but I am unable to solve this
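
As background on what scoring a linear model from PMML involves, the sketch below reads the intercept and coefficients from a PMML `RegressionTable` and applies them to one record. It deliberately uses only Python's standard library and does not use Augustus, so it does not satisfy the Augustus-only requirement; it is meant to show the arithmetic a scorer performs. The PMML fragment is a trimmed, hypothetical example (a complete file also carries a header, data dictionary and mining schema), and the helper names are assumptions.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical PMML fragment: y = 1.5 + 2.0 * x1 - 0.5 * x2
PMML_TEXT = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def load_linear_model(pmml_text):
    """Return (intercept, {field: coefficient}) from the first RegressionTable.

    Simplification: only NumericPredictor elements with exponent 1 are
    handled; categorical predictors and interaction terms are ignored.
    """
    root = ET.fromstring(pmml_text)
    for elem in root.iter():
        if local_name(elem.tag) == "RegressionTable":
            intercept = float(elem.get("intercept", "0"))
            coefficients = {
                child.get("name"): float(child.get("coefficient"))
                for child in elem
                if local_name(child.tag) == "NumericPredictor"
            }
            return intercept, coefficients
    raise ValueError("no RegressionTable found in the PMML text")


def score(model, record):
    """Apply the linear model to one record given as {field: value}."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_linear_model(PMML_TEXT)
print(score(model, {"x1": 3.0, "x2": 4.0}))
```

Scoring each new batch of data then amounts to calling `score` once per record with the same loaded model.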

Calculate ordering of dendrogram leaves

Posted by 江枫思渺然 on 2019-12-04 15:03:04
I have five points and I need to create a dendrogram from them. The function 'dendrogram' can be used to find the ordering of these points, as shown below. However, I do not want to use dendrogram because it is slow and results in an error for a large number of points (I asked about this in the question "Python alternate way to find dendrogram"). Can someone point me to how to convert the 'linkage' output (Z) to the "dendrogram(Z)['ivl']" value?
>>> from hcluster import pdist, linkage, dendrogram
>>> import numpy
>>> from numpy.random import rand
>>> x = rand(5,3)
>>> Y = pdist(x)
>>> Z = linkage(Y)
>>> Z
array([[ 1.
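
One way to get the leaf order without drawing anything is to walk the merge tree encoded in Z directly. The sketch below assumes the usual linkage layout: row i of Z merges nodes Z[i][0] and Z[i][1] into a new node numbered n + i, where n is the number of original points. The function name and the hand-written example matrix are made up for illustration. With default plotting options the result should correspond to dendrogram(Z)['ivl'], except that 'ivl' holds the indices as strings. SciPy's `scipy.cluster.hierarchy.leaves_list` computes this ordering directly; whether the installed `hcluster` version exposes the same function is an assumption to check.

```python
def leaves_in_dendrogram_order(Z):
    """Left-to-right leaf order of the tree described by a linkage matrix Z."""
    n = len(Z) + 1           # number of original observations
    order = []
    stack = [2 * n - 2]      # the root is the node created by the last merge
    while stack:
        node = stack.pop()
        if node < n:
            order.append(node)  # ids below n are original points (leaves)
        else:
            left, right = int(Z[node - n][0]), int(Z[node - n][1])
            # Push right first so that the left subtree is visited first.
            stack.append(right)
            stack.append(left)
    return order


# Hypothetical linkage matrix for 5 points: [left, right, distance, size].
Z = [
    [0, 1, 0.1, 2],  # node 5 = {0, 1}
    [2, 3, 0.2, 2],  # node 6 = {2, 3}
    [4, 5, 0.5, 3],  # node 7 = {4, 0, 1}
    [6, 7, 0.9, 5],  # node 8 = root
]
print(leaves_in_dendrogram_order(Z))
```

The explicit stack avoids recursion limits on deep trees and visits each node once, so the cost grows linearly with the number of points.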