decision-tree

How to calculate the threshold value for numeric attributes in Quinlan's C4.5 algorithm?

Posted by 纵饮孤独 on 2019-11-30 14:56:35
I am trying to find out how the C4.5 algorithm determines the threshold value for numeric attributes. I have researched this and cannot understand it; in most places I've found this information: The training samples are first sorted on the values of the attribute Y being considered. There are only a finite number of these values, so let us denote them in sorted order as {v1, v2, …, vm}. Any threshold value lying between vi and vi+1 will have the same effect of dividing the cases into those whose value of the attribute Y lies in {v1, v2, …, vi} and those whose value is in {vi+1, vi+2, …, vm}. There are thus only m−1 possible splits on Y, all of which need to be examined.
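Read literally, the quoted passage says to evaluate each of the m−1 cut points between adjacent sorted values and keep the one with the best splitting criterion (C4.5 is usually described as taking the midpoint of the two values as the candidate, then reporting the largest value actually occurring in the data that does not exceed it). Below is a minimal Python sketch of that scan using information gain, with hypothetical data; it illustrates the mechanics rather than reproducing Quinlan's exact implementation:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Scan the m-1 candidate cut points between adjacent sorted values."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint candidate
        left = [l for v, l in pairs[:i]]
        right = [l for v, l in pairs[i:]]
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(pairs)
        if gain > best[1]:
            best = (threshold, gain)
    return best

values = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81]   # hypothetical numeric attribute
labels = ['n', 'y', 'y', 'y', 'y', 'n', 'n', 'y', 'n', 'n']
print(best_threshold(values, labels))
```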

How to count the observations falling in each node of a tree

Posted by 天涯浪子 on 2019-11-30 07:45:17
I am currently working with the wine data in the MMST package. I have split the dataset into training and test sets and built a tree with the following code:

```r
library("rpart")
library("gbm")
library("randomForest")
library("MMST")

data(wine)
aux <- c(1:178)
train_indis <- sample(aux, 142, replace = FALSE)
test_indis <- setdiff(aux, train_indis)
train <- wine[train_indis, ]
test <- wine[test_indis, ]
#### divide the dataset into training and testing
model.control <- rpart.control(minsplit = 5, xval = 10, cp = 0)
fit_wine <- rpart(class ~ MalicAcid + Ash + AlcAsh + Mg + Phenols + Proa + Color + Hue + OD
```
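The rpart excerpt is cut off, but for reference the fitted rpart object stores per-node observation counts in fit_wine$frame$n. The same question has a direct answer in scikit-learn; the snippet below is a cross-library sketch (model settings are illustrative, not the question's): the fitted tree records how many training samples reach each node, and decision_path() counts node visits for any dataset.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
clf = DecisionTreeClassifier(min_samples_split=5, random_state=0).fit(X, y)

# n_node_samples[i] is the number of training observations falling in node i
for node, count in enumerate(clf.tree_.n_node_samples):
    print(f"node {node}: {count} observations")

# For arbitrary data (e.g. a held-out test set), decision_path() returns an
# indicator matrix of the nodes each sample visits; column sums give counts.
test_counts = clf.decision_path(X).sum(axis=0)
```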

Can sklearn DecisionTreeClassifier truly work with categorical data?

Posted by 隐身守侯 on 2019-11-30 04:40:05
Question: While working with the DecisionTreeClassifier I visualized it using graphviz, and I have to say, to my astonishment, it seems to take categorical data and use it as continuous data. All my features are categorical, and for example you can see the following tree (note that the first feature, X[0], has 6 possible values: 0, 1, 2, 3, 4, 5). From what I found here, the class uses a tree class which is a binary tree, so it is a limitation of sklearn. Does anyone know a way that I am missing to make it treat the features as categorical?
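The observation is correct: sklearn's decision trees treat every feature as numeric and split on thresholds such as X[0] <= 2.5. The standard workaround is to one-hot encode the categorical features first, so each split becomes a test on a single category. A minimal sketch with hypothetical data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical categorical feature with values 0..5 and toy labels
X = np.array([[0], [1], [2], [3], [4], [5], [2], [3]])
y = np.array([0, 0, 1, 1, 0, 1, 1, 1])

# Each category becomes its own binary column, so a threshold split on a
# column now means "is / is not this category" rather than an order relation.
enc = OneHotEncoder(handle_unknown="ignore")
X_onehot = enc.fit_transform(X).toarray()

clf = DecisionTreeClassifier(random_state=0).fit(X_onehot, y)
print(clf.tree_.feature[:5])  # indices of the one-hot columns chosen for splits
```

One-hot encoding can deepen the tree for high-cardinality features; libraries with native categorical support (e.g. LightGBM) are often a better fit in that case.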

confused about random_state in decision tree of scikit learn

Posted by 好久不见. on 2019-11-30 03:30:54
I am confused about the random_state parameter; I am not sure why decision tree training needs any randomness. My thoughts: (1) is it related to random forest? (2) is it related to splitting the training and testing data sets? If so, why not use the train/test split method directly (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html)? http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

```python
>>> from sklearn.datasets import load_iris
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
```
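Neither guess is quite it: random_state controls the randomness inside the tree-building itself. The splitter draws on an RNG, for example when max_features < n_features (features are sampled per split) and when breaking ties between equally good splits, so fixing random_state makes the fitted tree reproducible. A quick sketch:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# With the same seed, two fits produce identical trees.
a = DecisionTreeClassifier(random_state=0).fit(X, y)
b = DecisionTreeClassifier(random_state=0).fit(X, y)
print((a.tree_.feature == b.tree_.feature).all())   # True

# With max_features set, different seeds can yield different trees.
c = DecisionTreeClassifier(max_features=2, random_state=1).fit(X, y)
d = DecisionTreeClassifier(max_features=2, random_state=2).fit(X, y)
print(c.tree_.node_count == d.tree_.node_count)     # may be False: trees differ
```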

Dictionary object to decision tree in Pydot

Posted by 狂风中的少年 on 2019-11-29 20:48:18
Question: I have a dictionary object as follows:

```python
menu = {'dinner': {'chicken': 'good', 'beef': 'average',
                   'vegetarian': {'tofu': 'good',
                                  'salad': {'caeser': 'bad', 'italian': 'average'}},
                   'pork': 'bad'}}
```

I'm trying to create a graph (decision tree) with pydot using this 'menu' data. 'Dinner' would be the top node and its values (chicken, beef, etc.) would sit below it. Referring to the link, the graph function takes two parameters: a source and a node. It would look something like the example image in the link, except 'king' would be 'dinner'
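A straightforward approach is to walk the nested dict recursively, adding an edge from each key to its children and attaching string values as leaf nodes. A sketch (note that pydot identifies nodes by name, so repeated ratings like 'good' would collapse into a single node unless each leaf gets a unique name):

```python
import pydot

menu = {'dinner': {'chicken': 'good', 'beef': 'average',
                   'vegetarian': {'tofu': 'good',
                                  'salad': {'caeser': 'bad', 'italian': 'average'}},
                   'pork': 'bad'}}

def add_edges(graph, parent, subtree):
    """Add an edge parent -> key for every entry, recursing into dicts."""
    for key, value in subtree.items():
        graph.add_edge(pydot.Edge(parent, key))
        if isinstance(value, dict):
            add_edges(graph, key, value)
        else:
            # Unique leaf name avoids merging nodes that share a rating
            leaf = f"{key}_{value}"
            graph.add_node(pydot.Node(leaf, label=value))
            graph.add_edge(pydot.Edge(key, leaf))

graph = pydot.Dot(graph_type="graph")
root = next(iter(menu))              # 'dinner'
add_edges(graph, root, menu[root])
graph.write_png("menu.png")          # requires Graphviz to be installed
```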

How do I solve overfitting in random forest of Python sklearn?

Posted by 孤者浪人 on 2019-11-29 20:19:27
I am using the RandomForestClassifier implemented in the Python sklearn package to build a binary classification model. Below are the results of cross-validation:

Fold 1: Train: 164, Test: 40, Train Accuracy: 0.914634146341, Test Accuracy: 0.55
Fold 2: Train: 163, Test: 41, Train Accuracy: 0.871165644172, Test Accuracy: 0.707317073171
Fold 3: Train: 163, Test: 41, Train Accuracy: 0.889570552147, Test Accuracy: 0.585365853659
Fold 4: Train: 163, Test: 41, Train Accuracy: 0.871165644172, Test Accuracy: 0.756097560976
Fold 5: Train: 163, Test: 41, Train Accuracy: 0.883435582822, Test Accuracy: 0.512195121951

I
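The large gap between train accuracy (roughly 0.87 to 0.91) and test accuracy (roughly 0.51 to 0.76) on ~160 training samples is the classic overfitting signature. The usual levers are to constrain tree complexity and average over more trees. A sketch with a stand-in dataset sized like the question's (the settings are illustrative, not prescriptive):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data sized like the question's (~204 samples)
X, y = make_classification(n_samples=204, n_features=20, random_state=0)

# Common levers against overfitting: shallower trees, larger leaves,
# fewer features per split, more trees to average over.
clf = RandomForestClassifier(
    n_estimators=500,
    max_depth=5,           # limit tree depth
    min_samples_leaf=5,    # require several samples per leaf
    max_features="sqrt",   # decorrelate the trees
    random_state=0,
)
print(cross_val_score(clf, X, y, cv=5).mean())
```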

How to compute error rate from a decision tree?

Posted by 一笑奈何 on 2019-11-29 19:52:58
Does anyone know how to calculate the error rate for a decision tree with R? I am using the rpart() function. Assuming you mean computing the error rate on the sample used to fit the model, you can use printcp(). For example, using the on-line example:

```r
> library(rpart)
> fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
> printcp(fit)

Classification tree:
rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)

Variables actually used in tree construction:
[1] Age   Start

Root node error: 17/81 = 0.20988

n= 81

        CP nsplit rel error  xerror    xstd
1 0.176471      0   1.00000 1.00000 0.21559
```
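To turn printcp()'s relative errors into absolute error rates, multiply by the root node error: rel error gives the training (resubstitution) error and xerror gives the cross-validated estimate. Checking the arithmetic for the row shown above, in plain Python:

```python
root_node_error = 17 / 81     # 0.20988, misclassification rate of the root
rel_error = 1.00000           # relative training (resubstitution) error
xerror = 1.00000              # relative cross-validated error

train_error = root_node_error * rel_error   # absolute training error rate
cv_error = root_node_error * xerror         # absolute cross-validated rate
print(round(train_error, 5), round(cv_error, 5))   # 0.20988 0.20988
```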

How do I find which attributes my tree splits on, when using scikit-learn?

Posted by 懵懂的女人 on 2019-11-29 19:43:58
I have been exploring scikit-learn, making decision trees with both the entropy and gini splitting criteria and examining the differences. My question is: how can I "open the hood" and find out exactly which attributes the trees split on at each level, along with their associated information values, so I can see where the two criteria make different choices? So far, I have explored the 9 methods outlined in the documentation. They don't appear to allow access to this information. But surely this information is accessible? I'm envisioning a list or dict that has entries for node and gain.
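The split details live on the fitted estimator's tree_ attribute, whose parallel arrays describe every node. A sketch that walks the internal nodes and derives an impurity-decrease value per split (the weighting here mirrors how sklearn scores candidate splits):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)

t = clf.tree_
for node in range(t.node_count):
    if t.children_left[node] == -1:   # -1 marks a leaf node
        continue
    left, right = t.children_left[node], t.children_right[node]
    # Impurity decrease: parent impurity minus the sample-weighted
    # impurities of the two children.
    n, nl, nr = t.n_node_samples[node], t.n_node_samples[left], t.n_node_samples[right]
    gain = t.impurity[node] - (nl / n) * t.impurity[left] - (nr / n) * t.impurity[right]
    print(f"node {node}: split on feature {t.feature[node]} "
          f"at {t.threshold[node]:.3f}, gain {gain:.4f}")
```

Fitting a second tree with criterion="gini" and diffing the printed splits shows exactly where the two criteria diverge.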

What does the value of 'leaf' in the following xgboost model tree diagram mean?

Posted by 元气小坏坏 on 2019-11-29 13:01:22
Question: I am guessing that it is a conditional probability given that the above (tree branch) condition holds. However, I am not clear on it. If you want to read more about the data used or how we get this diagram, go to: http://machinelearningmastery.com/visualize-gradient-boosting-decision-trees-xgboost-python/ Answer 1: The leaf attribute is the predicted value. In other words, if the evaluation of a tree model ends at that terminal node (aka leaf node), then this is the value that is returned. In
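To be precise, for a binary:logistic objective the leaf values are raw margin scores, not probabilities; the model's output is the sigmoid of the sum of the leaf values reached in all trees, plus the bias implied by base_score. A sketch of that arithmetic with hypothetical leaf values:

```python
import math

# Hypothetical leaf values returned by two boosted trees for one sample
leaf_values = [0.4, -0.1]
base_score = 0.5                      # xgboost's default global bias

# logit(0.5) = 0, so here the margin is just the sum of the leaf values
margin = math.log(base_score / (1 - base_score)) + sum(leaf_values)
probability = 1 / (1 + math.exp(-margin))
print(probability)                    # ~0.574: predicted P(y = 1)
```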

Spark MLib Decision Trees: Probability of labels by features?

Posted by 早过忘川 on 2019-11-29 12:05:17
I managed to display the total probabilities of my labels; for example, after displaying my decision tree, I have a table:

Total Predictions:
65% impressions
30% clicks
5% conversions

But my issue is finding the probabilities (or counts) by feature (by node), for example:

    if feature1 > 5
      if feature2 < 10
        Predict Impressions
        samples: 30 Impressions
      else feature2 >= 10
        Predict Clicks
        samples: 5 Clicks

Scikit does this automatically; I am trying to find a way to do it with Spark. Note: the following solution is for Scala only. I didn't find a way to do it in Python. Assuming you just want a
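The Scala answer is cut off above. Language aside, the underlying idea is simple: replay the rows down the learned split predicates and tally which leaf each row reaches. A toy Python sketch of just that idea, with a hypothetical hand-built tree, independent of Spark's API:

```python
from collections import Counter

# A tree as nested tuples (feature, threshold, left_if_le, right_if_gt)
# or a bare string for a leaf label; mirrors the pseudo-tree above.
tree = ("feature1", 5, "Conversions",                    # feature1 <= 5
        ("feature2", 10, "Impressions", "Clicks"))       # feature1 > 5

def leaf_for(row, node):
    """Descend the split predicates until a leaf label is reached."""
    if isinstance(node, str):
        return node
    feature, threshold, left, right = node
    return leaf_for(row, left if row[feature] <= threshold else right)

rows = [{"feature1": 7, "feature2": 3},
        {"feature1": 8, "feature2": 12},
        {"feature1": 2, "feature2": 1}]
print(Counter(leaf_for(r, tree) for r in rows))
# Counter({'Impressions': 1, 'Clicks': 1, 'Conversions': 1})
```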