decision-tree

How to pass mixed (categorical and numeric) features to DecisionTreeRegressor in sklearn?

白昼怎懂夜的黑 submitted on 2019-12-11 13:48:59
Question: How can I pass categorical and numeric features to DecisionTreeRegressor in sklearn? The code below shows how it is used in general with numeric features:

```python
make_tree = tree.DecisionTreeRegressor()
fit_tree = make_tree.fit(X_train, y_train)
```

Answer 1: First, all categorical features should be encoded (represented by numbers) to be interpretable by the regression model. To do so, you can use LabelEncoder followed by OneHotEncoder. For high-cardinality features, you can use FeatureHasher. As …
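Building on that answer, here is a minimal sketch (column names and data are invented for illustration, not from the original question) that one-hot encodes a categorical column and passes numeric columns through untouched, using sklearn's ColumnTransformer:

```python
# Minimal sketch: mixed categorical + numeric features for DecisionTreeRegressor.
# The column names and values here are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

X_train = pd.DataFrame({
    "city": ["NY", "LA", "NY", "SF"],   # categorical feature
    "rooms": [3, 2, 4, 1],              # numeric feature
})
y_train = [300, 250, 400, 500]

# One-hot encode the categorical column; numeric columns pass through as-is.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["city"])],
    remainder="passthrough",
)
X_encoded = encoder.fit_transform(X_train)

make_tree = DecisionTreeRegressor()
fit_tree = make_tree.fit(X_encoded, y_train)
```

ColumnTransformer avoids the two-step LabelEncoder/OneHotEncoder dance by applying the encoder only to the named columns.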

Convert Weka tree into hierarchyid for a SQL hierarchical table

浪子不回头ぞ submitted on 2019-12-11 12:15:42
Question: This question relates to the answer given in this post. I want to convert the output from a tree analysis in Weka into a hierarchical table of decision splits and leaf values (as per the post linked above). I can parse the Weka output to extract the fac, split and val values, but I'm struggling to parse the output and generate the correct hierarchyid values. The first thing I note is that the tree description doesn't map one-to-one onto the records in decisions. There are 20 lines in the Weka …
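For orientation, SQL Server's hierarchyid can be parsed from slash-delimited path strings, so one way to frame the problem is assigning such a path to every tree node. A hypothetical Python sketch (the node structure is invented, not the actual parsed Weka output):

```python
# Assign hierarchyid-style string paths ('/', '/1/', '/2/', ...) to the
# nodes of a parsed decision tree, depth-first. Structure is hypothetical.
def assign_paths(node, path="/"):
    yield path, node
    for i, child in enumerate(node.get("children", []), start=1):
        yield from assign_paths(child, f"{path}{i}/")

tree = {
    "split": "fac <= 0.5",
    "children": [{"val": "leaf A"}, {"val": "leaf B"}],
}
for path, node in assign_paths(tree):
    print(path, node.get("split") or node.get("val"))
# /    fac <= 0.5
# /1/  leaf A
# /2/  leaf B
```

The resulting strings can then be converted on the SQL side, e.g. CAST('/1/' AS hierarchyid).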

Machine learning algorithm score changes without any change in data or steps

允我心安 submitted on 2019-12-11 08:55:56
Question: I am new to machine learning and getting started with the Titanic problem on Kaggle. I have written a simple algorithm to predict the result on the test data. My question/confusion is that every time I execute the algorithm with the same dataset and the same steps, the score value (last statement in the code) changes. I am not able to understand this behaviour. Code:

```python
# imports
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# load data
train = pd.read_csv('train
```
…
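The usual explanation (a common answer to this kind of question, not taken from the original thread): DecisionTreeClassifier randomly permutes features when searching for splits, so ties between equally good splits are broken differently on each run unless the seed is pinned:

```python
from sklearn.tree import DecisionTreeClassifier

# Pinning random_state makes the fitted tree, and hence the score,
# reproducible across runs on the same data.
model = DecisionTreeClassifier(random_state=42)
```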

Equivalent of predict_proba for DecisionTreeRegressor

风格不统一 submitted on 2019-12-11 08:29:45
Question: scikit-learn's DecisionTreeClassifier supports predicting probabilities of each class via the predict_proba() function. This is absent from DecisionTreeRegressor:

```
AttributeError: 'DecisionTreeRegressor' object has no attribute 'predict_proba'
```

My understanding is that the underlying mechanics are pretty similar between decision tree classifiers and regressors, the main difference being that predictions from the regressors are calculated as means over the potential leaves. So I'd expect it to be …
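For context, a minimal sketch of the asymmetry (toy data, not from the original question): the classifier exposes the class fractions in the reached leaf, while the regressor's predict() returns only the leaf mean:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict_proba(X[:1]))   # class fractions in the leaf, e.g. [[1. 0. 0.]]

reg = DecisionTreeRegressor(max_depth=3).fit(X, y.astype(float))
print(reg.predict(X[:1]))         # a single value: the mean target in the leaf
```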

One-Hot Encoding in Scikit-learn for only part of the DataFrame

自作多情 submitted on 2019-12-11 07:49:13
Question: I am trying to use a decision tree classifier on my data, which looks very similar to the data in this tutorial: https://www.ritchieng.com/machinelearning-one-hot-encoding/ The tutorial then goes on to convert the strings into numeric data:

```python
X = pd.read_csv('titanic_data.csv')
X = X.select_dtypes(include=[object])
le = preprocessing.LabelEncoder()
X_2 = X.apply(le.fit_transform)
```

This leaves the DataFrame with each string column replaced by integer codes (screenshot omitted). After this, the data is put through the OneHotEncoder and I assume can then …
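One way to encode only part of the DataFrame (a sketch with hypothetical Titanic-like columns, not the poster's data): apply pd.get_dummies to just the object-dtype columns, leaving numeric columns untouched:

```python
import pandas as pd

# Hypothetical Titanic-like frame: two string columns, one numeric.
X = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "S"],
    "Age": [22.0, 38.0, 26.0],
})

cat_cols = X.select_dtypes(include=["object"]).columns
X_encoded = pd.get_dummies(X, columns=cat_cols)  # numeric columns kept as-is

print(X_encoded.columns.tolist())
# ['Age', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_S']
```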

WEKA-generated models do not seem to predict class and distribution given the attribute index

本秂侑毒 submitted on 2019-12-11 07:10:11
Question: Overview. I am using the WEKA API 3.7.10 (developer version) with my pre-made .model files. I made 25 models: five outcome variables for each of five algorithms:

- J48 decision tree
- Alternating decision tree
- Random forest
- LogitBoost
- Random subspace

I am having problems with J48, random subspace and random forest. Necessary files: the following is the ARFF representation of my data after creation:

```
@relation WekaData
@attribute ageDiagNum numeric
@attribute raceGroup {Black,Other,Unknown,White}
```
…

scikit-learn decision tree regression: retrieve all samples for a leaf (not the mean)

末鹿安然 submitted on 2019-12-11 06:51:58
Question: I have started using scikit-learn decision trees and so far they are working out quite well, but one thing I need to do is retrieve the set of sample Y values for the leaf node, especially when running a prediction. That is, given an input feature vector X, I want to know the set of corresponding Y values at the leaf node instead of just the regression value, which is the mean (or median) of those values. Of course one would want the sample mean to have a small variance, but I do want to extract the …
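One known route (a sketch on synthetic data, not the poster's): the regressor's apply() method returns the leaf index for each sample, so the training targets sharing a leaf with a new sample can be collected directly:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_train, y_train = rng.rand(200, 4), rng.rand(200)
X_new = rng.rand(1, 4)

reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

train_leaf_ids = reg.apply(X_train)   # leaf index of every training sample
new_leaf_id = reg.apply(X_new)[0]     # leaf the new sample falls into

leaf_samples = y_train[train_leaf_ids == new_leaf_id]
print(leaf_samples)                                  # all Y values in that leaf
print(leaf_samples.mean(), reg.predict(X_new)[0])    # these two agree
```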

Scikit-learn with GraphViz exports empty outputs

冷暖自知 submitted on 2019-12-11 06:05:33
Question: I would like to export a decision tree using sklearn. First I trained a decision tree classifier:

```python
self._selected_classifier = tree.DecisionTreeClassifier()
self._selected_classifier.fit(train_dataframe, train_class)
self._column_names = list(train_dataframe.columns.values)
```

After that I used the following method in order to export the decision tree:

```python
def _create_graph_visualization(self):
    decision_tree_classifier = self._selected_classifier
    from sklearn.externals.six import StringIO
    dot_data = …
```
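For comparison, a minimal export that produces a non-empty .dot file (a sketch on toy data, not the poster's pipeline):

```python
from sklearn import tree
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = tree.DecisionTreeClassifier(max_depth=3).fit(X, y)

# Write the tree description straight to disk as Graphviz dot.
tree.export_graphviz(
    clf,
    out_file="tree.dot",
    feature_names=[f"f{i}" for i in range(X.shape[1])],
    filled=True,
)
# Then render from a shell: dot -Tpng tree.dot -o tree.png
```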

Building a Decision Tree

萝らか妹 submitted on 2019-12-11 05:54:41
Question: When building a decision tree, at each node we select the best feature and then the best splitting position for that feature. However, when all values of the best feature are 0 for the samples in the current node/set, what do I do? All samples keep being grouped to one side (the <= 0 branch), and an infinite loop occurs. For example: #left: 1500, #right: 0; then #left: 1500, #right: 0; and so on. Just for reference, I'm following this pseudo-code:

```
GrowTree(S)
  if (y_i = C for all i in …
```
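The standard guard (a sketch with invented helper logic, not the poster's pseudo-code): only accept a split if both children are non-empty; if no feature/threshold pair achieves that, stop recursing and emit a leaf:

```python
from collections import Counter

def grow_tree(X, y, depth=0, max_depth=5):
    # Pure node or depth limit reached: make a leaf (majority class).
    if len(set(y)) == 1 or depth >= max_depth:
        return {"leaf": Counter(y).most_common(1)[0][0]}
    # Naive split search: first (feature, threshold) whose split makes progress.
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if left and right:  # guard: both children non-empty => progress
                return {
                    "split": (f, t),
                    "left": grow_tree([X[i] for i in left],
                                      [y[i] for i in left], depth + 1, max_depth),
                    "right": grow_tree([X[i] for i in right],
                                       [y[i] for i in right], depth + 1, max_depth),
                }
    # No threshold separates the samples (e.g. all values are 0): make a leaf.
    return {"leaf": Counter(y).most_common(1)[0][0]}

print(grow_tree([[0, 1], [0, 2], [0, 2]], ["a", "b", "b"]))
```

Without the final fallback, a feature whose values are all identical sends every sample down one branch forever, which is exactly the loop described above.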

How to graph a tree with graphviz?

前提是你 submitted on 2019-12-11 05:28:34
Question: I can't reproduce a simple example. Here is how it goes:

```python
import pandas as pd
import numpy as np
import sklearn as skl
from sklearn import tree
from sklearn.cross_validation import train_test_split as tts

# import data and give a little overview
sample = pd.read_stata('sample_data.dta')
s = sample

# Let's learn
X = np.array((s.crisis, s.cash, s.industry, s.current_debt, s.activity)).reshape(1000, 5)
y = np.array(s.wc_measure)
X_train, X_test, y_train, y_test = tts(X, y, test_size = .8)
my_tree …
```
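A sketch of one working route (toy data; assumes the graphviz Python package is installed alongside the Graphviz binaries, and is not the poster's dataset):

```python
import graphviz
from sklearn import tree
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
my_tree = tree.DecisionTreeRegressor(max_depth=3).fit(X, y)

dot_data = tree.export_graphviz(my_tree, out_file=None, filled=True)
graphviz.Source(dot_data).render("my_tree")  # writes my_tree.pdf
```

Separately, note that np.array((s.crisis, …)) stacks the five series as rows, giving shape (5, 1000); reshape(1000, 5) then scrambles the feature alignment rather than transposing. np.column_stack((s.crisis, …)) would give the intended 1000 x 5 layout.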