SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?

Posted by 懵懂的女人 on 2019-12-13 15:40:27

Question


I've got an RDD of LabeledPoint on which I want to run a decision tree (and later a random forest):

scala> transformedData.collect
res8: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((0.0,(400036,[7744],[2.0])), (0.0,(400036,[7744,8608],[3.0,3.0])), (0.0,(400036,[7744],[2.0])), (0.0,(400036,[133,218,2162,7460,7744,9567],[1.0,1.0,2.0,1.0,42.0,21.0])), (0.0,(400036,[133,218,1589,2162,2784,2922,3274,6914,7008,7131,7460,8608,9437,9567,199999,200021,200035,200048,200051,200056,200058,200064,200070,200072,200075,200087,400008,400011],[4.0,1.0,6.0,53.0,6.0,1.0,1.0,2.0,11.0,17.0,48.0,3.0,4.0,113.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,28.0,1.0,1.0,1.0,1.0,1.0,4.0])), (0.0,(400036,[1589,3585,4830,6935,6936,7744,400008,400011],[2.0,6.0,3.0,52.0,4.0,3.0,1.0,2.0])), (0.0,(400036,[1589,2162,2784,2922,4123,7008,7131,7792,8608],[23.0,70.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0])), (0.0,(400036,[4830,6935,6936,400008,400011],[1.0,36.0,...

using this code:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.tree.impurity.Gini

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]() //change to what?
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainClassifier(
  trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

In my data I've got two types of features:

  1. some features are counts of user visits to a given website/domain (the feature is a website/domain and its value is the number of visits)
  2. the rest of the features are declarative variables (binary/categorical)

Is there a way to create categoricalFeaturesInfo automatically from LabeledPoint? I want to check the levels of my declarative variables (type 2) and then use that information to create categoricalFeaturesInfo.

I have a list with the declarative variables:

List(6363,21345,23455,...

Answer 1:


categoricalFeaturesInfo should map from a feature index to the number of categories (arity) of that feature. Generally speaking, identifying categorical variables can be expensive, especially if they are heavily mixed with continuous variables. Moreover, depending on your data, automatic detection can produce both false positives and false negatives. Keeping that in mind, it is better to set these values manually.
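
Since the question already has a list of the declarative feature indices, setting the values "manually" can still be scripted. The following is only a sketch, not an MLlib facility: `aritiesFromMaxima` and `buildCategoricalFeaturesInfo` are hypothetical helper names, and the code assumes each categorical feature is already encoded as 0.0, 1.0, ..., k-1, so that the arity is the largest observed value plus one.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: turn the largest observed value per feature into an arity.
// Assumes categories are encoded as 0.0, 1.0, ..., k-1, hence arity = max + 1.
def aritiesFromMaxima(maxByIndex: Map[Int, Double]): Map[Int, Int] =
  maxByIndex.map { case (idx, max) => idx -> (max.toInt + 1) }

// Hypothetical helper: scan the data once and keep, for every index known to be
// categorical, the largest value stored in any vector.
def buildCategoricalFeaturesInfo(
    data: RDD[LabeledPoint],
    categoricalIndices: Set[Int]): Map[Int, Int] = {
  val maxByIndex = data
    .flatMap { lp =>
      val sv = lp.features.toSparse
      // Only explicitly stored entries are visited; implicit zeros are skipped.
      sv.indices.zip(sv.values).filter { case (idx, _) => categoricalIndices.contains(idx) }
    }
    .reduceByKey((a, b) => math.max(a, b))
    .collect()
    .toMap
  aritiesFromMaxima(maxByIndex)
}

// Usage in spark-shell (declarativeIndices stands for the asker's list of indices):
// val categoricalFeaturesInfo =
//   buildCategoricalFeaturesInfo(trainingData, declarativeIndices.toSet)
```

A listed feature that is never stored in any vector does not appear in the result and is therefore treated as continuous, which is harmless for a constant feature. Note also that maxBins has to be at least as large as the largest arity in the map, and that for a very long index list broadcasting the set may be preferable to capturing it in the closure.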

If you still want to create categoricalFeaturesInfo automatically, you can take a look at ml.feature.VectorIndexer. It is not directly applicable in this case, but it should provide a useful code base for building your own solution.
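
For orientation, here is roughly what VectorIndexer does when used as-is. This is a sketch under assumptions, not a recommended solution: it assumes a Spark 1.x shell (where ml.feature transformers read mllib vectors from a DataFrame; on 2.x and later the vector column would first need converting, for example with MLUtils.convertVectorColumnsToML), and `inferCategoricalFeaturesInfo` is a hypothetical helper name.

```scala
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// Hypothetical helper: let VectorIndexer decide which features are categorical
// (those with at most maxCategories distinct values) and report their arities.
def inferCategoricalFeaturesInfo(
    data: RDD[LabeledPoint],
    sqlContext: SQLContext,
    maxCategories: Int): Map[Int, Int] = {
  import sqlContext.implicits._
  val df = data.toDF() // columns: label, features

  val indexerModel = new VectorIndexer()
    .setInputCol("features")
    .setOutputCol("indexedFeatures")
    .setMaxCategories(maxCategories)
    .fit(df)

  // categoryMaps: feature index -> (original value -> category index)
  indexerModel.categoryMaps.map { case (idx, categories) => idx -> categories.size }
}
```

Two caveats show why the answer calls this not directly applicable. The decision is based purely on the number of distinct values, so a visit-count feature with few distinct values would be flagged as categorical (a false positive), and a declarative variable with many levels would be missed. In addition, the arities describe the re-indexed vectors that the fitted model's transform produces, not necessarily the raw values, so the result is best filtered against the known list of declarative indices before use.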



Source: https://stackoverflow.com/questions/33956720/spark-how-to-create-categoricalfeaturesinfo-for-decision-trees-from-labeledpoin
