Spark MLLib 2.0 Categorical Features in pipeline

Posted by 不问归期 on 2019-12-05 22:49:47

The DecisionTree algorithm takes a single maxBins value, which sets the number of bins it uses when discretizing features and choosing splits. The default value is 32. maxBins must be greater than or equal to the largest number of categories in any categorical feature. Since your feature 5 has 49 distinct values, you need to increase maxBins to 49 or more.
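The rule above can be sketched in plain Python, without Spark. The category counts here are hypothetical except for feature 5, which has 49 values as in the question, and required_max_bins is an illustrative helper, not part of Spark's API. Keeping the result at or above the default of 32 is a choice made for this sketch.

```python
# Hypothetical distinct-value counts per categorical feature index.
# Only feature 5 (49 values) comes from the question; the others are made up.
category_counts = {2: 4, 5: 49, 7: 12}

DEFAULT_MAX_BINS = 32  # default maxBins, as stated in the answer


def required_max_bins(counts, default=DEFAULT_MAX_BINS):
    """Smallest maxBins covering every categorical feature, never below the default."""
    largest = max(counts.values(), default=0)
    return max(default, largest)


min_max_bins = required_max_bins(category_counts)
print(min_max_bins)
```

Any maxBins value you put in a tuning grid should be at least this number, otherwise training fails with the same error.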

The DecisionTree algorithm has several hyperparameters, and tuning them to your data can improve accuracy. You can do this tuning with Spark's cross-validation framework (CrossValidator in pyspark.ml.tuning), which automatically tests a grid of hyperparameters and chooses the best combination.

Here is an example in Python that tests three maxBins values, [49, 52, 55], together with maxDepth and impurity:

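from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.tuning import ParamGridBuilder
dt = DecisionTreeClassifier(maxDepth=2, labelCol="indexed")  # maxDepth is overridden by the grid during tuning
paramGrid = ParamGridBuilder().addGrid(dt.maxBins, [49, 52, 55]).addGrid(dt.maxDepth, [4, 6, 8]).addGrid(dt.impurity, ["entropy", "gini"]).build()  # every param must belong to dt
pipeline = Pipeline(stages=[labelIndexer, typeIndexer, assembler, dt])  # the first three stages are the ones defined in the question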
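To get a feel for how much work this grid implies, the combinations can be enumerated in plain Python. This is only a sketch of the bookkeeping, not Spark code. It assumes 3 folds, which is CrossValidator's default numFolds, and it assumes one fit per combination per fold plus a final refit of the best combination on the full training data.

```python
from itertools import product

# Values copied from the grid in the example above.
grid = {
    "maxBins": [49, 52, 55],
    "maxDepth": [4, 6, 8],
    "impurity": ["entropy", "gini"],
}

# Every combination of one value per hyperparameter, as a list of dicts.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

num_folds = 3  # assumed; CrossValidator's default numFolds
# One fit per combination per fold, plus one final refit of the best combination.
total_fits = len(combinations) * num_folds + 1

print(len(combinations))  # 18
print(total_fits)         # 55
```

In Spark itself, paramGrid is passed as estimatorParamMaps to CrossValidator, together with the pipeline as estimator and an evaluator such as MulticlassClassificationEvaluator. Shrinking the grid or the number of folds reduces training time proportionally.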