How do I run the Spark decision tree with a categorical feature set using Scala?

Posted by 喜夏-厌秋 on 2019-12-04 02:57:19
lam

You can first transform the categories to numbers, then load the data as if all features were numerical.

When you build a decision tree model in Spark, you only need to tell Spark which features are categorical and what each one's arity is (the number of distinct categories of that feature). You do this by passing a Map[Int, Int] from feature index to arity.

For example, if your data looks like this:

1,a,add
2,b,more
1,c,thinking
3,a,to
1,c,me

you can first transform it into numerical format:

1,0,0
2,1,1
1,2,2
3,0,3
1,2,4
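
The transformation above could be done by hand, but a minimal sketch in plain Scala (no Spark needed) might look like the following. The names IndexEncodingSketch, buildIndex and encode are hypothetical helpers introduced only for illustration, and assigning indices in order of first appearance is an assumption that happens to reproduce the numbers shown above.

```scala
object IndexEncodingSketch {
  // The sample rows from the answer, still as raw strings.
  val sampleRows: Seq[Seq[String]] = Seq(
    Seq("1", "a", "add"),
    Seq("2", "b", "more"),
    Seq("1", "c", "thinking"),
    Seq("3", "a", "to"),
    Seq("1", "c", "me")
  )

  // Maps each distinct string to an index, in order of first appearance.
  def buildIndex(values: Seq[String]): Map[String, Int] =
    values.distinct.zipWithIndex.toMap

  // Returns the encoded rows plus the arity of every categorical column.
  def encode(rows: Seq[Seq[String]],
             categoricalColumns: Set[Int]): (Seq[Seq[Double]], Map[Int, Int]) = {
    // One string-to-index lookup table per categorical column.
    val indexes: Map[Int, Map[String, Int]] =
      categoricalColumns.map(c => c -> buildIndex(rows.map(_(c)))).toMap

    val encoded = rows.map { row =>
      row.zipWithIndex.map { case (value, col) =>
        indexes.get(col) match {
          case Some(index) => index(value).toDouble // categorical: use its index
          case None        => value.toDouble        // already numeric: parse it
        }
      }
    }

    // The arity of a column is the number of distinct categories seen in it.
    val arities = indexes.map { case (col, index) => col -> index.size }
    (encoded, arities)
  }

  def main(args: Array[String]): Unit = {
    val (encoded, arities) = encode(sampleRows, Set(1, 2))
    encoded.foreach(row => println(row.mkString(",")))
    println(arities)
  }
}
```

The second value returned by encode has the same shape as the feature-index-to-arity map that the decision tree expects.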

In that format you can load the data into Spark. Then, to tell Spark that the second and third columns are categorical, create a map:

val categoricalFeaturesInfo = Map[Int, Int]((1,3),(2,5))

The map says that the feature with index 1 has arity 3 and the feature with index 2 has arity 5. Both are treated as categorical when you build the decision tree model and pass the map to the training function:

val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
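
For context, here is a rough end-to-end sketch of that call using the RDD-based MLlib API. It assumes spark-mllib is on the classpath and that a local master is acceptable. The labels, numClasses = 2, impurity = "gini", maxDepth = 3 and maxBins = 32 are illustrative assumptions and do not come from the answer. Only the feature rows and the categorical map do.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel

object CategoricalTreeSketch {
  // Feature index -> arity, as in the answer: column 1 has 3 categories, column 2 has 5.
  val categoricalFeaturesInfo: Map[Int, Int] = Map(1 -> 3, 2 -> 5)

  def train(sc: SparkContext): DecisionTreeModel = {
    // Features are the numeric rows from the answer; the labels (first argument)
    // are hypothetical, since the answer does not say what is being predicted.
    val trainingData = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(2.0, 1.0, 1.0)),
      LabeledPoint(0.0, Vectors.dense(1.0, 2.0, 2.0)),
      LabeledPoint(1.0, Vectors.dense(3.0, 0.0, 3.0)),
      LabeledPoint(0.0, Vectors.dense(1.0, 2.0, 4.0))
    ))

    val numClasses = 2    // assumed: labels are 0.0 and 1.0
    val impurity = "gini" // assumed impurity measure
    val maxDepth = 3      // assumed
    val maxBins = 32      // must be at least the largest categorical arity (5 here)

    DecisionTree.trainClassifier(
      trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CategoricalTreeSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val model = train(sc)
      println(model.toDebugString)
      println(model.predict(Vectors.dense(1.0, 2.0, 4.0)))
    } finally {
      sc.stop()
    }
  }
}
```

Categorical values must be encoded as 0 until arity - 1 for each feature listed in the map, which is what the numeric format above provides.
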
dirceusemighini

Strings are not supported by LabeledPoint. Assuming your strings are categorical, one way to get them into a LabeledPoint is to split each string column into multiple columns.

So for example, if you have the following dataset:

id,String,Intvalue
1,"a",123
2,"b",456
3,"c",789
4,"a",887

Then you could split your string data, turning each distinct string value into a new column:

a -> 1,0,0
b -> 0,1,0
c -> 0,0,1

As you have 3 distinct string values, you convert the string column into 3 new columns, and each value is represented by a 1 in its own column and a 0 in the others.

Now your dataset will be

id,a,b,c,Intvalue
1,1,0,0,123
2,0,1,0,456
3,0,0,1,789
4,1,0,0,887

You can now convert these to Double values and use them in your LabeledPoint.
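
The column split described above is usually called one-hot encoding. A minimal plain-Scala sketch, assuming the four sample rows from this answer, could look like this. OneHotSketch, oneHot and encodeRows are hypothetical names used only for illustration, and sorting the categories alphabetically is an assumption made to get a stable column order.

```scala
object OneHotSketch {
  // The sample dataset from the answer: (id, String, Intvalue).
  val sampleRows: Seq[(Int, String, Int)] = Seq(
    (1, "a", 123),
    (2, "b", 456),
    (3, "c", 789),
    (4, "a", 887)
  )

  // One indicator per category: 1.0 where the value matches, 0.0 elsewhere.
  def oneHot(value: String, categories: Seq[String]): Seq[Double] =
    categories.map(c => if (c == value) 1.0 else 0.0)

  // Replaces the string column with its indicator columns and keeps Intvalue
  // as a Double, giving one feature row per input row (the id is dropped).
  def encodeRows(rows: Seq[(Int, String, Int)]): Seq[Seq[Double]] = {
    val categories = rows.map(_._2).distinct.sorted
    rows.map { case (_, s, n) => oneHot(s, categories) :+ n.toDouble }
  }

  def main(args: Array[String]): Unit =
    encodeRows(sampleRows).foreach(row => println(row.mkString(",")))
}
```

Each resulting Seq[Double] can be converted with toArray and wrapped in a dense vector to serve as the features of a LabeledPoint.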

Another way to convert your strings for a LabeledPoint is to build a list of distinct values for each column and replace each string with its index in that list. This is not recommended, because for this dataset the mapping would be

a = 0
b = 1
c = 2

But in this case, if the indices are read as ordinary numeric values, the algorithm will treat a as closer to b than to c, an ordering that the data does not justify.

You need to check the element type of array x. According to the error log, the items in array x are strings, which Spark does not support here. Spark Vectors can currently hold only Double values.
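
As a rough illustration of that check, the sketch below converts an array of strings to doubles in plain Scala and reports any entries that are not numeric. It assumes x is an Array[String], which is only inferred from the error described above. ParseDoublesSketch and toDoubles are hypothetical names.

```scala
import scala.util.Try

object ParseDoublesSketch {
  // Returns Right(doubles) if every entry parses, otherwise Left with the
  // offending entries so they can be inspected or encoded as categories.
  def toDoubles(x: Array[String]): Either[String, Array[Double]] = {
    val parsed = x.map(s => Try(s.trim.toDouble).toOption)
    val bad = x.zip(parsed).collect { case (s, None) => s }
    if (bad.nonEmpty) Left(s"non-numeric entries: ${bad.mkString(", ")}")
    else Right(parsed.collect { case Some(d) => d })
  }

  def main(args: Array[String]): Unit = {
    println(toDoubles(Array("1", "2.5", "3")).fold(identity, _.mkString(",")))
    println(toDoubles(Array("1", "a", "3")).fold(identity, _.mkString(",")))
  }
}
```

An Array[Double] obtained this way is the kind of input that a dense vector accepts. Entries reported as non-numeric are the ones that need one of the encodings from the earlier answers.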
