How do I run the Spark decision tree with a categorical feature set using Scala?

你的背包 2021-02-20 18:37

I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int,Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class to use it.
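For concreteness, here is a minimal sketch of what I think the call should look like (assuming the RDD-based spark.mllib API and a SparkContext sc in scope; the feature values, labels, and arities below are made up for illustration). I am not sure this is the right way to wire in categoricalFeaturesInfo:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.rdd.RDD

    // Made-up training data: a label plus three features,
    // where feature 0 is categorical with 3 possible values (0, 1, 2).
    val data: RDD[LabeledPoint] = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 123.0, 1.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 456.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(2.0, 789.0, 1.0))
    ))

    // Feature 0 is categorical with arity 3; features not listed are treated as continuous.
    val categoricalFeaturesInfo: Map[Int, Int] = Map(0 -> 3)

    val model = DecisionTree.trainClassifier(
      data,
      2,                       // numClasses
      categoricalFeaturesInfo,
      "gini",                  // impurity
      5,                       // maxDepth
      32)                      // maxBins (must be >= the largest categorical arity)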

3 Answers
  •  终归单人心
    2021-02-20 19:28

    LabeledPoint does not support String features. One way to get your data into a LabeledPoint is to split the string column into multiple columns, treating the strings as categorical values.

    So for example, if you have the following dataset:

    id,String,Intvalue
    1,"a",123
    2,"b",456
    3,"c",789
    4,"a",887
    

    Then you could split the string column, turning each distinct string value into its own column:

    a -> 1,0,0
    b -> 0,1,0
    c -> 0,0,1
    

    Since there are 3 distinct string values, the string column becomes 3 new columns, and each original value is represented by a 1 in exactly one of them.

    Now your dataset will be

    id,a,b,c,Intvalue
    1,1,0,0,123
    2,0,1,0,456
    3,0,0,1,789
    4,1,0,0,887
    

    You can now convert these values into Doubles and use them in your LabeledPoint.
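    For example, a minimal sketch of this encoding (assuming the spark.mllib LabeledPoint and Vectors classes; the rows are the toy dataset above and the label is just a placeholder):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Toy rows from the dataset above: (id, string, intValue)
    val rows = Seq((1, "a", 123), (2, "b", 456), (3, "c", 789), (4, "a", 887))

    // One-hot encode the string column into three binary features.
    val categories = Seq("a", "b", "c")
    val points: Seq[LabeledPoint] = rows.map { case (_, s, intValue) =>
      val oneHot = categories.map(c => if (c == s) 1.0 else 0.0)
      // The label (0.0) is a placeholder; in practice it comes from your data.
      LabeledPoint(0.0, Vectors.dense((oneHot :+ intValue.toDouble).toArray))
    }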

    Another way to convert your strings for a LabeledPoint is to build a list of the distinct values of each column and replace each string with its index in that list. This is not recommended, because with this dataset it would give

    a = 0
    b = 1
    c = 2
    

    But in this case the algorithm would treat a as closer to b than to c, an ordering that does not actually exist in the data.
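    For reference, a small sketch of that index mapping (plain Scala, values from the toy dataset above):

    // Build a distinct-value -> index map and encode each string as its index.
    val values = Seq("a", "b", "c", "a")
    val index: Map[String, Int] = values.distinct.zipWithIndex.toMap
    // index: Map(a -> 0, b -> 1, c -> 2)
    val encoded = values.map(v => index(v).toDouble)
    // encoded: Seq(0.0, 1.0, 2.0, 0.0) -- the tree would see these as ordered numbers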
