Making the features of test data the same as train data after feature selection in Spark

Submitted by 流过昼夜 on 2019-12-01 09:21:24

Question


I'm working in Scala. I have a question about ChiSqSelector: it seems to reduce the dimensionality successfully, but I can't tell which features were removed and which were kept. How can I find out which features were removed?

[WrappedArray(a, b, c),(5,[1,2,3],[1,1,1]),(2,[0],[1])]
[WrappedArray(b, d, e),(5,[0,2,4],[1,1,2]),(2,[1],[2])]
[WrappedArray(a, c, d),(5,[0,1,3],[1,1,1]),(2,[0],[1])]

PS: When I wanted to give the test data the same features as the feature-selected train data, I found that I don't know how to do that in Scala.


Answer 1:


If you use the MLlib version of ChiSqSelector, you can use selectedFeatures:

val mllibModel: org.apache.spark.mllib.feature.ChiSqSelectorModel = ???
val features: Array[Int] = mllibModel.selectedFeatures

Nevertheless, when you work with test data, it is better to reuse the selector trained on the train dataset than to select the features manually.

import org.apache.spark.rdd.RDD

val testData: RDD[org.apache.spark.mllib.linalg.Vector] = ???
mllibModel.transform(testData)

The same rules apply to the ML version. You can use selectedFeatures to extract the array of indices:

val mlModel: org.apache.spark.ml.feature.ChiSqSelectorModel = ???
val features: Array[Int] = mlModel.selectedFeatures

but it is still better to keep the model and reuse it on new data:

val testData: org.apache.spark.sql.DataFrame = ???
mlModel.transform(testData)
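As a rough end-to-end sketch of the "fit on train, reuse on test" advice, the following assumes Spark 2.x or later (SparkSession and org.apache.spark.ml.linalg). The labels, the test row, the column names "label", "features" and "selected", and the object name ReuseSelectorSketch are all hypothetical; only the three train vectors are taken from the question.

```scala
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

object ReuseSelectorSketch {
  // Returns the selected feature indices and the size of a transformed test vector.
  def run(spark: SparkSession): (Array[Int], Int) = {
    import spark.implicits._

    // Train vectors from the question; the labels are made up for illustration.
    val train = Seq(
      (1.0, Vectors.sparse(5, Array(1, 2, 3), Array(1.0, 1.0, 1.0))),
      (0.0, Vectors.sparse(5, Array(0, 2, 4), Array(1.0, 1.0, 2.0))),
      (1.0, Vectors.sparse(5, Array(0, 1, 3), Array(1.0, 1.0, 1.0)))
    ).toDF("label", "features")

    // Hypothetical test row with the same 5-dimensional feature space.
    val test = Seq(
      (0.0, Vectors.sparse(5, Array(0, 4), Array(1.0, 1.0)))
    ).toDF("label", "features")

    val selector = new ChiSqSelector()
      .setNumTopFeatures(2)
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setOutputCol("selected")

    // Fit on the train data only, then apply the same fitted model to the test data.
    val model = selector.fit(train)
    val transformedTest = model.transform(test)

    val selectedSize = transformedTest.select("selected").head().getAs[Vector](0).size
    (model.selectedFeatures, selectedSize)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("chisq-sketch").getOrCreate()
    val (indices, size) = run(spark)
    println(s"Selected feature indices: ${indices.mkString(", ")}")
    println(s"Size of the selected test vector: $size")
    spark.stop()
  }
}
```

Because the test data goes through the same fitted model, its "selected" column should keep exactly the indices chosen on the train data, so no manual index bookkeeping is needed.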

If you want a human-readable list of features, you can analyze the column metadata after the transformation, as shown in Tagging columns as Categorical in Spark.
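A different route to readable names, not the metadata approach mentioned above, is possible if the count vectors were produced by CountVectorizer. The question does not say so, so treat that as an assumption. In that case the selected indices can be looked up in the fitted model's vocabulary. The sketch below assumes Spark 2.x or later; the labels, column names and the object name SelectedTermsSketch are hypothetical.

```scala
import org.apache.spark.ml.feature.{ChiSqSelector, CountVectorizer}
import org.apache.spark.sql.SparkSession

object SelectedTermsSketch {
  // Returns the vocabulary terms that correspond to the selected feature indices.
  def selectedTerms(spark: SparkSession): Array[String] = {
    import spark.implicits._

    // Token lists from the question; the labels are made up for illustration.
    val train = Seq(
      (1.0, Seq("a", "b", "c")),
      (0.0, Seq("b", "d", "e")),
      (1.0, Seq("a", "c", "d"))
    ).toDF("label", "words")

    // Assumption: the feature vectors are term counts built by CountVectorizer.
    val cvModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .fit(train)

    val selectorModel = new ChiSqSelector()
      .setNumTopFeatures(2)
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setOutputCol("selected")
      .fit(cvModel.transform(train))

    // Each selected index points into the CountVectorizer vocabulary.
    selectorModel.selectedFeatures.map(i => cvModel.vocabulary(i))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("selected-terms").getOrCreate()
    println(s"Selected terms: ${selectedTerms(spark).mkString(", ")}")
    spark.stop()
  }
}
```

If the vectors come from another source, such as VectorAssembler, this vocabulary lookup does not apply.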



Source: https://stackoverflow.com/questions/35886979/making-the-features-of-test-data-same-as-train-data-after-featureselection-in-sp
