Pitfalls:
- Spark XGBoost is very sensitive to null values in a Spark DataFrame: if the DataFrame contains any nulls (null or "NaN"), XGBoost will throw an error.
- In Spark 2.4.4, after VectorAssembler transforms a DataFrame, rows that contain many zeros are stored as sparse vectors by default, which also makes XGBoost fail.
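A minimal sketch of a workaround for the first pitfall: drop or fill nulls and NaNs in the feature columns before training. The DataFrame and column names here are assumptions, not from the original post.

```scala
import org.apache.spark.sql.DataFrame

// Sketch: clean nulls/NaN before handing the DataFrame to XGBoost.
// `df` and `featureCols` are hypothetical placeholders.
def cleanForXgb(df: DataFrame, featureCols: Seq[String]): DataFrame = {
  // na.fill replaces both null and NaN in the listed numeric columns;
  // alternatively, df.na.drop(featureCols) discards such rows entirely.
  df.na.fill(0.0, featureCols)
}
```

Whether filling with 0.0 is acceptable depends on the feature; for some columns dropping the row or imputing a mean may be more appropriate.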
Sample code:
val schema = new StructType(Array(
  StructField("BIZ_DATE", StringType, true),
  StructField("SKU", StringType, true),
  StructField("WINDGUST", DoubleType, true),
  StructField("WINDSPEED", DoubleType, true)))

val predictDF = spark.read.schema(schema)
  .format("csv")
  .option("header", "true")
  .option("delimiter", ",")
  .load("/mnt/parquet/smaller.csv")

import scala.collection.mutable.ArrayBuffer
val featureColsBuffer = ArrayBuffer[String]()
for (i <- predictDF.columns) {
  if (i != "QTY" && i != "BIZ_DATE" && i != "SKU" && i != "STORE") {
    featureColsBuffer += i
  }
}
// Select the feature columns that will take part in training
val featureCols = featureColsBuffer.toArray
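To work around the second pitfall, the vectors produced by VectorAssembler can be forced into their dense representation before training. A sketch continuing from the code above; the output column names are assumptions:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features_sparse")

// Convert every row's feature vector to a dense vector so that
// XGBoost does not choke on the default sparse encoding.
val toDense = udf((v: Vector) => v.toDense)

val trainDF = assembler.transform(predictDF)
  .withColumn("features", toDense(col("features_sparse")))
```

The UDF-based conversion trades memory for safety: dense vectors are larger, but every row then has the same representation regardless of how many zeros it contains.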
Source: CSDN
Author: 爱知菜
Link: https://blog.csdn.net/rav009/article/details/103770493