Spark, ML, StringIndexer: handling unseen labels

夕颜 2020-12-08 05:12

My goal is to build a multiclass classifier.

I have built a pipeline for feature extraction and it includes as a first step a StringIndexer transformer to map each c

5 answers
  •  伪装坚强ぢ
    2020-12-08 05:52

    There is no nice way to do it, I'm afraid. You can either

    • filter out the test examples with unknown labels before applying StringIndexer,
    • or fit StringIndexer on the union of the train and test dataframes, so that all labels are guaranteed to be known,
    • or map the unknown labels in the test examples to a known label.

    Here is some sample code for the filtering and mapping options:

    import org.apache.spark.sql.functions.udf
    // needed for .map on Spark 2.x+; assumes a SparkSession named `spark` is in scope
    import spark.implicits._

    // get training labels from the original train dataframe
    val trainlabels = traindf.select(colname).distinct.map(_.getString(0)).collect  // Array[String]
    // or, alternatively, take them from a fitted StringIndexerModel
    // (use one definition or the other, not both):
    // val trainlabels = simodel.labels

    // define a UDF that tells whether a label was seen during training
    val filterudf = udf { label: String => trainlabels.contains(label) }

    // filter out the test examples with unseen labels
    val filteredTestdf = testdf.filter(filterudf(testdf(colname)))

    // or map every unseen label to some known value, say "a"
    val mapudf = udf { label: String => if (trainlabels.contains(label)) label else "a" }

    // add the mapped labels to testdf as a new column
    val transformedTestdf = testdf.withColumn("newcol", mapudf(testdf(colname)))
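
    The remaining option, fitting StringIndexer on the union of train and test, could look roughly like the sketch below. It assumes Spark 2.0 or later (for SparkSession and Dataset.union); the object name, the helper fitOnUnion, the "_index" output column suffix and the toy data are all made up for illustration.

    ```scala
    import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object UnionIndexerSketch {

      // Hypothetical helper: fit a StringIndexer on the label column of both
      // dataframes, so that every label in either one is known to the model.
      def fitOnUnion(train: DataFrame, test: DataFrame, colname: String): StringIndexerModel = {
        val allLabels = train.select(colname).union(test.select(colname))
        new StringIndexer()
          .setInputCol(colname)
          .setOutputCol(colname + "_index") // assumed naming, not from the question
          .fit(allLabels)
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("union-indexer-sketch")
          .getOrCreate()
        import spark.implicits._

        // Toy data: the label "c" appears only in the test set.
        val traindf = Seq("a", "b", "a").toDF("category")
        val testdf = Seq("a", "c").toDF("category")

        val simodel = fitOnUnion(traindf, testdf, "category")

        // Transforming the test set does not fail on "c", because the
        // model has seen it while fitting.
        simodel.transform(testdf).show()

        spark.stop()
      }
    }
    ```

    Keep in mind that with this approach the test set takes part in fitting: StringIndexer assigns indices by label frequency by default, so the test labels can change which index each label gets.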
    
