Pass array as a UDF parameter in Spark SQL

花落未央 2020-12-17 21:40

I'm trying to transform a DataFrame via a function that takes an array as a parameter. My code looks something like this:

    def getCategory(categories: Array[String], input: String): String = {
        categories(input.toInt)
    }

    val myArray = Array("a", "b", "c")

    val myCategories = udf(getCategory _)

    val df1 = df.withColumn("newCategory", myCategories(lit(myArray), col("myInput")))

But lit does not accept an Array, so this fails. How can I pass the array as a parameter to the UDF?
1 Answer

  • 2020-12-17 22:12

    Most likely not the prettiest solution, but you can try something like this:

    def getCategory(categories: Array[String]) = {
        udf((input: String) => categories(input.toInt))
    }
    
    df.withColumn("newCategory", getCategory(myArray)(col("myInput")))
    
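    For context, a minimal usage sketch of this closure-based approach; the sample rows and the spark session in scope are assumptions for illustration:

    import org.apache.spark.sql.functions.{col, udf}
    import spark.implicits._

    val myArray = Array("a", "b", "c")
    val df = Seq("0", "1", "2").toDF("myInput")

    // The closure captures the array on the driver; only `input` comes from the row.
    df.withColumn("newCategory", getCategory(myArray)(col("myInput"))).show()
    // +-------+-----------+
    // |myInput|newCategory|
    // +-------+-----------+
    // |      0|          a|
    // |      1|          b|
    // |      2|          c|
    // +-------+-----------+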

    You could also try an array of literals (note that Spark passes an array column to a Scala UDF as a Seq, not an Array):

    val getCategory = udf(
        (input: String, categories: Seq[String]) => categories(input.toInt))
    
    df.withColumn(
      "newCategory", getCategory($"myInput", array(myArray.map(lit(_)): _*)))
    

    On a side note, using a Map instead of an Array is probably a better idea:

    def mapCategory(categories: Map[String, String], default: String) = {
        udf((input: String) => categories.getOrElse(input, default))
    }
    
    val myMap = Map[String, String]("1" -> "a", "2" -> "b", "3" -> "c")
    
    df.withColumn("newCategory", mapCategory(myMap, "foo")(col("myInput")))
    
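    A quick sketch of the Map variant on made-up rows; any key missing from the map falls back to the default, so unlike the positional array lookup it cannot throw an out-of-bounds error:

    val df = Seq("1", "2", "9").toDF("myInput")  // "9" has no mapping

    df.withColumn("newCategory", mapCategory(myMap, "foo")(col("myInput"))).show()
    // +-------+-----------+
    // |myInput|newCategory|
    // +-------+-----------+
    // |      1|          a|
    // |      2|          b|
    // |      9|        foo|
    // +-------+-----------+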

    Since Spark 1.5.0 you can also use the array function directly; note again that the Scala function behind myCategories has to accept the array column as a Seq[String], as sketched below:

    import org.apache.spark.sql.functions.{array, lit}

    val colArray = array(myArray.map(lit(_)): _*)

    myCategories(colArray, col("myInput"))
    
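    A minimal sketch of a Seq-based myCategories that works with this call; the parameter order mirrors the question's getCategory:

    // Spark hands ArrayType columns to Scala UDFs as Seq (WrappedArray),
    // so Array[String] here would fail at runtime with a ClassCastException.
    val myCategories = udf(
        (categories: Seq[String], input: String) => categories(input.toInt))

    df.withColumn("newCategory", myCategories(colArray, col("myInput")))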

    See also Spark UDF with varargs
