I'm trying to transform a dataframe via a function that takes an array as a parameter. My code looks something like this:
def getCategory(categories:Array[String], input:String): String = { categories(input.toInt) } val myArray = Array("a", "b", "c") val myCategories =udf(getCategory _ ) val df = sqlContext.parquetFile("myfile.parquet) val df1 = df.withColumn("newCategory", myCategories(lit(myArray), col("myInput"))
However, lit doesn't like arrays and this script errors. I tried definining a new partially applied function and then the udf after that :
val newFunc = getCategory(myArray, _:String) val myCategories = udf(newFunc) val df1 = df.withColumn("newCategory", myCategories(col("myInput")))
This doesn't work either as I get a nullPointer exception and it appears myArray is not being recognized. Any ideas on how I pass an array as a parameter to a function with a dataframe?
On a separate note, any explanation as to why doing something simple like using a function on a dataframe is so complicated (define function, redefine it as UDF, etc, etc)?