Question
I have a DF that contains some SQL expressions (coalesce, case/when etc.).
I later try to map/flatMap this DF, where I get a Task not serializable
error due to the fields that contain the SQL expressions.
(Why I need to map/flatMap this DF is a separate question.)
When I save this DF to a Parquet file and load it afterwards, the error is gone and I can convert to RDD and do transformations no problem!
How is the DF different before saving and after loading? In some way, the SQL expressions must have been evaluated and made persistent. How can I achieve the same thing without saving/loading? (df.persist() did not do the trick.)
Here's test code:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import sqlContext.implicits._ // spark-shell: enables toDF and the $"..." syntax

val data = Seq(
  (1, "sku1", "EUR", 99.0, 89.0),
  (2, "sku2", "USD", 89.0, 79.0),
  (3, "sku3", "USD", 49.0, 39.9)
)
// Extra values for SKUs that represent bundles
val aditionalStuffForCertainSKUsMap = Map("sku1" -> List(10, 20, 30))
// Picks the price column that matches the row's currency
val listedPrice = coalesce(
  List("EUR", "USD").map(c => when($"currency" === c, col(c)).otherwise(lit(null))): _*)
val df = (sc.parallelize(data)
  .toDF("id", "sku", "currency", "EUR", "USD")
  .withColumn("price_in_given_currency", when($"currency" === "EUR", $"EUR" * 2).otherwise(1))
  // .withColumn("fails_price_in_given_currency", listedPrice)
)
df.show
df.write.mode(SaveMode.Overwrite).parquet("test_df")
The data contains a sku column, and some SKUs represent bundles, like sku1, for which I want to add some other fields to the DF. The map() only complains when I access this Map[String, List[Int]] inside it and the DF contains the fails_price_in_given_currency column; with only the price_in_given_currency column it works:
// If I load the df first, the map() works even when using `fails_price_in_given_currency`
// val df = sqlContext.read.parquet("test_df")
val out = df.map(d => {
  val key = d.getAs[String]("sku")
  aditionalStuffForCertainSKUsMap.getOrElse(key, None)
})
The error is thrown when I use fails_price_in_given_currency instead. If, however, I load df from the Parquet file before the map, it runs fine again!
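For context, one common cause of Task not serializable errors in Spark is a closure that refers to a field of its enclosing scope and therefore drags the whole scope, including any non-serializable values in it, into the task. The sketch below reproduces that effect with plain Java serialization and no Spark at all. The names Unserializable, Enclosing and ClosureCheck are hypothetical, and whether this mechanism is what actually happens with the DF above is an assumption, not something the test code confirms.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a value that cannot be serialized
// (assumed here to play the role of a Column expression).
class Unserializable

// Hypothetical enclosing scope holding both the lookup map and the
// non-serializable value, similar to vals defined side by side in a shell.
class Enclosing {
  val lookup: Map[String, List[Int]] = Map("sku1" -> List(10, 20, 30))
  val expr = new Unserializable

  // Reads the field `lookup`, so the closure captures `this`,
  // and with it the non-serializable `expr`.
  def capturesThis: String => List[Int] =
    key => lookup.getOrElse(key, Nil)

  // Copies the map into a local val first, so the closure
  // captures only the map itself.
  def capturesLocal: String => List[Int] = {
    val local = lookup
    key => local.getOrElse(key, Nil)
  }
}

object ClosureCheck {
  // Returns true if Java serialization of `obj` succeeds.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val e = new Enclosing
    println(s"closure over field: ${serializes(e.capturesThis)}")
    println(s"closure over local copy: ${serializes(e.capturesLocal)}")
  }
}
```

On Scala 2.12 or later, the first closure should fail to serialize and the second should succeed. If the same thing applies to the map() above, copying aditionalStuffForCertainSKUsMap into a local val right before the map() might be worth trying, but that is untested against this DF.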
Source: https://stackoverflow.com/questions/33707191/dataframe-state-before-save-and-after-load-whats-different