Dataframe state before save and after load - what's different?

Submitted by 巧了我就是萌 on 2019-12-14 03:53:17

Question


I have a DF that contains some SQL expressions (coalesce, case/when, etc.). Later I try to map/flatMap this DF and get a "Task not serializable" error, caused by the fields that contain the SQL expressions.

(Why I need to map/flatMap this DF is a separate question)

When I save this DF to a Parquet file and load it afterwards, the error is gone and I can convert to RDD and do transformations no problem!

How is the DF different before saving and after loading? Somehow the SQL expressions must have been evaluated and made persistent. How can I achieve the same thing without saving/loading? (df.persist() did not do the trick ;( )

Here's test code:

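// Assumes spark-shell, where sc and sqlContext are predefined.
// Outside the shell, also import sqlContext.implicits._ for $"..." and toDF.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._  // coalesce, when, col, lit

val data = Seq(
  (1, "sku1", "EUR", 99.0, 89.0),
  (2, "sku2", "USD", 89.0, 79.0),
  (3, "sku3", "USD", 49.0, 39.9)
)
val aditionalStuffForCertainSKUsMap = Map("sku1" -> List(10, 20, 30))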

val listedPrice = coalesce(
    List("EUR", "USD").map(c => when($"currency" === c, col(c)).otherwise(lit(null))): _*)

val df = (sc.parallelize(data)
    .toDF("id", "sku", "currency", "EUR", "USD")
    .withColumn("price_in_given_currency",  when($"currency" === "EUR", $"EUR"*2).otherwise(1))
 //   .withColumn("fails_price_in_given_currency", listedPrice)
)
df.show
df.write.mode(SaveMode.Overwrite).parquet("test_df")

Each row has a SKU, and some SKUs (such as sku1) represent bundles for which I want to add some other fields to the DF. The error only shows up when I access this Map[String, List[Int]] inside map() and the DF includes the fails_price_in_given_currency column; with only price_in_given_currency it works:

// If I load the df first, the map() works even when using `fails_price_in_given_currency`
//val df = sqlContext.read.parquet("test_df") 

val out = df.map(d => {
  val key = d.getAs[String]("sku")
  aditionalStuffForCertainSKUsMap.getOrElse(key, None)
})

The error is thrown once I enable the fails_price_in_given_currency column. However, if I load df from the Parquet file before the map, it runs again!
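For context, Spark reports "Task not serializable" when Java serialization of the function passed to map() fails, which happens whenever that function (directly or indirectly) holds a reference to an object that is not Serializable. The plain-Scala sketch below, which needs no Spark, only illustrates that general mechanism; it is not a confirmed diagnosis of the case above. `NonSerializableExpr`, `ClosureCheck`, `isJavaSerializable` and `demo` are hypothetical names made up for the illustration.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for any object that does not implement Serializable.
class NonSerializableExpr(val sql: String)

object ClosureCheck {
  // Try to Java-serialize an object; report whether that succeeded.
  def isJavaSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Returns (result for a closure capturing the non-serializable object,
  //          result for a closure capturing only a plain immutable Map).
  def demo(): (Boolean, Boolean) = {
    val expr = new NonSerializableExpr("coalesce(EUR, USD)")
    val extras = Map("sku1" -> List(10, 20, 30))

    // Captures `expr`, so serializing the closure must serialize `expr` too.
    val capturesExpr: String => String = sku => s"$sku:${expr.sql}"
    // Captures only `extras`, which is serializable.
    val capturesMapOnly: String => List[Int] = sku => extras.getOrElse(sku, Nil)

    (isJavaSerializable(capturesExpr), isJavaSerializable(capturesMapOnly))
  }

  def main(args: Array[String]): Unit =
    println(demo())
}
```

In this sketch only the first closure is expected to fail serialization. Whether something similar happens in the Spark session above depends on what the map() closure ends up referencing there, which the test code alone does not show.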

Source: https://stackoverflow.com/questions/33707191/dataframe-state-before-save-and-after-load-whats-different
