Spark DataFrame conversion to RDD takes a long time

Submitted by 佐手、 on 2020-12-06 01:37:43

Question


I'm reading a JSON file of a social network into Spark. From it I get a DataFrame, which I explode to get follower/account pairs. This part works perfectly. Later I want to convert the result to an RDD (for use with GraphX), but creating the RDD takes a very long time.

import org.apache.spark.graphx.VertexId // needed for the tuple type used below

val social_network = spark.read.json("my/path") // 200MB
val exploded_network = social_network.
    withColumn("follower", explode($"followers")).
    withColumn("id_follower", ($"follower").cast("long")).
    withColumn("id_account", ($"account").cast("long")).
    withColumn("relationship", lit(1)).
    select("id_follower", "id_account", "relationship")
val E1 = exploded_network.as[(VertexId, VertexId, Int)]
val E2 = E1.rdd

To see how long each step takes, I run a count after each one:

scala> exploded_network.count
res0: Long = 18205814 // 3 seconds

scala> E1.count
res1: Long = 18205814 // 3 seconds

scala> E2.count // 5.4 minutes
res2: Long = 18205814

Why does the RDD conversion take about 100x longer?


Answer 1:


In Spark, a DataFrame is a distributed collection of data organized into named columns (a tabular format). It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations. Because of this tabular format, it carries schema metadata that lets Spark apply a number of optimizations behind the scenes. The DataFrame API uses Spark's optimizations, such as the Catalyst optimizer and the Tungsten execution engine, to process the data more efficiently.

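An RDD, by contrast, does not infer the schema of the data set; the user has to supply any structure. RDDs also cannot take advantage of Spark's optimizers, such as the Catalyst optimizer and the Tungsten execution engine mentioned above. Converting a Dataset to an RDD also means every row has to be deserialized into JVM objects, which is work that a plain count on the DataFrame can skip.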
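One way to see the difference described above is to print the plan Spark builds for the DataFrame next to the lineage of the RDD derived from it. The following is a minimal sketch, not the asker's job: the tiny in-memory data set, the object name PlanInspectionSketch and the local master are assumptions made so the example runs on its own.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object PlanInspectionSketch {
  // Prints the DataFrame plans and the RDD lineage, then returns
  // (row count via the DataFrame, row count via the RDD).
  def run(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // Hypothetical stand-in for the JSON input: (account, followers)
    val df = Seq(("1", Seq("2", "3")), ("2", Seq("3")))
      .toDF("account", "followers")
      .withColumn("follower", explode(col("followers")))
      .select(
        col("follower").cast("long").as("id_follower"),
        col("account").cast("long").as("id_account"))

    // Logical and physical plans produced by Catalyst
    df.explain(true)

    // Converting to a typed RDD adds a step that turns rows into tuples
    val rdd = df.as[(Long, Long)].rdd
    println(rdd.toDebugString)

    (df.count(), rdd.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("plan-inspection-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

On a data set this small the timings say nothing; the point is only to compare what each call asks Spark to execute.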

So DataFrames generally perform much better than RDDs. In your case, if you have to use an RDD instead of a DataFrame, I would recommend caching the DataFrame before converting it to an RDD. That should improve the RDD's performance.

val E1 = exploded_network.cache()
val E2 = E1.rdd
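As a fuller sketch of the caching suggestion, the example below caches the data, runs one action to populate the cache (cache() is lazy), and only then converts to an RDD and builds a GraphX graph. It is a variation on the snippet above rather than the same code: it caches the typed Dataset so that .rdd yields tuples instead of Row objects. The in-memory sample data, the object name CachedEdgesSketch and the default vertex attribute 0 are assumptions for illustration, and whether caching helps in a given job depends on where the time is actually spent.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, lit}

object CachedEdgesSketch {
  // Returns (cached row count, edge count, vertex count).
  def run(spark: SparkSession): (Long, Long, Long) = {
    import spark.implicits._

    // Hypothetical stand-in for the JSON input: (account, followers)
    val socialNetwork = Seq(("1", Seq("2", "3")), ("2", Seq("3")))
      .toDF("account", "followers")

    val explodedNetwork = socialNetwork
      .withColumn("follower", explode(col("followers")))
      .select(
        col("follower").cast("long").as("id_follower"),
        col("account").cast("long").as("id_account"),
        lit(1).as("relationship"))

    // Mark the typed Dataset for caching, then run an action to fill the cache
    val typed = explodedNetwork.as[(VertexId, VertexId, Int)].cache()
    val rows = typed.count()

    // The conversion to RDD now reads from the cache instead of re-reading the input
    val edges = typed.rdd.map { case (src, dst, rel) => Edge(src, dst, rel) }
    val graph = Graph.fromEdges(edges, defaultValue = 0)

    (rows, graph.edges.count(), graph.vertices.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cached-edges-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```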

Hope this helps.



Source: https://stackoverflow.com/questions/42906387/spark-dataframe-conversion-to-rdd-takes-a-long-time
