Question
I'm reading a JSON file of a social network into Spark. I get a DataFrame from it, which I explode to get pairs. This process works perfectly. Later I want to convert this to an RDD (for use with GraphX), but the RDD creation takes a very long time.
val social_network = spark.read.json("my/path") // 200MB
val exploded_network = social_network
  .withColumn("follower", explode($"followers"))
  .withColumn("id_follower", $"follower".cast("long"))
  .withColumn("id_account", $"account".cast("long"))
  .withColumn("relationship", lit(1))
  .select("id_follower", "id_account", "relationship")
val E1 = exploded_network.as[(VertexId, VertexId, Int)]
val E2 = E1.rdd
To check how the process runs, I count the rows at each step:
scala> exploded_network.count
res0: Long = 18205814 // 3 seconds
scala> E1.count
res1: Long = 18205814 // 3 seconds
scala> E2.count // 5.4 minutes
res2: Long = 18205814
Why does the conversion to an RDD take ~100x longer?
Answer 1:
In Spark, a DataFrame is a distributed collection of data organized into named columns (a tabular format). It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations. Because of its tabular format, it carries metadata that allows Spark to run a number of optimizations in the background. The DataFrame API uses Spark's advanced optimizations, such as the Tungsten execution engine and the Catalyst optimizer, to process the data more efficiently.
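You can actually see Catalyst at work by asking Spark for the query plan. A minimal sketch, assuming the `exploded_network` DataFrame from the question:

```scala
// Print the parsed, analyzed, optimized, and physical plans.
// Catalyst can, for example, push the column pruning from select()
// down to the JSON scan so only the needed fields are read.
exploded_network.explain(true)
```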
An RDD, by contrast, does not infer the schema of the data set; the user has to provide the schema. RDDs also cannot take advantage of Spark's optimizers such as the Catalyst optimizer and the Tungsten execution engine (as mentioned above).
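Going the other way (RDD to DataFrame) makes this explicit: you must hand Spark the schema yourself. A minimal sketch, assuming a `SparkSession` named `spark` (the sample edge values are made up for illustration):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// To Spark, an RDD is just opaque JVM objects; it knows nothing
// about columns or types until we tell it.
val edges = spark.sparkContext.parallelize(
  Seq(Row(1L, 2L, 1), Row(3L, 2L, 1)))

// The schema must be supplied by hand to get a DataFrame back.
val schema = StructType(Seq(
  StructField("id_follower", LongType),
  StructField("id_account", LongType),
  StructField("relationship", IntegerType)))

val edgesDF = spark.createDataFrame(edges, schema)
```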
So DataFrames have much better performance than RDDs. In your case, if you have to use an RDD instead of a DataFrame, I would recommend caching the DataFrame before converting it to an RDD. That should improve your RDD performance.
val E1 = exploded_network.cache()
val E2 = E1.rdd
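Since the goal is GraphX, here is a sketch of how the cached edge data might then be used, assuming the `E1` name above; the use of `Graph.fromEdgeTuples` is illustrative, not from the question:

```scala
import org.apache.spark.graphx._

// Materialize the cache once so the JSON is not re-parsed
// on every subsequent action.
E1.count()

// Map each Row to a (srcId, dstId) tuple and build the graph;
// fromEdgeTuples assigns the given default attribute (here 1)
// to every edge, matching the "relationship" column.
val edgeTuples = E1.rdd.map(r => (r.getLong(0), r.getLong(1)))
val graph = Graph.fromEdgeTuples(edgeTuples, defaultValue = 1)
```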
Hope this helps.
Source: https://stackoverflow.com/questions/42906387/spark-dataframe-conversion-to-rdd-takes-a-long-time