Difference between DataFrame, Dataset, and RDD in Spark

慢半拍i 2020-11-22 15:53

I'm just wondering what the difference is between an RDD and a DataFrame in Spark (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row])?

鱼传尺愫 (original poster) 2020-11-22 16:22

A DataFrame is defined well by a Google search for "DataFrame definition":

A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.

So, a DataFrame carries additional metadata because of its tabular format, which allows Spark to run certain optimizations on the finalized query.

An RDD, on the other hand, is merely a Resilient Distributed Dataset: more of a black box of data that cannot be optimized, because the operations that can be performed against it are not as constrained.

However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.

In general, it is recommended to use a DataFrame where possible because of the built-in query optimization.
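To make the two conversions mentioned above concrete, here is a minimal Scala sketch. It assumes Spark 2.x or later running in local mode; the `Person` case class, the sample rows, and the app name are made up purely for illustration. Note that in Scala, `toDF` only becomes available on an RDD after importing `spark.implicits._`.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object RddDataFrameSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; master and app name are assumptions.
    val spark = SparkSession.builder()
      .appName("rdd-vs-dataframe-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // brings toDF, as[T], and the $"col" syntax into scope

    // An RDD of case-class instances: to Spark these are opaque JVM objects.
    val peopleRdd: RDD[Person] = spark.sparkContext.parallelize(
      Seq(Person("Ann", 34), Person("Bob", 19)))

    // RDD -> DataFrame: the schema (name, age) is derived from the case class.
    val peopleDf: DataFrame = peopleRdd.toDF()

    // Column-based operations are planned by Spark's query optimizer.
    val adultsDf: DataFrame = peopleDf.filter($"age" >= 21).select("name")
    adultsDf.explain() // prints the physical plan chosen for this query

    // DataFrame -> RDD: the result is an RDD of generic Row objects.
    val rowsRdd: RDD[Row] = adultsDf.rdd
    rowsRdd.collect().foreach(println)

    // A DataFrame is Dataset[Row]; as[Person] gives a typed Dataset instead.
    val peopleDs: Dataset[Person] = peopleDf.as[Person]
    peopleDs.show()

    spark.stop()
  }
}
```

The `filter` on the DataFrame names a column, so Spark can inspect the expression when planning the query. The equivalent `peopleRdd.filter(_.age >= 21)` passes an arbitrary Scala function that Spark can only execute, not analyze, which is the "black box" point made above.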