Difference between DataFrame, Dataset, and RDD in Spark

慢半拍i 2020-11-22 15:53

I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row])?

15 Answers
  •  礼貌的吻别
    2020-11-22 16:14

    Most of the answers are correct; I only want to add one point here.

    In Spark 2.0, the two APIs (DataFrame and Dataset) were unified into a single API.

    "Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface."

    Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or transmission over the network.
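
    A minimal sketch of what the Encoder gives you (the Person case class and the local[*] master are placeholders I've chosen for this example):

        import org.apache.spark.sql.{Encoders, SparkSession}

        // Hypothetical case class used only for illustration
        case class Person(name: String, age: Int)

        object EncoderExample {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder().appName("encoder-example").master("local[*]").getOrCreate()
            import spark.implicits._  // supplies Encoders for case classes and common types

            // The Encoder, not Java/Kryo serialization, describes how Person is mapped
            // to and from Spark's internal binary (Tungsten) row format.
            val ds = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()
            ds.filter(_.age > 26).show()

            // An explicit encoder can also be requested directly:
            println(Encoders.product[Person].schema)

            spark.stop()
          }
        }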

    Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.
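
    A minimal sketch of the reflection-based path, assuming a hypothetical Employee case class whose fields define the schema:

        import org.apache.spark.sql.SparkSession

        // Hypothetical case class; its fields (name, age) become the inferred schema
        case class Employee(name: String, age: Int)

        object ReflectionSchemaExample {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder().appName("reflection-schema").master("local[*]").getOrCreate()
            import spark.implicits._

            // An existing RDD of case-class objects
            val rdd = spark.sparkContext.parallelize(Seq(Employee("Ann", 30), Employee("Bob", 25)))

            // Reflection on Employee supplies the column names and types
            val ds = rdd.toDS()  // Dataset[Employee]
            val df = rdd.toDF()  // DataFrame, i.e. Dataset[Row], with columns name and age

            ds.printSchema()
            df.show()
            spark.stop()
          }
        }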

    The second method for creating Datasets is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct Datasets when the columns and their types are not known until runtime.
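
    And a sketch of the programmatic path, where the Row layout and the StructType are only assembled at runtime (the column names here are made up for the example):

        import org.apache.spark.sql.{Row, SparkSession}
        import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

        object ProgrammaticSchemaExample {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder().appName("programmatic-schema").master("local[*]").getOrCreate()

            // Existing RDD of generic Rows; pretend the column layout is only known at runtime
            val rowRdd = spark.sparkContext.parallelize(Seq(Row("Ann", 30), Row("Bob", 25)))

            // Build the schema programmatically and apply it to the RDD
            val schema = StructType(Seq(
              StructField("name", StringType, nullable = true),
              StructField("age", IntegerType, nullable = true)
            ))
            val df = spark.createDataFrame(rowRdd, schema)

            df.printSchema()
            df.show()
            spark.stop()
          }
        }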

    Here you can find the RDD-to-DataFrame conversion answer:

    How to convert rdd object to dataframe in spark
