Difference between DataFrame, Dataset, and RDD in Spark

慢半拍i 2020-11-22 15:53

I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row])?

15 answers

半阙折子戏 (original poster) 2020-11-22 16:09

[Figure: RDD, DataFrame, and Dataset in one picture]

RDD

An RDD is a fault-tolerant collection of elements that can be operated on in parallel.

DataFrame

A DataFrame is a Dataset organised into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimisations under the hood.

Dataset

A Dataset is a distributed collection of data. It is an interface added in Spark 1.6 that combines the benefits of RDDs (strong typing, the ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine.

Note: in Scala and Java, a Dataset of Rows (Dataset[Row]) is often referred to as a DataFrame.

[Link: a comparison of all three with a code snippet]

Q: Can you convert one to the other, such as RDD to DataFrame or vice versa?

Yes, both are possible.

1. RDD to DataFrame with .toDF()

Without an explicit schema, createDataFrame needs an RDD of tuples or case classes; an RDD[Row] has to be paired with a StructType schema. The example uses tuples and assumes the spark and sc values that spark-shell provides.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    // An RDD of tuples, so Spark can infer the column types.
    val tuplesRdd: RDD[(String, Double, Double)] = sc.parallelize(
      Seq(
        ("first", 2.0, 7.0),
        ("second", 3.5, 2.5),
        ("third", 7.0, 5.9)
      )
    )

    // toDF renames the default columns _1, _2, _3.
    val df = spark.createDataFrame(tuplesRdd).toDF("id", "val1", "val2")

    df.show()
    +------+----+----+
    |    id|val1|val2|
    +------+----+----+
    | first| 2.0| 7.0|
    |second| 3.5| 2.5|
    | third| 7.0| 5.9|
    +------+----+----+

More ways: see "Convert an RDD object to Dataframe in Spark".

2. DataFrame/Dataset to RDD with .rdd

In Scala, rdd is a member without a parameter list, so it is written df.rdd; Java code calls df.rdd().

    val rowsRdd: RDD[Row] = df.rdd // DataFrame to RDD
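The distinction between the three abstractions described above is easier to see side by side. The following is a minimal sketch, not taken from the original answer: the `Score` case class, the variable names, and the local `SparkSession` settings are all assumptions made for illustration. It is written script-style, so it can be pasted into spark-shell or run as a Scala script with Spark on the classpath.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type, introduced only for this illustration.
case class Score(id: String, val1: Double, val2: Double)

// Local session for experimenting; in spark-shell this returns the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rdd-df-ds-sketch")
  .getOrCreate()
import spark.implicits._

// RDD: a plain distributed collection; Spark SQL knows nothing about its fields.
val scoresRdd: RDD[Score] = spark.sparkContext.parallelize(Seq(
  Score("first", 2.0, 7.0),
  Score("second", 3.5, 2.5),
  Score("third", 7.0, 5.9)
))

// RDD -> Dataset: statically typed and executed through Spark SQL's engine.
val scoresDs: Dataset[Score] = scoresRdd.toDS()

// Dataset -> DataFrame: the same data viewed as Rows with named columns.
val scoresDf: DataFrame = scoresDs.toDF()

// DataFrame -> Dataset: recover the static type by naming the case class.
val typedAgain: Dataset[Score] = scoresDf.as[Score]

// Typed lambda on a Dataset: field access is checked by the compiler.
val highDs: Dataset[Score] = scoresDs.filter(s => s.val1 > 3.0)

// The same filter on a DataFrame: the column name is only resolved at runtime.
val highDf: DataFrame = scoresDf.filter($"val1" > 3.0)

// Back to RDDs: a Dataset[Score] yields RDD[Score], a DataFrame yields RDD[Row].
val typedRdd: RDD[Score] = highDs.rdd
val rowRdd: RDD[Row] = highDf.rdd
```

The main design point is where errors surface: a misspelled field in the `Dataset` lambda fails at compile time, while a misspelled column name in the `DataFrame` filter fails only when Spark analyses the query.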
