Difference between DataFrame, Dataset, and RDD in Spark

后端 未结 15 1444
慢半拍i
慢半拍i 2020-11-22 15:53

I\'m just wondering what is the difference between an RDD and DataFrame (Spark 2.0.0 DataFrame is a mere type alias for Dataset[Row]

15条回答
  •  半阙折子戏
    2020-11-22 16:09

    All(RDD, DataFrame and DataSet) in one picture.

    RDD vs DataFrame vs DataSet

    image credits

    RDD

    RDD is a fault-tolerant collection of elements that can be operated on in parallel.

    DataFrame

    DataFrame is a Dataset organised into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimisations under the hood.

    Dataset

    Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.


    Note:

    Dataset of Rows (Dataset[Row]) in Scala/Java will often refer as DataFrames.


    Nice comparison of all of them with a code snippet.

    RDD vs DataFrame vs DataSet with code

    source


    Q: Can you convert one to the other like RDD to DataFrame or vice-versa?

    Yes, both are possible

    1. RDD to DataFrame with .toDF()

    val rowsRdd: RDD[Row] = sc.parallelize(
      Seq(
        Row("first", 2.0, 7.0),
        Row("second", 3.5, 2.5),
        Row("third", 7.0, 5.9)
      )
    )
    
    val df = spark.createDataFrame(rowsRdd).toDF("id", "val1", "val2")
    
    df.show()
    +------+----+----+
    |    id|val1|val2|
    +------+----+----+
    | first| 2.0| 7.0|
    |second| 3.5| 2.5|
    | third| 7.0| 5.9|
    +------+----+----+
    

    more ways: Convert an RDD object to Dataframe in Spark

    2. DataFrame/DataSet to RDD with .rdd() method

    val rowsRdd: RDD[Row] = df.rdd() // DataFrame to RDD
    

提交回复
热议问题