Difference between DataFrame, Dataset, and RDD in Spark

慢半拍i 2020-11-22 15:53

I'm just wondering what the difference is between an RDD and a DataFrame in Spark. (As of Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row].)

15 answers
  •  礼貌的吻别
    2020-11-22 16:11

    A DataFrame is an RDD of Row objects, each representing a record. A DataFrame also knows the schema (i.e., data fields) of its rows. While DataFrames look like regular RDDs, internally they store data in a more efficient manner, taking advantage of their schema. In addition, they provide new operations not available on RDDs, such as the ability to run SQL queries. DataFrames can be created from external data sources, from the results of queries, or from regular RDDs.

    Reference: Zaharia M., et al. Learning Spark (O'Reilly, 2015)
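
To make the role of the schema concrete, here is a toy model in plain Python. It is only a sketch: `ToyRDD` and `ToyDataFrame` are hypothetical names invented for illustration and are not Spark classes, and the sketch leaves out everything Spark actually does with a schema (query optimization, compact storage, SQL). It only shows that an RDD-style collection treats records as opaque objects, while a DataFrame-style collection can address columns by name because it carries a schema.

```python
# Toy model only: these classes are hypothetical and are NOT the Spark API.


class ToyRDD:
    """A collection of opaque records; nothing is known about their fields."""

    def __init__(self, records):
        self.records = list(records)

    def map(self, fn):
        return ToyRDD(fn(r) for r in self.records)

    def filter(self, pred):
        return ToyRDD(r for r in self.records if pred(r))

    def collect(self):
        return list(self.records)


class ToyDataFrame:
    """Rows plus a schema (ordered column names), so columns have names."""

    def __init__(self, rows, schema):
        self.schema = list(schema)
        self.rows = [tuple(r) for r in rows]
        for row in self.rows:
            if len(row) != len(self.schema):
                raise ValueError("row does not match schema: %r" % (row,))

    @classmethod
    def from_rdd(cls, rdd, schema):
        # Attaching a schema is what turns opaque records into named columns.
        return cls(rdd.collect(), schema)

    def select(self, *cols):
        idx = [self.schema.index(c) for c in cols]
        return ToyDataFrame([tuple(r[i] for i in idx) for r in self.rows], cols)

    def where(self, col, pred):
        i = self.schema.index(col)
        return ToyDataFrame([r for r in self.rows if pred(r[i])], self.schema)

    def to_rdd(self):
        # Dropping the schema gives back a plain collection of records.
        return ToyRDD(self.rows)


people = ToyRDD([("Ann", 34), ("Bob", 19), ("Cy", 52)])

# RDD style: the caller must remember that position 1 is the age.
adults_rdd = people.filter(lambda p: p[1] >= 21).map(lambda p: p[0])

# DataFrame style: the schema lets the same query refer to columns by name.
df = ToyDataFrame.from_rdd(people, ["name", "age"])
adults_df = df.where("age", lambda age: age >= 21).select("name")
```

Both queries pick out the same people. The difference is that the second one is expressed in terms of column names, which is the kind of information a real engine can inspect and exploit.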
