Difference between DataFrame, Dataset, and RDD in Spark

后端 未结 15 1416
慢半拍i
慢半拍i 2020-11-22 15:53

I\'m just wondering what is the difference between an RDD and DataFrame (Spark 2.0.0 DataFrame is a mere type alias for Dataset[Row]

15条回答
  •  长发绾君心
    2020-11-22 16:12

    A DataFrame is an RDD that has a schema. You can think of it as a relational database table, in that each column has a name and a known type. The power of DataFrames comes from the fact that, when you create a DataFrame from a structured dataset (Json, Parquet..), Spark is able to infer a schema by making a pass over the entire (Json, Parquet..) dataset that's being loaded. Then, when calculating the execution plan, Spark, can use the schema and do substantially better computation optimizations. Note that DataFrame was called SchemaRDD before Spark v1.3.0

提交回复
热议问题