Difference between DataFrame, Dataset, and RDD in Spark

慢半拍i 2020-11-22 15:53

I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row])?

15 Answers
  •  礼貌的吻别
    2020-11-22 16:17

    Simply put, the RDD is Spark's core component, while the DataFrame is an API introduced in Spark 1.3.

    RDD

    An RDD is a collection of data partitions. RDDs have a few key properties:

    • Immutable
    • Fault tolerant
    • Distributed
    • Lazily evaluated, and more

    An RDD can hold structured or unstructured data.
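
    As a minimal sketch of these properties (the local SparkContext setup and the example data below are assumptions for illustration, not part of the original answer), each transformation returns a new RDD and the lineage lets Spark recompute lost partitions:

    import org.apache.spark.{SparkConf, SparkContext}

    // Assumed local setup, for illustration only
    val conf = new SparkConf().setAppName("RddSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Distribute a collection across 2 partitions
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5), 2)

    // Transformations never modify the source RDD; they return a new one (immutability)
    val doubled = numbers.map(_ * 2)

    // Actions trigger the computation; lineage allows lost partitions to be rebuilt (fault tolerance)
    println(doubled.collect().mkString(", "))   // 2, 4, 6, 8, 10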

    DataFrame

    DataFrame is an API available in Scala, Java, Python, and R for processing structured and semi-structured data. A DataFrame is a distributed collection of data organized into named columns. Because the schema is known, Spark can optimize DataFrame operations more easily than operations on raw RDDs, and the same API lets you process JSON data, Parquet data, and Hive tables.

    // The raw JSON lines, read as an RDD[String]
    val sampleRDD = sc.textFile("hdfs://localhost:9000/jsondata.json")

    // The same data parsed into a DataFrame with named columns and a schema
    val sample_DF = sqlContext.read.json("hdfs://localhost:9000/jsondata.json")
    

    Here sample_DF is a DataFrame, while sampleRDD holds the same raw data as an RDD.
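
    As a small follow-up sketch (the column names `name` and `age` are assumed to exist in the JSON file, purely for illustration), the named columns are what let you write queries that Spark's Catalyst optimizer can plan for you:

    // Select and filter by column name; the query plan is optimized by Catalyst
    sample_DF.select("name", "age")
             .filter(sample_DF("age") > 21)
             .show()

    // Or register the DataFrame as a temporary table and query it with SQL
    sample_DF.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 21").show()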
