I\'m just wondering what is the difference between an RDD and DataFrame (Spark 2.0.0 DataFrame is a mere type alias for Dataset[Row]

image credits
RDD
RDDis a fault-tolerant collection of elements that can be operated on in parallel.
DataFrame
DataFrameis a Dataset organised into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimisations under the hood.
Dataset
Datasetis a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
Note:
Dataset of Rows (
Dataset[Row]) in Scala/Java will often refer as DataFrames.
Nice comparison of all of them with a code snippet.
source
Q: Can you convert one to the other like RDD to DataFrame or vice-versa?
1. RDD to DataFrame with .toDF()
val rowsRdd: RDD[Row] = sc.parallelize(
Seq(
Row("first", 2.0, 7.0),
Row("second", 3.5, 2.5),
Row("third", 7.0, 5.9)
)
)
val df = spark.createDataFrame(rowsRdd).toDF("id", "val1", "val2")
df.show()
+------+----+----+
| id|val1|val2|
+------+----+----+
| first| 2.0| 7.0|
|second| 3.5| 2.5|
| third| 7.0| 5.9|
+------+----+----+
more ways: Convert an RDD object to Dataframe in Spark
2. DataFrame/DataSet to RDD with .rdd() method
val rowsRdd: RDD[Row] = df.rdd() // DataFrame to RDD