I'm still struggling to understand the full power of the recently introduced Spark Datasets.
Are there best practices of when to use RDDs and when to use Datasets?
DataSet
1) It is a structured API provided by Spark for working with table-like data, so you can do analysis and manipulation much as you would with tables in a database.
2) It is closely related to DataFrame: in Spark 2.x a DataFrame is simply a `Dataset[Row]`, so the DataFrame API is an untyped special case of the Dataset API. If you check the link you will find the many functions and methods supported by Dataset: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
3) It is a high-level API, so Spark can optimize your queries before executing them.
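A minimal sketch of point 1), assuming a local-mode `SparkSession`; the case class `Person` and the method `adultNames` are made-up names for illustration. It shows the typed, compile-time-checked style the Dataset API gives you:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type for the example.
case class Person(name: String, age: Int)

object DatasetExample {
  // Returns the names of people aged 21 or over, using typed Dataset operations.
  def adultNames(spark: SparkSession, people: Seq[Person]): Array[String] = {
    import spark.implicits._
    val ds: Dataset[Person] = people.toDS()
    // Field access (_.age) is checked at compile time, unlike string column names.
    ds.filter(_.age >= 21).map(_.name).collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-example")
      .master("local[*]") // local mode for illustration only
      .getOrCreate()
    println(adultNames(spark, Seq(Person("Ann", 34), Person("Bob", 19))).mkString(", "))
    spark.stop()
  }
}
```

Because the lambdas operate on `Person` objects, a typo like `_.agee` fails at compile time rather than at runtime, which is the main practical difference from the untyped DataFrame API.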
RDD
1) RDD stands for Resilient Distributed Dataset.
2) It is the core, low-level API of Spark.
3) Whenever you work with a DataFrame or Dataset, the operations are ultimately compiled down to this low-level API, i.e. RDDs.
4) RDDs are useful when your requirements are exceptional and the manipulation you need cannot be expressed with the DataFrame or Dataset APIs.
5) They are also the right choice when you need to do custom shared-variable manipulation (e.g. accumulators or broadcast variables).
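As a sketch of points 4) and 5), here is one use of a shared variable: an accumulator that counts malformed records while an RDD job parses them, a side channel that is awkward to express as a DataFrame column. This assumes a local `SparkSession`; the object `RddExample` and the method `parseWithBadCount` are hypothetical names:

```scala
import org.apache.spark.sql.SparkSession

object RddExample {
  // Parses integer lines with the low-level RDD API; bad records are tallied
  // in an accumulator, a shared variable workers add to and the driver reads.
  def parseWithBadCount(spark: SparkSession, lines: Seq[String]): (Array[Int], Long) = {
    val sc = spark.sparkContext
    val badRecords = sc.longAccumulator("badRecords")
    val parsed = sc.parallelize(lines).flatMap { line =>
      try Some(line.trim.toInt)
      catch { case _: NumberFormatException => badRecords.add(1); None }
    }
    // collect() is an action: it triggers the job, so the accumulator is populated.
    val values = parsed.collect()
    (values, badRecords.value)
  }
}
```

For example, `parseWithBadCount(spark, Seq("1", "x", "3"))` yields the values 1 and 3 plus a bad-record count of 1. Note that accumulator updates inside transformations may be re-applied if a task is retried, so counts like this are best treated as approximate diagnostics.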