When to use Spark DataFrame/Dataset API and when to use plain RDD?

Submitted by 天涯浪子 on 2019-12-01 16:42:31

From the Databricks blog article *A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets*:

When to use RDDs?

Consider these scenarios or common use cases for using RDDs when:

  • you want low-level transformations, actions, and control over your dataset;
  • your data is unstructured, such as media streams or streams of text;
  • you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
  • you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column;
  • and you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
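As a minimal sketch of the first few points, here is the kind of low-level, schema-free processing of unstructured text that the RDD API is suited for. The input path `logs.txt` and the app name are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object RddWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-word-count") // hypothetical app name
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Unstructured text: no schema imposed, manipulated purely
    // with functional constructs (flatMap/map/reduceByKey).
    val lines = sc.textFile("logs.txt") // hypothetical input path

    val wordCounts = lines
      .flatMap(_.split("\\s+"))   // low-level transformation
      .map(word => (word, 1))
      .reduceByKey(_ + _)         // explicit control over the aggregation

    wordCounts.take(10).foreach(println) // action
    spark.stop()
  }
}
```

Note that every lambda here is opaque to Spark: you get full control, but none of the Catalyst optimizations described below.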

In *High Performance Spark*'s Chapter 3, "DataFrames, Datasets, and Spark SQL", you can see the performance gains the DataFrame/Dataset API can deliver compared to RDDs.
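To illustrate where those gains come from, here is the same per-key average written both ways; this is a sketch with made-up data, not a benchmark. In the RDD version the lambdas are opaque to Spark, while the DataFrame/Dataset version is declarative, so Catalyst can optimize the plan and Tungsten can use a compact binary row format:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Case class defined at top level so Spark can derive an encoder for it.
case class Sale(city: String, amount: Double)

object RddVsDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-vs-df") // hypothetical app name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val data = Seq(Sale("NY", 10.0), Sale("SF", 20.0), Sale("NY", 5.0))

    // RDD version: average amount per city, built by hand.
    val rddAvg = spark.sparkContext.parallelize(data)
      .map(s => (s.city, (s.amount, 1)))
      .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
      .mapValues { case (sum, cnt) => sum / cnt }

    // DataFrame/Dataset version: same result, but the optimizer sees
    // the whole query and can rearrange and compile it.
    val dfAvg = data.toDS()
      .groupBy($"city")
      .agg(avg($"amount"))

    rddAvg.collect().foreach(println)
    dfAvg.show()
    spark.stop()
  }
}
```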

And in the Databricks article mentioned above, you can also see that DataFrames use memory more efficiently than RDDs.
