Which is more efficient: DataFrame, RDD, or HiveQL?


Question


I am a newbie to Apache Spark.

My job is to read two CSV files, select some specific columns from them, merge them, aggregate the result, and write it out as a single CSV file.

For example,

CSV1

name,age,department_id

CSV2

department_id,department_name,location

I want to get a third CSV file with

name,age,department_name

I am loading both CSVs into DataFrames, and I can then produce the third DataFrame using the methods a DataFrame provides: join, select, filter, and drop.
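A minimal sketch of the DataFrame version in Scala (assuming Spark 2.x or later with the built-in CSV reader, and hypothetical file names people.csv and departments.csv):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("csv-join").getOrCreate()

    // Read both CSVs with a header row; all columns arrive as strings.
    val people = spark.read.option("header", "true").csv("people.csv")            // name, age, department_id
    val departments = spark.read.option("header", "true").csv("departments.csv")  // department_id, department_name, location

    // Join on the shared key and project only the columns we need.
    val result = people
      .join(departments, Seq("department_id"))
      .select("name", "age", "department_name")

    // Spark writes a directory of part files; coalesce(1) forces one output file.
    result.coalesce(1).write.option("header", "true").csv("output")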

I am also able to do the same thing using several RDD.map() operations.
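For comparison, a sketch of the same pipeline with plain RDDs (same hypothetical files; note that a keyed join is needed in addition to map(), and this naive split breaks on quoted commas):

    // Load raw lines, drop the header row, and key each record by department_id.
    val peopleLines = spark.sparkContext.textFile("people.csv")
    val peopleHeader = peopleLines.first()
    val peopleByDept = peopleLines.filter(_ != peopleHeader).map { line =>
      val Array(name, age, deptId) = line.split(",", -1)
      (deptId, (name, age))
    }

    val deptLines = spark.sparkContext.textFile("departments.csv")
    val deptHeader = deptLines.first()
    val deptNames = deptLines.filter(_ != deptHeader).map { line =>
      val Array(deptId, deptName, _) = line.split(",", -1)
      (deptId, deptName)
    }

    // join yields (deptId, ((name, age), deptName)); reassemble a CSV line.
    val merged = peopleByDept.join(deptNames).map {
      case (_, ((name, age), deptName)) => s"$name,$age,$deptName"
    }
    merged.saveAsTextFile("output-rdd")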

And I am also able to do the same by executing HiveQL queries through a HiveContext.
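A sketch of the SQL route as well; on Spark 2.x+ the SparkSession subsumes HiveContext, while on 1.x you would register the tables with registerTempTable and call hiveContext.sql instead:

    // Expose the DataFrames as SQL views, then join declaratively.
    people.createOrReplaceTempView("people")
    departments.createOrReplaceTempView("departments")

    val sqlResult = spark.sql("""
      SELECT p.name, p.age, d.department_name
      FROM people p
      JOIN departments d ON p.department_id = d.department_id
    """)
    sqlResult.coalesce(1).write.option("header", "true").csv("output-sql")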

I want to know which of these approaches is the most efficient when the CSV files are huge, and why.


Answer 1:


Both DataFrames and Spark SQL queries are optimized by the Catalyst engine, so I would guess they will deliver similar performance (assuming you are on version >= 1.3).

Both should also be faster than plain RDD operations, because with RDDs Spark has no knowledge of the types or structure of your data, so it cannot apply any special optimizations.
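You can verify this yourself: asking Catalyst to print its plans (a sketch, reusing result and sqlResult from the question) typically shows the DataFrame and SQL versions compiling to the same optimized physical plan, while an RDD pipeline runs exactly as written with no such rewriting.

    // Prints the parsed, analyzed, optimized, and physical plans Catalyst produced.
    result.explain(true)
    sqlResult.explain(true)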




Answer 2:


This blog post contains benchmarks showing that DataFrames are much more efficient than RDDs:

https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

Here is the relevant snippet from the post:

At a high level, there are two kinds of optimizations. First, Catalyst applies logical optimizations such as predicate pushdown. The optimizer can push filter predicates down into the data source, enabling the physical execution to skip irrelevant data. In the case of Parquet files, entire blocks can be skipped and comparisons on strings can be turned into cheaper integer comparisons via dictionary encoding. In the case of relational databases, predicates are pushed down into the external databases to reduce the amount of data traffic. Second, Catalyst compiles operations into physical plans for execution and generates JVM bytecode for those plans that is often more optimized than hand-written code. For example, it can choose intelligently between broadcast joins and shuffle joins to reduce network traffic. It can also perform lower level optimizations such as eliminating expensive object allocations and reducing virtual function calls. As a result, we expect performance improvements for existing Spark programs when they migrate to DataFrames.
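To make predicate pushdown concrete, here is a sketch against a hypothetical Parquet file (people.parquet is assumed to exist, with a numeric age column; the spark session is as in the question):

    import org.apache.spark.sql.functions.col

    // The age filter is pushed into the Parquet scan itself, so row groups
    // whose column statistics rule out age > 21 are never read at all.
    val adults = spark.read.parquet("people.parquet").filter(col("age") > 21)

    // In the output, look for the FileScan line and its PushedFilters entry,
    // e.g. "PushedFilters: [IsNotNull(age), GreaterThan(age,21)]".
    adults.explain()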

Here is the performance benchmark: https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM.png




Answer 3:


The overall direction for Spark is to move toward DataFrames, so that queries are optimized through Catalyst.



Source: https://stackoverflow.com/questions/31453324/which-is-efficient-dataframe-or-rdd-or-hiveql
