Question
I have seen a DAG being generated whenever we perform any operation on an RDD, but what happens when we perform operations on a DataFrame?
When executing multiple operations on a DataFrame, are those lazily evaluated just like with RDDs?
When does the Catalyst optimizer come into the picture?
I am somewhat confused about these. If anyone can shed some light on these topics, it would be a great help.
Thanks
Answer 1:
Every operation on a Dataset, continuous processing mode notwithstanding, is translated into a sequence of operations on internal RDDs. Therefore the concept of a DAG is by all means applicable.
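To make the DAG idea concrete, here is a minimal pure-Python sketch (an illustration only, not Spark's internal classes): each operation returns a new node that remembers its parent, and walking the parent pointers back to the source recovers the lineage graph.

```python
# Minimal sketch: each transformation creates a node pointing at its parent,
# forming a lineage graph (a DAG). Node names below are hypothetical.
class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent

    def transform(self, name):
        # Each operation returns a NEW node that remembers its parent,
        # rather than mutating anything in place.
        return Node(name, parent=self)

def lineage(node):
    """Walk parent pointers back to the source, oldest step first."""
    chain = []
    while node is not None:
        chain.append(node.name)
        node = node.parent
    return list(reversed(chain))

plan = Node("scan").transform("select").transform("filter").transform("groupBy")
print(" -> ".join(lineage(plan)))  # scan -> select -> filter -> groupBy
```

This is the same shape of structure Spark reports when you inspect a job's lineage or execution plan: a graph of operations, not the data itself.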
By extension, execution is primarily lazy, though as always exceptions exist, and they are more common in the Dataset API than in the pure RDD API.
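The lazy-evaluation model described here can be sketched in a few lines of plain Python (again, this is not Spark's implementation, just the concept): transformations are only recorded, and nothing runs until an action is called.

```python
# Sketch of lazy evaluation: transformations record a pending step and
# return immediately; the action (collect) applies them all at once.
class LazyDataset:
    def __init__(self, data, pending=None):
        self._data = data
        self._pending = pending or []  # recorded, not-yet-executed steps

    def map(self, fn):
        # Transformation: nothing is computed here.
        return LazyDataset(self._data, self._pending + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self._data, self._pending + [("filter", pred)])

    def collect(self):
        # Action: only now are the recorded steps actually applied, in order.
        rows = self._data
        for kind, fn in self._pending:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Up to this point nothing has executed; the action triggers the pipeline:
print(ds.collect())  # -> [20, 30, 40]
```

DataFrame transformations like `select` and `where` behave this way too; actions such as `count`, `collect`, or writing output are what trigger execution.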
Finally, Catalyst is responsible for transforming Dataset API calls into logical, optimized logical, and physical execution plans, and finally for generating the code that will be executed by the tasks.
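What a rule-based optimizer like Catalyst does conceptually is rewrite a logical plan tree using rules. The toy sketch below illustrates one such rewrite, merging two adjacent `Filter` nodes into one; it is a hypothetical illustration, not Catalyst's actual rules or data structures.

```python
# Toy rule-based plan optimizer: rewrite the plan tree bottom-up,
# collapsing Filter(Filter(x)) into a single Filter with a combined predicate.
from dataclasses import dataclass
from typing import Callable, Any

@dataclass
class Scan:            # leaf node: read the source rows
    rows: list

@dataclass
class Filter:          # unary node: keep rows matching pred
    pred: Callable
    child: Any

def optimize(plan):
    """Apply one rewrite rule bottom-up: merge adjacent Filter nodes."""
    if isinstance(plan, Filter):
        child = optimize(plan.child)
        if isinstance(child, Filter):
            p1, p2 = plan.pred, child.pred
            return Filter(lambda r: p1(r) and p2(r), child.child)
        return Filter(plan.pred, child)
    return plan

def depth(plan):
    return 1 + depth(plan.child) if isinstance(plan, Filter) else 1

logical = Filter(lambda r: r % 2 == 0,
                 Filter(lambda r: r > 3, Scan([1, 2, 3, 4, 5, 6])))
optimized = optimize(logical)
print(depth(logical), depth(optimized))  # -> 3 2 (the two filters were merged)
```

In real Spark you can watch the analogous phases, from parsed logical plan down to physical plan, by calling `explain(True)` on a DataFrame.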
Answer 2:
RDDs are the building blocks of Spark. No matter which abstraction we use, DataFrame or Dataset, internally the final computation is done on RDDs.
That is, a DAG is also created when you perform operations on DataFrames.
This link is helpful: https://medium.com/@thejasbabu/spark-dataframes-10c349de04c
For the Catalyst optimizer, you can follow this link for more info: https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783987061/4/ch04lvl1sec31/understanding-the-catalyst-optimizer
Source: https://stackoverflow.com/questions/54375492/is-dag-created-when-we-perform-operations-over-dataframes