Should I Avoid groupby() in Dataset/Dataframe? [duplicate]

痞子三分冷 posted on 2019-12-06 06:04:26

groupBy should be used on DataFrames and Datasets. Your thinking is completely right: the Catalyst optimizer builds the plan and optimizes the groupBy along with any other aggregations you want to perform.
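To make the idea concrete, here is a toy model in plain Python. It does not use Spark; it only illustrates one optimization an engine can apply to a grouped aggregation, namely summing inside each partition before the shuffle instead of shuffling every raw record. The function names, the sample data, and the byte-sum partitioner are all made up for this sketch.

```python
from collections import defaultdict

def partition_of(key, num_partitions):
    # Hypothetical stand-in for a hash partitioner (assumes string keys).
    return sum(key.encode("utf-8")) % num_partitions

def shuffle(partitions, num_partitions):
    """Route each (key, value) record to its target partition and count records moved."""
    targets = [[] for _ in range(num_partitions)]
    moved = 0
    for part in partitions:
        for key, value in part:
            targets[partition_of(key, num_partitions)].append((key, value))
            moved += 1
    return targets, moved

def sum_by_key(records):
    """Sum the values of all records that share a key."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)

def group_then_sum(partitions):
    """Shuffle every raw record first, then aggregate on the receiving side."""
    shuffled, moved = shuffle(partitions, len(partitions))
    result = {}
    for part in shuffled:
        result.update(sum_by_key(part))
    return result, moved

def partial_then_sum(partitions):
    """Pre-aggregate inside each partition, shuffle only the partial sums, then merge."""
    partials = [list(sum_by_key(part).items()) for part in partitions]
    shuffled, moved = shuffle(partials, len(partitions))
    result = {}
    for part in shuffled:
        result.update(sum_by_key(part))
    return result, moved

# Two input partitions of (key, value) records (made-up sample data).
partitions = [
    [("a", 1), ("b", 2), ("a", 3), ("a", 4)],
    [("b", 5), ("a", 6), ("b", 7), ("c", 8)],
]

naive_result, naive_moved = group_then_sum(partitions)
partial_result, partial_moved = partial_then_sum(partitions)

# Same totals either way, but fewer records cross the shuffle with pre-aggregation.
print(naive_result == partial_result, naive_moved, partial_moved)  # True 8 5
```

Both strategies return the same totals, but the second moves 5 records instead of 8. In Spark itself, calling explain() on a grouped DataFrame aggregation typically shows a partial aggregate step before the exchange and a final one after it, which is the same idea applied by the planner.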

There is a good example, from Spark 1.4, at this link that shows a comparison of reduceByKey on an RDD with groupBy on a DataFrame.

You can see that the DataFrame version is much faster than the RDD one, because groupBy lets Spark optimize the whole execution. For more details, see the official Databricks post introducing DataFrames.
