Spark's dataframe count() function taking very long

Submitted by 余生颓废 on 2019-12-11 15:35:25

Question


In my code, I have a sequence of DataFrames where I want to filter out the ones that are empty. I'm doing something like:

Seq(df1, df2).map(df => df.count() > 0)

However, this is extremely slow: it takes around 7 minutes for just two DataFrames of roughly 100k rows each.

My question: Why is Spark's implementation of count() slow? Is there a workaround?


Answer 1:


A DataFrame is built lazily: nothing is computed until an action such as count() is called. So the cost you are seeing is not the count itself but all the pending transformations Spark must execute to materialize the DataFrame when count() triggers evaluation.

Some of those transformations may be costly because they require shuffling data across the cluster, such as groupBy or reduce.

So my guess is that either you have some complex processing to produce these DataFrames, or the initial data you built them from is very large.
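If the goal is only to test emptiness, a common workaround is to avoid count() entirely, since count() must scan every partition. Taking at most one row is much cheaper. A minimal sketch, assuming DataFrames `df1` and `df2` as in the question:

```scala
// Cheaper emptiness check: head(1) fetches at most one row instead of
// scanning all partitions the way count() does.
val nonEmptyDfs = Seq(df1, df2).filter(df => df.head(1).nonEmpty)

// If the same DataFrames are used again later, caching avoids re-running
// the upstream transformations on every action:
val cached = Seq(df1, df2).map(_.cache())
val nonEmptyCached = cached.filter(df => df.head(1).nonEmpty)
```

On recent Spark versions (2.4+), `df.isEmpty` expresses the same check directly.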



Source: https://stackoverflow.com/questions/45859296/sparks-dataframe-count-function-taking-very-long
