Spark's dataframe count() function taking very long

Submitted by 余生颓废 on 2019-12-11 15:35:25

Question


In my code, I have a sequence of DataFrames where I want to filter out the ones that are empty. I'm doing something like:

Seq(df1, df2).map(df => df.count() > 0)

However, this is extremely slow: it takes around 7 minutes for just two DataFrames of roughly 100k rows each.

My question: Why is Spark's implementation of count() slow? Is there a workaround?


Answer 1:


A DataFrame is built lazily: nothing is computed until an action such as count() is called. So the cost you are seeing is not the count itself but all the pending transformations Spark must execute to materialize the DataFrame when count() triggers evaluation.

Some of those transformations may be costly because they require shuffling data across the cluster, such as groupBy or reduce.

So my guess is that either you have some complex processing to produce these DataFrames, or the initial data you built them from is very large.
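If the goal is only to test emptiness, a common workaround is to avoid count() entirely, since count() must scan every partition. Taking at most one row is much cheaper. A minimal sketch, assuming DataFrames `df1` and `df2` as in the question:

```scala
// Cheaper emptiness check: head(1) fetches at most one row instead of
// scanning all partitions the way count() does.
val nonEmptyDfs = Seq(df1, df2).filter(df => df.head(1).nonEmpty)

// If the same DataFrames are used again later, caching avoids re-running
// the upstream transformations on every action:
val cached = Seq(df1, df2).map(_.cache())
val nonEmptyCached = cached.filter(df => df.head(1).nonEmpty)
```

On recent Spark versions (2.4+), `df.isEmpty` expresses the same check directly.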



Source: https://stackoverflow.com/questions/45859296/sparks-dataframe-count-function-taking-very-long
