Spark DataSet filter performance

Submitted by 我是研究僧i on 2019-12-03 08:22:14

It's because of step 3 here.

In the first two, Spark doesn't need to deserialize the whole Java/Scala object: it just reads the one column and moves on.

In the third, since you're using a lambda function, Spark can't tell that you only need the one field, so it deserializes all 33 fields for every row just to check that single field.
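The contrast between the two filter styles can be sketched as follows. This is a minimal example, assuming a case class `Data` standing in for the real 33-field row type (the names `Data` and `c1` are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the real 33-field row type.
case class Data(c1: Int, c2: String /* ...plus many more fields in the real schema */)

object FilterStyles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("filter-styles").getOrCreate()
    import spark.implicits._

    val ds = Seq(Data(1, "a"), Data(2, "b"), Data(3, "c")).toDS()

    // Column expression: Catalyst sees exactly which column is needed and
    // can read just c1 from the encoded row, no full deserialization.
    val columnFilter = ds.filter($"c1" > 1)

    // Typed lambda: opaque to Catalyst, so every row is deserialized into
    // a full Data object before the predicate runs.
    val lambdaFilter = ds.filter(d => d.c1 > 1)

    // Same result either way; only the physical plans differ.
    assert(columnFilter.count() == 2)
    assert(lambdaFilter.count() == 2)

    spark.stop()
  }
}
```

Calling `.explain()` on each makes the difference visible: the lambda version's plan contains a `DeserializeToObject` / `Filter` / `SerializeFromObject` chain that the column-expression version avoids.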

I'm not sure why the fourth is so slow. It seems like it would work the same way as the first.

When running Python, what happens is that your code is first loaded onto the JVM, interpreted, and then finally compiled into bytecode. With the Scala API, Scala runs natively on the JVM, so you cut out the entire load-Python-code-into-the-JVM step.
