Question
I have a Spark program that reads a relatively big dataframe (~3.2 terabytes) with two columns, id and name, and another relatively small dataframe (~20k entries) that contains a single column: id.
What I'm trying to do is take both the id and the name from the big dataframe for the rows whose id appears in the small dataframe.
I was wondering what would be an efficient solution to get this working, and why. Several options I had in mind (sketched below):
- Broadcast join the two dataframes
- Collect the small dataframe as an array of strings, broadcast it, and then filter the big dataframe using isin against that array
Are there any other options that I didn't mention here?
I'd appreciate it if someone could also explain why one solution is more efficient than the other.
Thanks in advance
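For reference, here is a minimal sketch of the two options I have in mind; the dataframe names, input paths, and the assumption that id is a string column are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("filter-big-by-small").getOrCreate()
    import spark.implicits._

    // Hypothetical inputs: the big dataframe has columns (id, name), the small one has (id).
    val bigDF   = spark.read.parquet("/data/big")    // placeholder path
    val smallDF = spark.read.parquet("/data/small")  // placeholder path

    // Option 1: broadcast (left-semi) join. The small side is shipped to every executor,
    // so the 3.2 TB side is never shuffled, and only bigDF's columns are returned.
    val viaJoin = bigDF.join(broadcast(smallDF), Seq("id"), "left_semi")

    // Option 2: collect the ~20k ids to the driver and filter with isin.
    // Assumes id is a string column.
    val ids = smallDF.select("id").as[String].collect()
    val viaIsin = bigDF.filter($"id".isin(ids: _*))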
Answer 1:
AFAIK, it all depends on the size of the data you are handling and on performance.
If you use the broadcast function, the default size limit for the small dataframe is 10 MB (controlled by spark.sql.autoBroadcastJoinThreshold, see my answer); you can increase or decrease it based on your data. Also, the broadcasted dataframe becomes part of the SQL execution plan and serves as a hint to the Catalyst optimizer for further optimization. Also see my answer here, where a broadcast shared variable (which is what you would use with isin) does not have the above advantage.
Please see my answer at the link in my comment above.
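As a rough illustration (the config value and dataframe names are placeholders, not part of the original answer), you can raise the threshold and check the physical plan to confirm the broadcast is picked up:

    // Raise the auto-broadcast limit above the 10 MB default if the small dataframe needs it;
    // an explicit broadcast() hint forces the broadcast regardless of this threshold.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "52428800")  // 50 MB, illustrative

    // The broadcast shows up in the physical plan as a BroadcastHashJoin,
    // which Catalyst can optimize around; a broadcast variable used with isin would not.
    bigDF.join(broadcast(smallDF), Seq("id"), "left_semi").explain()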
Source: https://stackoverflow.com/questions/40717062/spark-efficiently-filtering-entries-from-big-dataframe-that-exist-in-a-small-dat