Spark efficiently filtering entries from big dataframe that exist in a small dataframe

Submitted by 自作多情 on 2021-01-27 07:44:02

Question


I have a Spark program that reads a relatively big dataframe (~3.2 terabytes) with two columns, id and name, and another relatively small dataframe (~20k entries) with a single column, id.

What I'm trying to do is take the id and name of every row in the big dataframe whose id appears in the small dataframe.

I was wondering what an efficient solution would be, and why. Several options I had in mind:

  1. Broadcast join the two dataframes
  2. Broadcast the small dataframe, collected as an array of strings, and then filter the big dataframe using isin with that array
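
For concreteness, the two options could be sketched in Scala roughly as follows. This is a minimal sketch under stated assumptions, not the actual program: the object name FilterBySmallIds, the helpers viaBroadcastJoin and viaIsin, and the toy data are hypothetical, and id is assumed to be a string column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col}

// Hypothetical names throughout; `id` is assumed to be a string column.
object FilterBySmallIds {

  // Option 1: broadcast join. The broadcast() hint asks Spark to ship the
  // small side to every executor instead of shuffling the big side.
  def viaBroadcastJoin(big: DataFrame, small: DataFrame): DataFrame =
    big.join(broadcast(small.select("id").distinct()), Seq("id"), "inner")

  // Option 2: collect the ids to the driver and filter with isin.
  // The collected ids end up as literals inside the query plan.
  def viaIsin(big: DataFrame, small: DataFrame): DataFrame = {
    val ids: Array[String] =
      small.select("id").distinct().collect().map(_.getString(0))
    big.filter(col("id").isin(ids: _*))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-by-small-ids")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-ins for the ~3.2 TB and ~20k-entry dataframes.
    val big = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Carol")).toDF("id", "name")
    val small = Seq("a", "c").toDF("id")

    viaBroadcastJoin(big, small).show()
    viaIsin(big, small).show()

    spark.stop()
  }
}
```

Both helpers return the id and name columns of the matching rows. The distinct() call on the small side is there so that duplicate ids in the small dataframe do not duplicate rows in the join result.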

Are there any other options that I didn't mention here?

I'd appreciate it if someone could also explain why one solution is more efficient than the other.

Thanks in advance


Answer 1:


AFAIK it all depends on the size of the data you are handling and on the performance you need.

  • If you use a broadcast join, the default size limit for automatic broadcasting is 10 MB (it applies to your small dataframe and is set via spark.sql.autoBroadcastJoinThreshold; see my linked answer). You can increase or decrease the limit based on your data. Also, the broadcasted data will be part of the SQL execution plan, which gives the Catalyst optimizer a pointer for further optimization.

  • A broadcast shared variable (which is what you would use with isin), on the other hand, doesn't have the above advantage.
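
As a rough illustration of the first point, the threshold can be changed at runtime and the resulting physical plan inspected. This is a sketch under assumptions: the object name BroadcastThresholdSketch, the 50 MB value, and the toy data are illustrative and not taken from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical object name; values below are for illustration only.
object BroadcastThresholdSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-threshold")
      .getOrCreate()
    import spark.implicits._

    // The threshold is given in bytes; the default is 10 MB.
    // Setting it to -1 disables automatic broadcasting.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)

    // Toy stand-ins for the big and small dataframes.
    val big = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Carol")).toDF("id", "name")
    val small = Seq("a", "c").toDF("id")

    // Explicit broadcast hint on the small side of the join.
    val joined = big.join(broadcast(small), Seq("id"))

    // Print the physical plan to check which join strategy Spark picked.
    joined.explain()

    spark.stop()
  }
}
```

With the hint in place, the printed plan would typically contain a broadcast hash join node rather than a sort-merge join, which is one way to confirm that the big dataframe is not being shuffled.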

Please see my answer at the link in my comment above.



Source: https://stackoverflow.com/questions/40717062/spark-efficiently-filtering-entries-from-big-dataframe-that-exist-in-a-small-dat
