Question
I have a Spark program that reads a relatively big dataframe (~3.2 terabytes) with two columns, id and name, and another relatively small dataframe (~20k entries) that contains a single column: id.
What I'm trying to do is take both the id and the name from the big dataframe for the rows whose id appears in the small dataframe.
I was wondering what would be an efficient solution to get this working, and why. Several options I had in mind (sketched below):
- Broadcast join the two dataframes
- Collect the small dataframe as an array of strings, broadcast it, and then filter the big dataframe using isin against that array
Are there any other options that I didn't mention here?
I'd appreciate it if someone could also explain why one solution is more efficient than the other.
Thanks in advance
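For reference, here is a minimal sketch of the two options I have in mind; the dataframe names, input paths, and the assumption that id is a string column are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("filter-big-by-small").getOrCreate()
    import spark.implicits._

    // Hypothetical inputs: the big dataframe has columns (id, name), the small one has (id).
    val bigDF   = spark.read.parquet("/data/big")    // placeholder path
    val smallDF = spark.read.parquet("/data/small")  // placeholder path

    // Option 1: broadcast (left-semi) join. The small side is shipped to every executor,
    // so the 3.2 TB side is never shuffled, and only bigDF's columns are returned.
    val viaJoin = bigDF.join(broadcast(smallDF), Seq("id"), "left_semi")

    // Option 2: collect the ~20k ids to the driver and filter with isin.
    // Assumes id is a string column.
    val ids = smallDF.select("id").as[String].collect()
    val viaIsin = bigDF.filter($"id".isin(ids: _*))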
Answer 1:
AFAIK, it all depends on the size of the data you are handling and on performance.
If you use the broadcast function, the default size limit for the small dataframe is 10 MB (controlled by spark.sql.autoBroadcastJoinThreshold, see my answer); you can increase or decrease it based on your data. Also, the broadcasted dataframe becomes part of the SQL execution plan and serves as a hint to the Catalyst optimizer for further optimization. Also see my answer here, where a broadcast shared variable (which is what you would use with isin) does not have the above advantage.
Please see my answer at the link in my comment above.
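As a rough illustration (the config value and dataframe names are placeholders, not part of the original answer), you can raise the threshold and check the physical plan to confirm the broadcast is picked up:

    // Raise the auto-broadcast limit above the 10 MB default if the small dataframe needs it;
    // an explicit broadcast() hint forces the broadcast regardless of this threshold.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "52428800")  // 50 MB, illustrative

    // The broadcast shows up in the physical plan as a BroadcastHashJoin,
    // which Catalyst can optimize around; a broadcast variable used with isin would not.
    bigDF.join(broadcast(smallDF), Seq("id"), "left_semi").explain()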
Source: https://stackoverflow.com/questions/40717062/spark-efficiently-filtering-entries-from-big-dataframe-that-exist-in-a-small-dat