Joining a large and a ginormous Spark DataFrame
I have two DataFrames: df1 has 6 million rows and df2 has 1 billion. I have tried the standard df1.join(df2, df1("id") <=> df2("id2")), but I run out of memory. df1 is too large to use in a broadcast join. I even tried a Bloom filter, but it was also too large to fit in a broadcast and still be useful. The only approach that doesn't error out is to break df1 into 300,000-row chunks and join each chunk with df2 in a foreach loop. But this takes an order of magnitude longer than it probably should (likely because the data is too large to persist, so the work of splitting up to that point gets redone on every iteration).
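Roughly, the chunked version has this shape (simplified sketch, not my exact code: here I split df1 into buckets by hashing the key instead of taking fixed 300,000-row slices, and the column names id/id2 are the same join keys as above):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch of the chunk-and-join workaround: split df1 into smaller pieces,
// join each piece against the huge df2, and union the partial results.
def chunkedJoin(df1: DataFrame, df2: DataFrame, numChunks: Int): DataFrame = {
  // Assign each df1 row to one of numChunks buckets based on the join key.
  val bucketed = df1.withColumn("bucket", pmod(hash(col("id")), lit(numChunks)))

  // Join one bucket at a time, then union all the partial results together.
  (0 until numChunks)
    .map { b =>
      bucketed
        .filter(col("bucket") === b)
        .drop("bucket")
        .join(df2, col("id") <=> col("id2"))
    }
    .reduce(_ union _)
}
```

Each iteration works, but the lineage back to the original split is recomputed for every bucket, which is where I think the slowdown comes from.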