pyspark: isin vs join

北战南征, posted on 2019-12-03 06:42:15

Given the import

import pyspark.sql.functions as psf

There are two types of broadcasting:

  • sc.broadcast() to copy Python objects to every node, for more efficient use of isin (a Column method, e.g. psf.col("x").isin(...), not a function in psf)
  • psf.broadcast inside a join to copy a PySpark dataframe to every node when that dataframe is small: df1.join(psf.broadcast(df2)). It is usually used for cartesian products (CROSS JOIN in Pig).

In the context of the question, the filtering was done using a column of another dataframe, hence the possible solution with a join.

Keep in mind that if your filtering list is relatively big, searching through it will take a while, and since this has to be done for each row it can quickly get costly.

Joins, on the other hand, involve two dataframes that will be sorted before matching, so if your list is small enough you might not want to sort a huge dataframe just for a filter.
