Pyspark isin function


Question:

I am a beginner in Spark. I am converting my legacy Python code to Spark using PySpark.

I would like to get a PySpark equivalent of the code below:

usersofinterest = actdataall[actdataall['ORDValue'].isin(orddata['ORDER_ID'].unique())]['User ID'] 

Both actdataall and orddata are Spark DataFrames.

I don't want to use the toPandas() function, given the drawbacks associated with it.

Any help is appreciated.

Answer 1:

  • If both dataframes are big, you should consider using an inner join which will work as a filter:

    First let's create a dataframe containing the order IDs we want to keep:

    orderid_df = orddata.select(orddata.ORDER_ID.alias("ORDValue")).distinct() 

    Now let's join it with our actdataall dataframe:

    usersofinterest = actdataall.join(orderid_df, "ORDValue", "inner").select('User ID').distinct() 
  • If your target list of order IDs is small, then you can use the pyspark.sql isin function as mentioned in furianpandit's post. Don't forget to broadcast your variable before using it (Spark will copy the object to every node, making their tasks a lot faster):

    orderid_list = orddata.select('ORDER_ID').distinct().rdd.flatMap(lambda x: x).collect() 
    orderid_broadcast = sc.broadcast(orderid_list) 
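
    For completeness, a minimal sketch of how the broadcast list could then be applied with isin (the variable name orderid_broadcast comes from the corrected snippet above and is not part of the original answer):

    usersofinterest = actdataall.filter(actdataall['ORDValue'].isin(orderid_broadcast.value)).select('User ID').distinct() 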


Answer 2:

The most direct translation of your code would be:

from pyspark.sql import functions as F

# collect all the unique ORDER_IDs to the driver
order_ids = [x.ORDER_ID for x in orddata.select('ORDER_ID').distinct().collect()]

# filter ORDValue column by list of order_ids, then select only User ID column
usersofinterest = actdataall.filter(F.col('ORDValue').isin(order_ids)).select('User ID')

However, you should only filter like this if the number of 'ORDER_ID's is definitely small (perhaps <100,000 or so).

If the number of 'ORDER_ID's is large, you should use a broadcast variable, which sends the list of order_ids to each executor so it can compare against them locally for faster processing. Note that this will work even if the number of 'ORDER_ID's is small.

order_ids = [x.ORDER_ID for x in orddata.select('ORDER_ID').distinct().collect()]
order_ids_broadcast = sc.broadcast(order_ids)  # send to broadcast variable
usersofinterest = actdataall.filter(F.col('ORDValue').isin(order_ids_broadcast.value)).select('User ID')

For more information on broadcast variables, check out: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-broadcast.html
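
As a side note, here is a minimal, self-contained sketch of the broadcast mechanism itself; the dictionary and RDD are made up purely for illustration, and only an existing SparkContext `sc` is assumed:

# a broadcast variable is created once on the driver and read via .value on every executor
lookup = sc.broadcast({'A': 1, 'B': 2})
rdd = sc.parallelize(['A', 'B', 'A'])
print(rdd.map(lambda k: lookup.value[k]).collect())  # [1, 2, 1]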



Answer 3:

So, you have two Spark DataFrames: one is actdataall and the other is orddata. Use the following command to get your desired result.

usersofinterest = actdataall.where(actdataall['ORDValue'].isin(orddata.select('ORDER_ID').distinct().rdd.flatMap(lambda x: x).collect())).select('User ID') 

