Question
I have two tables. One holds the core data, with a pair of IDs (PC1 and P2) and some blob data (P3). The other is a blacklist of ID pairs (P1 and B1) to match against the first table. I will call the first table in_df and the second blacklist_df.
What I want to do is remove rows from in_df as long as in_df.PC1 == blacklist_df.P1 and in_df.P2 == blacklist_df.B1. Here is a code snippet that shows what I want to achieve more explicitly.
in_df = sqlContext.createDataFrame(
    [[1, 2, 'A'], [2, 1, 'B'], [3, 1, 'C'], [4, 11, 'D'], [1, 3, 'D']],
    ['PC1', 'P2', 'P3'])
in_df.show()
+---+---+---+
|PC1| P2| P3|
+---+---+---+
| 1| 2| A|
| 2| 1| B|
| 3| 1| C|
| 4| 11| D|
| 1| 3| D|
+---+---+---+
blacklist_df = sqlContext.createDataFrame([[1,2],[2,1]],['P1','B1'])
blacklist_df.show()
+---+---+
| P1| B1|
+---+---+
| 1| 2|
| 2| 1|
+---+---+
In the end, what I want to get is the following:
+---+--+--+
|PC1|P2|P3|
+---+--+--+
| 1| 3| D|
| 3| 1| C|
| 4|11| D|
+---+--+--+
I tried a LEFT_ANTI join, but I haven't been successful. Thanks!
Answer 1:
Pass the join conditions as a list to the join function, and specify how='left_anti' as the join type:
in_df.join(
    blacklist_df,
    [in_df.PC1 == blacklist_df.P1, in_df.P2 == blacklist_df.B1],
    how='left_anti'
).show()
+---+---+---+
|PC1| P2| P3|
+---+---+---+
| 1| 3| D|
| 4| 11| D|
| 3| 1| C|
+---+---+---+
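To make the semantics of the left anti join concrete, here is a minimal plain-Python sketch of the same filtering. It is not Spark code, and the helper name left_anti is made up for illustration. It only shows the rule the join applies: a row of in_df is kept when its (PC1, P2) pair matches no (P1, B1) pair in blacklist_df.

```python
# Plain-Python illustration of left anti join semantics (no Spark involved).
# The data mirrors in_df and blacklist_df from the question.
in_rows = [(1, 2, 'A'), (2, 1, 'B'), (3, 1, 'C'), (4, 11, 'D'), (1, 3, 'D')]
blacklist_rows = [(1, 2), (2, 1)]


def left_anti(left, right):
    """Hypothetical helper: keep each left row whose first two fields
    (PC1, P2) do not appear as a (P1, B1) pair in right."""
    blocked = set(right)  # set lookup makes each membership check cheap
    return [row for row in left if (row[0], row[1]) not in blocked]


result = left_anti(in_rows, blacklist_rows)
for row in result:
    print(row)
```

Both columns have to match for a row to be dropped, which is why (1, 3, 'D') survives even though PC1 = 1 appears in the blacklist. Row order in the Spark output may differ from this sketch, since show() does not guarantee an ordering after a join.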
Source: https://stackoverflow.com/questions/51343937/how-to-left-anti-join-under-some-matching-condition