问题
I have multiple pandas dataframes, to keep it simple, let's say I have three.
>> df1=
col1 col2
id1 A B
id2 C D
id3 B A
id4 E F
>> df2=
col1 col2
id1 B A
id2 D C
id3 M N
id4 F E
>> df3=
col1 col2
id1 A B
id2 D C
id3 N M
id4 E F
The result needed is :
>> df=
col1 col2
id1 A B
id2 C D
id3 E F
Because the pairs (A, B),(C, D),(E, F) appear in all the data frames although it may be reversed.
While using pandas merge it just considers the way columns are passed. To check my observation I tried the following code for two data frames:
df1['reverse_1'] = (df1.col1+df1.col2).isin(df2.col1 + df2.col2)
df1['reverse_2'] = (df1.col1+df1.col2).isin(df2.col2 + df2.col1)
And I found that the results differ:
col1 col2 reverse_1 reverse_2
a b False True
c d False True
b a True False
e f False True
So, if I collect 'True' values from both reverse_1 and reverse_2 columns, I can get the intersect of both the data frames. Even if I do it for two data frames it's not clear to me how to proceed with more data frames (more than two). I am little confused about that. Any suggestions?
回答1:
You can create list of DataFrame
s and in list comprehension sorting per rows with removing duplicates:
dfs = [df1,df2,df3]
L = [pd.DataFrame(np.sort(x.values, axis=1), columns=x.columns).drop_duplicates()
for x in dfs]
print (L)
[ col1 col2
0 A B
1 C D
3 E F, col1 col2
0 A B
1 C D
2 M N
3 E F, col1 col2
0 A B
1 C D
2 M N
3 E F]
And then merge list of DataFrames by all columns (no parameter on
):
from functools import reduce
df = reduce(lambda left,right: pd.merge(left,right), L)
print (df)
col1 col2
0 A B
1 C D
2 E F
Another solution for @pygo:
Create index
by frozenset
s and join together by concat with inner
join, last remove duplicates by index by duplicated with boolean indexing and iloc for get first 2 columns:
df = pd.concat([x.set_index(x.apply(frozenset, axis=1)) for x in dfs], axis=1, join='inner')
df = df.iloc[~df.index.duplicated(), :2]
print (df)
col1 col2
(B, A) A B
(C, D) C D
(F, E) E F
回答2:
Somewhat similar to some of the earlier answers.
import pandas as pd
from io import StringIO
# Test data
df1 = pd.read_table(StringIO ("""
id col1 col2
id1 A B
id2 C D
id3 B A
id4 E F
"""), delim_whitespace = True)
df2 = pd.read_table(StringIO ("""
id col1 col2
id1 B A
id2 D C
id3 M N
id4 F E
"""), delim_whitespace = True)
df3 = pd.read_table(StringIO("""
id col1 col2
id1 A B
id2 D C
id3 N M
id4 E F
"""), delim_whitespace = True)
# List of n dataframes
dfs = [df1, df2, df3]
# Use frozenset to define the column values without regard for order
# pandas apply iterates over each row
# list expression iterates over each dataframe
combined_columns = [pd.Series(df.apply(lambda r: frozenset((r.col1, r.col2)), axis=1), name = 'combined') for df in dfs]
print(combined_columns)
# Results in alist of Series named 'combined'
#[0 (B, A)
# 1 (D, C)
# 2 (B, A)
# 3 (F, E)
# Name: combined, dtype: object,
# 0 (B, A)
# 1 (D, C)
# 2 (N, M)
# 3 (E, F)
# Name: combined, dtype: object,
# 0 (B, A)
# 1 (D, C)
# 2 (M, N)
# 3 (F, E)
# Name: combined, dtype: object]
dfs_combined = [pd.concat([dfs[i], combined_columns[i]], axis = 1) for i in range(len(dfs))]
print(dfs_combined)
# Result in a list of dataframes with the extra columns
#[ id col1 col2 combined
# 0 id1 A B (B, A)
# 1 id2 C D (D, C)
# 2 id3 B A (B, A)
# 3 id4 E F (F, E),
# id col1 col2 combined
# 0 id1 B A (B, A)
# 1 id2 D C (D, C)
# 2 id3 M N (N, M)
# 3 id4 F E (E, F),
# id col1 col2 combined
# 0 id1 A B (B, A)
# 1 id2 D C (D, C)
# 2 id3 N M (M, N)
# 3 id4 E F (F, E)]
# The reduce function operates on pairs, with previous result as the first argument
from functools import reduce
result = reduce(lambda df1, df2: df1[df1['combined'].isin(df2['combined'])], dfs_combined).drop_duplicates(subset='combined')
print(result)
# id col1 col2 combined
#0 id1 A B (B, A)
#1 id2 C D (D, C)
#3 id4 E F (F, E)
来源:https://stackoverflow.com/questions/53521066/how-to-find-the-intersection-of-a-pair-of-columns-in-multiple-pandas-dataframes