How to merge by a column of collection using Python Pandas?

问题

I have 2 lists of Stack Overflow questions, group A and group B. Both have two columns, Id and Tag. e.g:

|Id        |Tag
| -------- | --------------------------------------------
|2         |c#,winforms,type-conversion,decimal,opacity

For each question in group A, I need to find in group B all matched questions that have at least one overlapping tag the question in group A, independent of the position of tags. For example, these questions should all be matched questions:

|Id        |Tag
|----------|---------------------------
|3         |c#
|4         |winforms,type-conversion
|5         |winforms,c#

My first thought was to convert the variable Tag into a set variable and merge using Pandas because set ignores position. However, it seems that Pandas doesn't allow a set variable to be the key variable. So I am now using for loop to search over group B. But it is extremely slow since I have 13 million observation in group B.

My question is: 1. Is there any other way in Python to merge by a column of collection and can tell the number of overlapping tags? 2. How to improve the efficiency of for loop search?

回答1:

This can be achieved using df.join and df.groupby.

This is the setup I'm working with:

df1 = pd.DataFrame({ 'Id' : [2], 'Tag' : [['c#', 'winforms', 'type-conversion', 'decimal', 'opacity']]}) 

   Id                                                Tag
0   2  [c#, winforms, type-conversion, decimal, opacity]

df2 = pd.DataFrame({ 'Id' : [3, 4, 5], 'Tag' : [['c#'], ['winforms', 'type-conversion'], ['winforms', 'c#']]})  

   Id                          Tag
0   3                         [c#]
1   4  [winforms, type-conversion]
2   5               [winforms, c#]

Let's flatten out the right column in both data frames. This helped:

In [2331]: from itertools import chain

In [2332]: def flatten(df):
      ...:     return pd.DataFrame({"Id": np.repeat(df.Id.values, df.Tag.str.len()),
      ...:                          "Tag": list(chain.from_iterable(df.Tag))})
      ...: 

In [2333]: df1 = flatten(df1)

In [2334]: df2 = flatten(df2)

In [2335]: df1.head()
Out[2335]: 
   Id              Tag
0   2               c#
1   2         winforms
2   2  type-conversion
3   2          decimal
4   2          opacity

And similarly for df2, which is also flattened.

Now is the magic. We'll do a join on Tag column, and then groupby on joined IDs to find count of overlapping tags.

In [2337]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y']).count().reset_index()
Out[2337]: 
   Id_x  Id_y  Tag
0     2     3    1
1     2     4    2
2     2     5    2

The output shows each pair of tags along with the number of overlapping tags. Pairs with no overlaps are filtered out by the groupby.

The df.count counts overlapping tags, and df.reset_index just prettifies the output, since groupby assigns the grouped column as the index, so we reset it.

To see matching tags, you'll modify the above slightly:

In [2359]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y'])['Tag'].apply(list).reset_index()
Out[2359]: 
   Id_x  Id_y                          Tag
0     2     3                         [c#]
1     2     4  [winforms, type-conversion]
2     2     5               [c#, winforms]

To filter out 1-overlaps, chain a df.query call to the first expression:

In [2367]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y']).count().reset_index().query('Tag > 1')
Out[2367]: 
   Id_x  Id_y  Tag
1     2     4    2
2     2     5    2

回答2:

Step 1 List down all tags
Step 2 create binary representation of each tag, i.e. use bit 1 or 0 to represent whether have or not have the tag
Step 3 to find any ID share the same tag, you could call a simple apply function to decode your binary representation.

In terms of processing speed, it should be all right. However, if number of tags are too huge, there might be memory issue. If you only need to find questions with same tag for one Id only, I will suggest you write a simple function and calling df.apply. If you need to check for a lot of IDs and find questions with same tag, I will say above approach will be better.

(Intended to leave it as comment, but not enough reputation... sigh)

来源：https://stackoverflow.com/questions/45024037/how-to-merge-by-a-column-of-collection-using-python-pandas

标签

python

pandas

dataframe

merge

set