multi-column factorize in pandas

The pandas factorize function assigns each unique value in a series to a sequential, 0-based index, and calculates which index each series entry belongs to.
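To make that behaviour concrete, here is a minimal single-column sketch. The sample values are invented for illustration and do not come from the question.

```python
import pandas as pd

# A small single-column example (values are made up for illustration).
s = pd.Series(["b", "a", "b", "c", "a"])

# factorize returns (codes, uniques): codes[i] is the position in `uniques`
# of the i-th entry, with codes assigned in order of first appearance.
codes, uniques = pd.factorize(s)

print(codes.tolist())    # [0, 1, 0, 2, 1]
print(uniques.tolist())  # ['b', 'a', 'c']
```

The answers below extend the same idea to rows made of several columns.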

4 Answers

时光取名叫无心

2020-12-28 09:46

import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
# Collapse each (x, y) row into a single tuple, then factorize the tuples.
tuples = df[['x', 'y']].apply(tuple, axis=1)
df['newID'] = pd.factorize(tuples)[0]

不思量自难忘°

2020-12-28 09:54
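You first need to build an ndarray of tuples; pandas.lib.fast_zip does this quickly in a Cython loop. In current pandas releases the pandas.lib module no longer exists and the function lives in the private module pandas._libs.lib, which is not a public API and may change between versions.
```
import pandas as pd
from pandas._libs import lib  # private module; older releases exposed it as pd.lib
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
# fast_zip takes a list of NumPy arrays and returns an object array of tuples.
tuples = lib.fast_zip([df.x.to_numpy(), df.y.to_numpy()])
print(pd.factorize(tuples)[0])
```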
The output is:
```
[0 1 2 2 1 0]
```
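If relying on a private pandas module is a concern, the same tuples can be built with Python's built-in zip. This is only a sketch of that alternative, not the approach from this answer, and it will probably be slower than a Cython loop on large frames.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

# Build one tuple per row with the built-in zip, then factorize the tuples.
pairs = pd.Series(list(zip(df['x'], df['y'])), index=df.index)
codes = pd.factorize(pairs)[0]

print(codes.tolist())  # [0, 1, 2, 2, 1, 0]
```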

自闭症患者

2020-12-28 10:00

I am not sure this is an efficient solution; there may well be better ones.

arr = []  # holds the unique rows of the dataframe, in order of first appearance
for i in range(len(df)):  # positional loop, so it works for any index
    row = df.iloc[i].tolist()
    if row not in arr:
        arr.append(row)

Printing arr gives:

>>> print(arr)
[[1, 1], [1, 2], [2, 2]]

To hold the indices, I would build a second list, ind:

ind = []
for i in range(len(df)):
    # the position of this row in arr is its code
    ind.append(arr.index(df.iloc[i].tolist()))

Printing ind gives:

>>> print(ind)
[0, 1, 2, 2, 1, 0]
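On the efficiency question: the list lookups above rescan arr for every row. A possible single-pass variant keeps the seen rows in a dict instead. This is a sketch of the same first-seen numbering, and the name seen is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

seen = {}  # maps each distinct (x, y) row to its code
ind = []
for row in df.itertuples(index=False, name=None):
    # setdefault stores the next free code the first time a row is seen
    # and returns the stored code on every later occurrence.
    ind.append(seen.setdefault(row, len(seen)))

print(list(seen))  # [(1, 1), (1, 2), (2, 2)]
print(ind)         # [0, 1, 2, 2, 1, 0]
```

Dict lookups are constant time on average, so the loop no longer grows with the number of distinct rows.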

醉话见心

2020-12-28 10:08

You can use drop_duplicates to drop the duplicated rows:

In [23]: df.drop_duplicates()
Out[23]: 
   x  y
0  1  1
1  1  2
2  2  2

EDIT

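To achieve your goal, you can join the original df to the de-duplicated one. Note that the resulting index column holds the row label of each combination's first occurrence, so the values are consecutive here only because the three distinct rows happen to appear first at positions 0, 1 and 2: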

In [46]: df.join(df.drop_duplicates().reset_index().set_index(['x', 'y']), on=['x', 'y'])
Out[46]: 
   x  y  index
0  1  1      0
1  1  2      1
2  2  2      2
3  2  2      2
4  1  2      1
5  1  1      0
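When consecutive codes are needed regardless of where the first occurrences fall, one option is groupby(...).ngroup() with sort=False. The sketch below compares it with the join on a small frame invented for illustration, in which a duplicate appears before the second distinct row.

```python
import pandas as pd

# Hypothetical data: the duplicate of row 0 comes before the next distinct row.
df = pd.DataFrame({'x': [1, 1, 1, 2], 'y': [1, 1, 2, 2]})

# Join approach: 'index' is the label of each row's first occurrence.
first_seen = df.join(
    df.drop_duplicates().reset_index().set_index(['x', 'y']),
    on=['x', 'y'],
)['index']

# ngroup with sort=False numbers the groups 0, 1, 2, ... in order of first appearance.
group_id = df.groupby(['x', 'y'], sort=False).ngroup()

print(first_seen.tolist())  # [0, 0, 2, 3]
print(group_id.tolist())    # [0, 0, 1, 2]
```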
