multi-column factorize in pandas

长发绾君心 2020-12-28 09:35

The pandas factorize function assigns each unique value in a series to a sequential, 0-based index, and calculates which index each series entry belongs to.
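
For reference, a minimal single-column sketch of that behaviour. The sample values are made up for illustration and do not come from the question:

```python
import pandas as pd

# Hypothetical sample data, not from the question.
s = pd.Series(['b', 'a', 'b', 'c'])

# factorize returns (codes, uniques): codes[i] is the position of s[i] in uniques,
# and uniques are listed in order of first appearance.
codes, uniques = pd.factorize(s)

print(codes.tolist())    # [0, 1, 0, 2]
print(uniques.tolist())  # ['b', 'a', 'c']
```

The multi-column case asks for the same kind of codes, with each distinct combination of column values treated as one unique value.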

4 answers
  • 2020-12-28 09:46
    import pandas as pd

    df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
    # Collapse each (x, y) row into a tuple so factorize sees one value per row
    tuples = df[['x', 'y']].apply(tuple, axis=1)
    df['newID'] = pd.factorize(tuples)[0]
    
  • 2020-12-28 09:54

    You first need to create an ndarray of tuples; pandas.lib.fast_zip can do this very quickly in a Cython loop.

    import pandas as pd
    df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
    # pd.lib.fast_zip is only available in older pandas releases
    print(pd.factorize(pd.lib.fast_zip([df.x, df.y]))[0])
    

    The output is:

    [0 1 2 2 1 0]
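
The snippet above relies on `pandas.lib`, which recent pandas releases no longer expose. A possible alternative, assuming a pandas version that provides `GroupBy.ngroup`, is to number the `(x, y)` groups directly. With `sort=False` the ids follow the order of first appearance, which should match what `factorize` produces:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

# sort=False numbers the groups in order of first appearance
# rather than in sorted key order.
ids = df.groupby(['x', 'y'], sort=False).ngroup()

print(ids.tolist())  # [0, 1, 2, 2, 1, 0]
```

This is a sketch of a different technique from the one in the answer, not a drop-in replacement for `fast_zip`.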
    
  • 2020-12-28 10:00

    I am not sure if this is an efficient solution. There might be better solutions for this.

    arr = []  # holds the unique rows of the dataframe
    for i in range(len(df)):  # positional loop, since iloc expects positions
        row = list(df.iloc[i])
        if row not in arr:
            arr.append(row)
    

    Printing arr then gives:

    >>> print(arr)
    [[1, 1], [1, 2], [2, 2]]
    

    To hold the indices, I would declare a list called ind:

    ind = []  # position in arr of each row of the dataframe
    for i in range(len(df)):
        ind.append(arr.index(list(df.iloc[i])))
    

    Printing ind then gives:

    >>> print(ind)
    [0, 1, 2, 2, 1, 0]
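
The list lookups above (`not in arr`, `arr.index`) scan `arr` for every row, so the cost grows with the number of unique rows. Below is a sketch of the same idea with a dictionary keyed by row tuples, which should keep each lookup roughly constant-time. The name `seen` is illustrative and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

seen = {}  # maps each unique row (as a tuple) to the id it was first given
ind = []   # id of every row, in row order
for row in df.itertuples(index=False, name=None):
    # setdefault assigns the next free id the first time a row is seen
    ind.append(seen.setdefault(row, len(seen)))

print(list(seen))  # [(1, 1), (1, 2), (2, 2)]
print(ind)         # [0, 1, 2, 2, 1, 0]
```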
    
  • 2020-12-28 10:08

    You can use drop_duplicates to drop the duplicated rows:

    In [23]: df.drop_duplicates()
    Out[23]: 
          x  y
       0  1  1
       1  1  2
       2  2  2
    

    EDIT

    To achieve your goal, you can join your original df to the de-duplicated one:

    In [46]: df.join(df.drop_duplicates().reset_index().set_index(['x', 'y']), on=['x', 'y'])
    Out[46]: 
       x  y  index
    0  1  1      0
    1  1  2      1
    2  2  2      2
    3  2  2      2
    4  1  2      1
    5  1  1      0
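
Note that the `index` column produced by the join holds the label of the first row in which each `(x, y)` pair occurs. That happens to be consecutive for this data but need not be in general. Below is a hedged variant that assigns consecutive ids explicitly, using `merge` and a hypothetical column name `newID`, with sample data changed so that the difference shows:

```python
import pandas as pd

# Sample data chosen so that first occurrences fall on rows 0, 2 and 3.
df = pd.DataFrame({'x': [1, 1, 2, 3, 2], 'y': [1, 1, 2, 3, 2]})

# One row per unique (x, y) pair, renumbered 0..k-1 in order of first appearance.
uniques = df.drop_duplicates().reset_index(drop=True)
uniques['newID'] = uniques.index

# A left merge keeps the row order of df and attaches the id of each pair.
result = df.merge(uniques, on=['x', 'y'], how='left')

print(result['newID'].tolist())  # [0, 0, 1, 2, 1]
```

With the join from the answer, the same data would give 0, 0, 2, 3, 2, since those are the first-occurrence row labels.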
    