pandas.factorize on an entire data frame

后端 未结 3 1384
青春惊慌失措
青春惊慌失措 2020-12-07 15:13

pandas.factorize encodes input values as an enumerated type or categorical variable.

But how can I easily and efficiently convert many columns of a data frame? What

3条回答
  •  萌比男神i
    2020-12-07 15:59

    You can use apply if you need to factorize each column separately:

    df = pd.DataFrame({'A':['type1','type2','type2'],
                       'B':['type1','type2','type3'],
                       'C':['type1','type3','type3']})
    
    print (df)
           A      B      C
    0  type1  type1  type1
    1  type2  type2  type3
    2  type2  type3  type3
    
    print (df.apply(lambda x: pd.factorize(x)[0]))
       A  B  C
    0  0  0  0
    1  1  1  1
    2  1  2  1
    

    If you need for the same string value the same numeric one:

    print (df.stack().rank(method='dense').unstack())
         A    B    C
    0  1.0  1.0  1.0
    1  2.0  2.0  3.0
    2  2.0  3.0  3.0
    

    If you need to apply the function only for some columns, use a subset:

    df[['B','C']] = df[['B','C']].stack().rank(method='dense').unstack()
    print (df)
           A    B    C
    0  type1  1.0  1.0
    1  type2  2.0  3.0
    2  type2  3.0  3.0
    

    Solution with factorize:

    stacked = df[['B','C']].stack()
    df[['B','C']] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()
    print (df)
           A  B  C
    0  type1  0  0
    1  type2  1  2
    2  type2  2  2
    

    Translate them back is possible via map by dict, where you need to remove duplicates by drop_duplicates:

    vals = df.stack().drop_duplicates().values
    b = [x for x in df.stack().drop_duplicates().rank(method='dense')]
    
    d1 = dict(zip(b, vals))
    print (d1)
    {1.0: 'type1', 2.0: 'type2', 3.0: 'type3'}
    
    df1 = df.stack().rank(method='dense').unstack()
    print (df1)
         A    B    C
    0  1.0  1.0  1.0
    1  2.0  2.0  3.0
    2  2.0  3.0  3.0
    
    print (df1.stack().map(d1).unstack())
           A      B      C
    0  type1  type1  type1
    1  type2  type2  type3
    2  type2  type3  type3
    

提交回复
热议问题