Pandas One hot encoding: Bundling together less frequent categories

Submitted by 喜你入骨 on 2019-12-10 23:13:18

Question


I'm one-hot encoding a categorical column that has about 18 distinct values. I want to create new columns only for the values that appear more often than some threshold (say 1%), plus one more column named other values that is 1 whenever the value is not one of those frequent values.

I'm using pandas with scikit-learn. I've explored pandas get_dummies and scikit-learn's OneHotEncoder, but can't figure out how to bundle the less frequent values into one column.


Answer 1:


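Plan

  • Use pd.get_dummies to one-hot encode as normal.
  • Compare each category's relative frequency against the threshold to identify the columns that get aggregated.
    • I use value_counts with the parameter normalize=True to get each category's fraction of occurrences.
  • Sum the aggregated columns into a single 'other' column and join it to the remaining columns.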
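
The thresholding step in the plan can be checked on its own. This is a minimal sketch, assuming a small example Series (one 'a', two 'b', and so on up to six 'f') and a 10% threshold; the names freq and rare are illustrative and not part of the answer's code:

```python
import numpy as np
import pandas as pd

# Example data: 'a' once, 'b' twice, ..., 'f' six times (21 rows in total).
s = pd.Series(np.repeat(list('abcdef'), np.arange(1, 7)))

# Relative frequency of each category; the fractions sum to 1.
freq = s.value_counts(normalize=True)

# Boolean mask: True for categories rarer than the threshold.
rare = freq < 0.1

print(sorted(rare[rare].index))  # ['a', 'b']
```

Here 'a' (1/21) and 'b' (2/21) fall below 10%, so these are the two columns that would be summed into 'other'.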

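import numpy as np
import pandas as pd

def hot_mess(s, thresh):
    # One-hot encode every category; dtype=int gives 0/1 columns.
    d = pd.get_dummies(s, dtype=int)
    # True for categories whose relative frequency is below thresh.
    f = s.value_counts(sort=False, normalize=True) < thresh
    if f.sum() == 0:
        return d
    else:
        # Keep the frequent columns; add the row-wise sum of the rare ones as 'other'.
        return d.loc[:, ~f].join(d.loc[:, f].sum(axis=1).rename('other'))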

Consider the pd.Series s

s = pd.Series(np.repeat(list('abcdef'), range(1, 7)))

s

0     a
1     b
2     b
3     c
4     c
5     c
6     d
7     d
8     d
9     d
10    e
11    e
12    e
13    e
14    e
15    f
16    f
17    f
18    f
19    f
20    f
dtype: object

hot_mess(s, 0)

    a  b  c  d  e  f
0   1  0  0  0  0  0
1   0  1  0  0  0  0
2   0  1  0  0  0  0
3   0  0  1  0  0  0
4   0  0  1  0  0  0
5   0  0  1  0  0  0
6   0  0  0  1  0  0
7   0  0  0  1  0  0
8   0  0  0  1  0  0
9   0  0  0  1  0  0
10  0  0  0  0  1  0
11  0  0  0  0  1  0
12  0  0  0  0  1  0
13  0  0  0  0  1  0
14  0  0  0  0  1  0
15  0  0  0  0  0  1
16  0  0  0  0  0  1
17  0  0  0  0  0  1
18  0  0  0  0  0  1
19  0  0  0  0  0  1
20  0  0  0  0  0  1

hot_mess(s, .1)

    c  d  e  f  other
0   0  0  0  0      1
1   0  0  0  0      1
2   0  0  0  0      1
3   1  0  0  0      0
4   1  0  0  0      0
5   1  0  0  0      0
6   0  1  0  0      0
7   0  1  0  0      0
8   0  1  0  0      0
9   0  1  0  0      0
10  0  0  1  0      0
11  0  0  1  0      0
12  0  0  1  0      0
13  0  0  1  0      0
14  0  0  1  0      0
15  0  0  0  1      0
16  0  0  0  1      0
17  0  0  0  1      0
18  0  0  0  1      0
19  0  0  0  1      0
20  0  0  0  1      0



Answer 2:


How about something like the following:

Create a data frame:

df = pd.DataFrame(data=list('abbgcca'), columns=['x'])
df

    x
0   a
1   b
2   b
3   g
4   c
5   c
6   a

Next, replace the values that occur less often than a given threshold. I copy the column so that the original data frame is not modified. The first step is to build a dictionary from value_counts and replace each value with its count, so that the counts can be compared against the threshold (expressed as a count, len(x)*thresh). Values below the threshold are set to 'other values', and pd.get_dummies then produces the dummy variables.

# set the threshold, for example 20%
thresh = 0.2
x = df.x.copy()
# replace any value that occurs in fewer than thresh of the rows with 'other values'
x[x.replace(x.value_counts().to_dict()) < len(x)*thresh] = 'other values'
# get dummies
pd.get_dummies(x)

        a       b       c       other values
    0   1.0     0.0     0.0     0.0
    1   0.0     1.0     0.0     0.0
    2   0.0     1.0     0.0     0.0
    3   0.0     0.0     0.0     1.0
    4   0.0     0.0     1.0     0.0
    5   0.0     0.0     1.0     0.0
    6   1.0     0.0     0.0     0.0

Alternatively, you could use Counter, which may be a bit cleaner:

from collections import Counter
x[x.replace(Counter(x)) < len(x)*thresh] = 'other values'
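
For reuse, the steps of this answer can be wrapped in a function. The sketch below is one possible variant, not the answer's own code: it uses Series.map with value_counts(normalize=True) instead of replace, and Series.where instead of boolean assignment. The function name bundle_rare and its parameter names are hypothetical, and dtype=int is passed only so the result shows 0/1 values.

```python
import pandas as pd

def bundle_rare(col, thresh, other_label='other values'):
    """One-hot encode col, grouping categories rarer than thresh (a fraction)."""
    # Map every row to the relative frequency of its own category.
    freq = col.map(col.value_counts(normalize=True))
    # Keep frequent values; overwrite the rest with the shared label.
    # Missing values get a NaN frequency, fail the comparison, and are relabeled too.
    bundled = col.where(freq >= thresh, other_label)
    return pd.get_dummies(bundled, dtype=int)

df = pd.DataFrame(data=list('abbgcca'), columns=['x'])
dummies = bundle_rare(df['x'], 0.2)
print(dummies)
```

With the example data frame and a 20% threshold, only 'g' (1 of 7 rows) is bundled, so the result has the columns a, b, c and other values, matching the output shown above apart from the integer dtype.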


Source: https://stackoverflow.com/questions/43334222/pandas-one-hot-encoding-bundling-together-less-frequent-categories
