I have a large data set in the following format:
id, socialmedia
1, facebook
2, facebook
3, google
4, google
5, google
6, twitter
7, google
8, twitter
9, sn
You could also build a dictionary from the unique values and map it onto the column:
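import pandas as pd
df = pd.DataFrame(dict(id=range(1,5),social=["Facebook","Twitter","Facebook","Google"]))
# Number each unique value from 1, in order of first appearance
d = dict((k,v) for v,k in enumerate(df['social'].unique(),1))
# Look up each row's value in the dictionary
df['groupid'] = df['social'].map(d)
print(df)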
Returns:
   id    social  groupid
0   1  Facebook        1
1   2   Twitter        2
2   3  Facebook        1
3   4    Google        3
Or as a one-liner:
df['groupid'] = df['social'].map({k:v for v,k in enumerate(df['social'].unique(),1)})
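Applied to the sample rows from the question, the dictionary approach might look like the sketch below. The column name socialmedia comes from the question; the variable name mapping is only illustrative. Because unique() keeps values in order of first appearance, the ids follow that order rather than alphabetical order.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "id": range(1, 10),
    "socialmedia": ["facebook", "facebook", "google", "google", "google",
                    "twitter", "google", "twitter", "sn"],
})

# unique() preserves first-appearance order, so ids are assigned in that order
mapping = {name: i for i, name in enumerate(df["socialmedia"].unique(), 1)}
df["groupid"] = df["socialmedia"].map(mapping)

print(mapping)  # {'facebook': 1, 'google': 2, 'twitter': 3, 'sn': 4}
print(df)
```

Any value missing from the dictionary would map to NaN, which is not a concern here since the dictionary is built from the column itself.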
Timings:
%timeit df['grpId']=df.groupby('social').ngroup().add(1)
%timeit df['grpId']=pd.factorize(df['social'])[0]+1
%timeit df['grpId']=df['social'].astype('category').cat.codes.add(1)
%timeit df['groupid'] = df['social'].map(dict((k,v) for v,k in enumerate(df['social'].unique(),1)))
Returns
100 loops, best of 3: 1.5 ms per loop <- Wen1
1000 loops, best of 3: 493 µs per loop <- Wen2
1000 loops, best of 3: 990 µs per loop <- Wen3
1000 loops, best of 3: 802 µs per loop <- Antonvbr
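One thing worth checking besides speed: the four approaches timed above do not all hand out the same numbers. A small sketch on the four-row example frame, assuming default arguments throughout (the by_* variable names are just for illustration): groupby and the category codes number the groups in sorted order, while factorize and the dictionary map number them in order of first appearance.

```python
import pandas as pd

df = pd.DataFrame({"social": ["Facebook", "Twitter", "Facebook", "Google"]})

# Sorted group keys: Facebook=1, Google=2, Twitter=3
by_ngroup = df.groupby("social").ngroup().add(1)
by_category = df["social"].astype("category").cat.codes.add(1)

# Order of first appearance: Facebook=1, Twitter=2, Google=3
by_factorize = pd.factorize(df["social"])[0] + 1
by_map = df["social"].map({k: v for v, k in enumerate(df["social"].unique(), 1)})

print(by_ngroup.tolist())     # [1, 3, 1, 2]
print(by_category.tolist())   # [1, 3, 1, 2]
print(by_factorize.tolist())  # [1, 2, 1, 3]
print(by_map.tolist())        # [1, 2, 1, 3]
```

So if the actual id values matter and not just the grouping, pick the method by its ordering as well as by its timing.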