I have a pandas dataframe of values I read in from a csv file. I have a column labeled 'SleepQuality' with float values from 0.0 to 100.0. I want to create a new column 'SleepQualityGroup' that bins these values into integer groups.
That's essentially a binning operation, so two NumPy tools could be used here.
Using np.searchsorted -
import numpy as np

bins = np.arange(50, 100, 10)
df['SleepQualityGroup'] = bins.searchsorted(df.SleepQuality)
Using np.digitize -
df['SleepQualityGroup'] = np.digitize(df.SleepQuality, bins)
Sample output -
In [866]: df
Out[866]:
    SleepQuality  SleepQualityGroup
0           80.4                  4
1           90.1                  5
2           66.4                  2
3           50.3                  1
4           86.2                  4
5           75.4                  3
6           45.7                  0
7           91.5                  5
8           61.3                  2
9           54.0                  1
10          58.2                  1
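A self-contained sketch reproducing the above (the sample values are taken from the output shown here; your actual csv data will differ):

```python
import numpy as np
import pandas as pd

# Sample data mirroring the 'SleepQuality' column from the question
df = pd.DataFrame({'SleepQuality': [80.4, 90.1, 66.4, 50.3, 86.2,
                                    75.4, 45.7, 91.5, 61.3, 54.0, 58.2]})

# Bin edges 50, 60, 70, 80, 90 -> groups 0 (below 50) up to 5 (90 and above)
bins = np.arange(50, 100, 10)

# searchsorted returns, for each value, the index where it would be
# inserted into `bins` to keep it sorted -- i.e. its bin number
df['SleepQualityGroup'] = bins.searchsorted(df.SleepQuality)

# np.digitize produces identical results for these values
assert (np.digitize(df.SleepQuality, bins) == df.SleepQualityGroup).all()

print(df)
```

Note that the two differ only for values falling exactly on a bin edge (searchsorted's default `side='left'` vs digitize's default `right=False`), which doesn't arise in this sample.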
Runtime test -
In [922]: df = pd.concat([df]*10000,axis=0)
# @Dark's soln using pd.cut
In [923]: %timeit df['new'] = pd.cut(df['SleepQuality'], bins=[0, 50, 60, 70, 80, 90, 100], labels=[0, 1, 2, 3, 4, 5])
1000 loops, best of 3: 1.04 ms per loop
In [926]: %timeit df['SleepQualityGroup'] = bins.searchsorted(df.SleepQuality)
1000 loops, best of 3: 591 µs per loop
In [927]: %timeit df['SleepQualityGroup'] = np.digitize(df.SleepQuality, bins)
1000 loops, best of 3: 538 µs per loop
Use pd.cut
i.e.
df['new'] = pd.cut(df['SleepQuality'], bins=[0, 50, 60, 70, 80, 90, 100], labels=[0, 1, 2, 3, 4, 5])
Output:
    SleepQuality  SleepQualityGroup  new
0           80.4                  4    4
1           90.1                  5    5
2           66.4                  2    2
3           50.3                  1    1
4           86.2                  4    4
5           75.4                  3    3
6           45.7                  0    0
7           91.5                  5    5
8           61.3                  2    2
9           54.0                  1    1
10          58.2                  1    1
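One caveat: pd.cut with `labels` returns a Categorical column, not plain integers. A minimal sketch (sample values assumed for illustration) showing the cast if you need an int dtype:

```python
import pandas as pd

df = pd.DataFrame({'SleepQuality': [80.4, 90.1, 66.4, 50.3, 45.7]})

# Intervals are right-closed by default, so e.g. exactly 50.0
# would fall in the (0, 50] bin and get label 0
df['new'] = pd.cut(df['SleepQuality'],
                   bins=[0, 50, 60, 70, 80, 90, 100],
                   labels=[0, 1, 2, 3, 4, 5])

# The result is categorical; cast when plain integers are needed
df['new'] = df['new'].astype(int)
print(df)
```

This right-closed edge behavior is the opposite of the searchsorted approach above, which puts an exact edge value into the higher bin.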