Python Pandas Create New Bin/Bucket Variable with pd.qcut

☆樱花仙子☆ submitted on 2019-12-02 23:48:16

In Pandas 0.15.0 or newer, pd.qcut returns a Series, not a Categorical, if the input is a Series (as it is in your case). If you set labels=False, qcut returns the integer indicators of the bins instead of interval labels; for a Series input, the result is a Series of integers.

So to future-proof your code, you could use

data3['bins_spd'] = pd.qcut(data3['spd_pct'], 5, labels=False)
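As a minimal sketch of what labels=False produces, the snippet below bins a small made-up column into quintiles. The DataFrame contents are hypothetical stand-ins for data3['spd_pct'], not the original data.

```python
import pandas as pd

# Hypothetical stand-in for data3['spd_pct']: ten distinct spread percentages.
data3 = pd.DataFrame({
    "spd_pct": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10]
})

# labels=False makes qcut return integer bin indicators
# (0 is the lowest quintile, 4 the highest).
data3["bins_spd"] = pd.qcut(data3["spd_pct"], 5, labels=False)

print(data3)
```

With ten evenly spaced values and five quantile bins, each bin should receive two observations.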

Or pass a NumPy array to pd.qcut so that you get a Categorical as the return value. Note that the Categorical attribute labels is deprecated; use codes instead:

data3['bins_spd'] = pd.qcut(data3['spd_pct'].values, 5).codes
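The following sketch illustrates the array route on made-up values (the numbers are hypothetical). It also shows the .cat.codes accessor, an alternative not mentioned in the answer, which should give the same integers when the input is a Series.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for data3['spd_pct'].values.
values = np.array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10])

# An array input yields a Categorical; .codes holds the integer bin numbers.
cat = pd.qcut(values, 5)
codes = cat.codes

# A Series input yields a categorical Series; its codes live under .cat.
series_codes = pd.qcut(pd.Series(values), 5).cat.codes

print(type(cat).__name__)
print(list(codes))
print(series_codes.tolist())
```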

EDIT: The answer below is only valid for versions of Pandas earlier than 0.15.0. If you are running Pandas 0.15.0 or higher, use:

data3['bins_spd'] = pd.qcut(data3['spd_pct'], 5, labels=False)

Thanks to @unutbu for pointing it out. :)

Say you have some data that you want to bin (in my case, options spreads) and you want to make a new variable holding the bucket that each observation falls into. The link mentioned above shows that you can do this with:

print(pd.qcut(data3['spd_pct'], 40))

(0.087, 0.146]
(0.0548, 0.087]
(0.146, 0.5]
(0.146, 0.5]
(0.087, 0.146]
(0.0548, 0.087]
(0.5, 2]

which gives you the bin endpoints corresponding to each observation. However, if you would like the corresponding bin number for each observation, you can do this:

print(pd.qcut(data3['spd_pct'], 5).labels)

[2 1 3 ..., 0 1 4] 
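To see how the bin numbers relate to the bin endpoints, one option on newer Pandas versions is the retbins=True argument of pd.qcut, which also returns the array of bin edges. This is a sketch on hypothetical values, not the original data.

```python
import pandas as pd

# Hypothetical stand-in for data3['spd_pct'].
spd_pct = pd.Series([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10])

# retbins=True returns the binned result together with the bin edges.
intervals, edges = pd.qcut(spd_pct, 5, retbins=True)
bin_numbers = intervals.cat.codes

# Bin k spans (edges[k], edges[k + 1]]; the lowest bin also includes its left edge.
for k in range(len(edges) - 1):
    print(k, edges[k], edges[k + 1])

print(bin_numbers.tolist())
```

Five bins always come with six edges, so the bin number k indexes the pair edges[k], edges[k + 1].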

Putting it all together, if you would like to create a new variable with just the bin numbers, this should suffice:

data3['bins_spd'] = pd.qcut(data3['spd_pct'], 5).labels

print(data3.head())

   secid      date    symbol  symbol_flag     exdate   last_date cp_flag
0   5005  1/2/1997  099F2.37            0  1/18/1997         NaN       P
1   5005  1/2/1997  09B0B.1B            0  2/22/1997   12/3/1996       P   
2   5005  1/2/1997  09B7C.2F            0  2/22/1997  12/11/1996       P   
3   5005  1/2/1997  09EE6.6E            0  1/18/1997  12/27/1996       C   
4   5005  1/2/1997  09F2F.CE            0  8/16/1997         NaN       P   

   strike_price  best_bid  best_offer     ...      close  volume_y    return
0          7500     2.875      3.2500     ...        4.5     99200  0.074627
1         10000     5.375      5.7500     ...        4.5     99200  0.074627   
2          5000     0.625      0.8750     ...        4.5     99200  0.074627   
3          5000     0.125      0.1875     ...        4.5     99200  0.074627   
4          7500     3.000      3.3750     ...        4.5     99200  0.074627   

   cfadj_y  open  cfret  shrout      mid   spd_pct  bins_spd  
0        1   4.5      1   57735  3.06250  0.122449         2  
1        1   4.5      1   57735  5.56250  0.067416         1  
2        1   4.5      1   57735  0.75000  0.333333         3  
3        1   4.5      1   57735  0.15625  0.400000         3  
4        1   4.5      1   57735  3.18750  0.117647         2  

[5 rows x 35 columns]
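For a version that runs on Pandas 0.15.0 or newer, here is a rough end-to-end sketch using only the five best_bid and best_offer values printed above. The formulas for mid and spd_pct are an assumption inferred from those printed numbers, not taken from the original code. Because the quantiles are computed from five rows rather than the full data set, the resulting bin numbers will differ from the bins_spd column shown above.

```python
import pandas as pd

# The five best_bid / best_offer values from the printed head().
data3 = pd.DataFrame({
    "best_bid":   [2.875, 5.375, 0.625, 0.125, 3.000],
    "best_offer": [3.2500, 5.7500, 0.8750, 0.1875, 3.3750],
})

# Assumption: mid is the bid/offer midpoint and spd_pct is the spread
# divided by mid; this reproduces the mid and spd_pct columns shown above.
data3["mid"] = (data3["best_bid"] + data3["best_offer"]) / 2
data3["spd_pct"] = (data3["best_offer"] - data3["best_bid"]) / data3["mid"]

# Quintile bin numbers (0 = tightest spreads, 4 = widest).
data3["bins_spd"] = pd.qcut(data3["spd_pct"], 5, labels=False)

print(data3)
```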

Hope this helps somebody else. At the very least it should be easier to search for now. :)
