Question
Let's say I create some data and then bin it, so that the bins end up holding different numbers of observations:
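from __future__ import division  # only needed on Python 2
import numpy as np
import pandas as pd

x = np.random.rand(1, 20)
# x is 2-D with a single row, so unpack that one row of bin indices
new, = np.digitize(x, np.arange(1, x.shape[1] + 1) / 100)
new_series = pd.Series(new)
print(new_series.value_counts())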
This prints:
20 17
16 1
4 1
2 1
dtype: int64
With a minimum threshold of at least 2 observations per bin, I want to transform the underlying data so that under-filled bins are combined and new_series.value_counts() returns this:
20 17
16 3
dtype: int64
Answer 1:
EDITED:
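import numpy as np

x = np.random.rand(1, 100)
bins = np.arange(1, x.shape[1] + 1) / 100
new = np.digitize(x, bins)
n = new[0].copy()  # this will hold the result
threshold = 2
# Visit every bin index in ascending order, so that observations moved
# into bin i + 1 are checked again on the next iteration.
for i in range(bins.size):
    mask = n == i
    # A non-empty bin with fewer than `threshold` observations is moved up.
    if 0 < mask.sum() < threshold:
        n[mask] = i + 1  # i + 1 <= bins.size, so we never go past the last bin
n = n.reshape(1, -1)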
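The idea is to push the observations of an under-filled bin up into the next bin, repeatedly if necessary, until they land in a bin that holds enough observations. The last bin has nowhere to move to, so it can still end up below the threshold.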
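The same idea can be wrapped in a small function. This is only a sketch: the name merge_small_bins_up and its parameters are made up for illustration, and merging upward (rather than into the nearest neighbour) is an assumption. It uses np.bincount to keep the counts up to date instead of re-summing a mask for every bin.

```python
import numpy as np

def merge_small_bins_up(labels, n_bins, threshold):
    """Move observations from under-filled bins into the next higher bin.

    labels    : 1-D array of non-negative bin indices (e.g. from np.digitize)
    n_bins    : largest valid bin index (len(bins) for np.digitize)
    threshold : minimum number of observations a non-empty bin should hold
    """
    out = np.asarray(labels).copy()
    counts = np.bincount(out, minlength=n_bins + 1)
    for i in range(n_bins):  # the top bin has nowhere to move to
        if 0 < counts[i] < threshold:
            out[out == i] = i + 1
            counts[i + 1] += counts[i]
            counts[i] = 0
    return out

# Bin labels with the same counts as in the question: 2, 4, 16 once, 20 seventeen times
labels = np.array([2, 4, 16] + [20] * 17)
merged = merge_small_bins_up(labels, n_bins=20, threshold=2)
print(np.unique(merged, return_counts=True))
```

For these labels the result has 2 observations in bin 4 and 18 in bin 20. That satisfies the minimum of 2 per bin, but it is not the grouping shown in the question (3 in bin 16), because the question does not pin down which direction small bins should be merged in.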
It might be simpler to use np.histogram instead of np.digitize, because it returns the count per bin directly, so we don't need to sum the observations ourselves.
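A rough sketch of what the np.histogram route could look like. Note that np.histogram returns counts and bin edges rather than a label per observation, so this version merges bins by dropping edges. The helper name merge_edges, the sample data and the merging strategy (accumulate from the left, fold leftovers into the last bin) are assumptions, not part of the original answer. Also keep in mind that np.histogram treats the last bin as closed on the right and does not count values outside the edge range.

```python
import numpy as np

def merge_edges(x, edges, threshold):
    """Drop bin edges until every remaining bin holds >= threshold values."""
    counts, edges = np.histogram(x, bins=edges)
    keep = [edges[0]]
    running = 0
    for count, right in zip(counts, edges[1:]):
        running += count
        if running >= threshold:
            keep.append(right)  # close the current merged bin here
            running = 0
    # Fold whatever is left at the top into the last merged bin.
    if keep[-1] != edges[-1]:
        if len(keep) > 1:
            keep[-1] = edges[-1]
        else:
            keep.append(edges[-1])
    new_edges = np.array(keep)
    new_counts, _ = np.histogram(x, bins=new_edges)
    return new_edges, new_counts

# Small fixed sample so the result is reproducible
x = np.array([0.05, 0.15, 0.16, 0.35, 0.55, 0.56, 0.57, 0.95])
edges = np.linspace(0, 1, 11)  # ten equal-width bins on [0, 1]
new_edges, new_counts = merge_edges(x, edges, threshold=2)
print(new_edges, new_counts)
```

If per-observation labels are still needed afterwards, the merged edges can be passed back to np.digitize.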
Source: https://stackoverflow.com/questions/38591000/binning-and-then-combining-bins-with-minimum-number-of-observations