I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas or do I have to implement it m
A short, pythonic solution to balance a pandas DataFrame either by undersampling (uspl=True) or oversampling (uspl=False). The result is balanced on a specified column of that DataFrame, which may hold two or more distinct values.
For uspl=True, this code takes a random sample without replacement from each stratum, with sample size equal to the size of the smallest stratum. For uspl=False, it takes a random sample with replacement from each stratum, with sample size equal to the size of the largest stratum. In both cases the combined rows are shuffled before being returned.
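import pandas as pd

def balanced_spl_by(df, lblcol, uspl=True):
    # One sub-frame (stratum) per distinct value in the label column.
    datas_l = [df[df[lblcol] == l].copy() for l in list(set(df[lblcol].values))]
    lsz = [f.shape[0] for f in datas_l]
    # Sample every stratum to the same size, then shuffle the combined rows.
    return pd.concat([f.sample(n=(min(lsz) if uspl else max(lsz)), replace=(not uspl)).copy() for f in datas_l], axis=0).sample(frac=1)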
This will only work with a Pandas DataFrame, but that seems to be a common application, and restricting it to Pandas DataFrames significantly shortens the code as far as I can tell.
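The question asks for N balanced subsamples, while the function above returns a single one. As a rough sketch of how that could be extended, the snippet below draws several balanced samples in a loop. It swaps the list comprehension for `DataFrame.groupby(...).sample(...)`, which I believe needs pandas 1.1 or later. The helper name `n_balanced_subsamples`, its `n_samples` and `seed` parameters, and the toy data are all made up for illustration and are not part of the original answer.

```python
import pandas as pd

def n_balanced_subsamples(df, lblcol, n_samples, uspl=True, seed=0):
    """Hypothetical helper: return a list of n_samples balanced DataFrames."""
    sizes = df[lblcol].value_counts()
    # Smallest stratum size when undersampling, largest when oversampling.
    n = int(sizes.min() if uspl else sizes.max())
    out = []
    for i in range(n_samples):
        balanced = (
            df.groupby(lblcol)
              # Draw n rows per label; replacement only when oversampling.
              .sample(n=n, replace=not uspl, random_state=seed + i)
              # Shuffle so the labels are not grouped together.
              .sample(frac=1, random_state=seed + i)
        )
        out.append(balanced)
    return out

# Toy unbalanced data (assumed for the example): 7 rows of "a", 3 rows of "b".
df = pd.DataFrame({"x": range(10), "label": ["a"] * 7 + ["b"] * 3})
subsamples = n_balanced_subsamples(df, "label", n_samples=3)
oversampled = n_balanced_subsamples(df, "label", n_samples=2, uspl=False)
```

Using a different `random_state` per iteration keeps the draws reproducible but distinct from each other. Note that the subsamples are drawn independently, so the same row can appear in more than one of them.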