When using cut in a pandas dataframe to bin it, why is the binning not properly done?

Submitted by 为君一笑 on 2021-02-19 07:40:29

Question


I have a dataframe that I want to bin (i.e., group into sub-ranges) by one column, and take the mean of the second column for each of the bins:

import pandas as pd
import numpy as np

data = pd.DataFrame(columns=['Score', 'Age'])
data.Score = [1, 1, 1, 1, 0, 1, 2, 1, 0, 1, 1, 0, 2, 1, 1, 2, 1, 0, 1, 1, -1, 1, 0, 1, 1, 0, 1, 0, -2, 1]
data.Age = [29, 59, 44, 52, 60, 53, 45, 47, 57, 54, 35, 32, 48, 31, 49, 43, 67, 32, 31, 42, 37, 45, 52, 59, 56, 57, 48, 45, 56, 31]

_, bins = np.histogram(data.Age, 10)
labels = ['{}-{}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
labels[0] = '{}-{}'.format(bins[0], bins[1])
binned = pd.cut(data.Age, bins=bins, labels=labels, include_lowest=True, precision=0)
df = data.groupby(binned)['Score'].mean().reset_index()
df

There are 2 issues with this binning:

  1. there is a gap of 1 between the upper bound of the (n-1)th bin and the lower bound of the nth bin (which means the binning is not continuous, and data points that lie in this gap are skipped).
  2. the last few bin limits have a lot of digits after the decimal place. I have used the precision=0 flag in the cut, but it seems to be of no use - no matter what x I use in precision=x, it still produces the bins with the last few bins having a lot of digits after the decimal point.

The second point causes a problem when, for instance, I try to plot df, where it ruins the look of the x-axis:

import matplotlib.pyplot as plt
plt.plot([str(i) for i in df.Age], df.Score, 'o-')

Why is this occurring in spite of the precision=0 flag, which I set to indicate that I want only integers as the bin limits, not floats? And how do I fix it?


I'm temporarily solving this issue by converting the bin values to ints manually:

_, bins = np.histogram(data.Age, 10)
for i in range(len(bins)): # my fix
    bins[i] = int(bins[i])
labels = ['{}-{}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
labels[0] = '{}-{}'.format(bins[0], bins[1])
binned = pd.cut(data.Age, bins=bins, labels=labels, include_lowest=True, precision=0)
df = data.groupby(binned)['Score'].mean().reset_index()
df

But this feels like a hack, and I think it should have a "proper" solution instead of a hacky fix. And although it fixed the second issue, I'm not sure if this fixes the first issue.


Answer 1:


Both of the issues you mention come from a single line in your code:

labels = ['{}-{}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]

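The gap comes from the i + 1 in the label text. The bins themselves are continuous, so no data points are skipped; only the labels make it look that way. The extra digits come from the same line: '{}'.format prints each floating-point bin edge in full, including its rounding error. precision=0 does not help because it only affects labels that pandas generates itself, not labels you pass in explicitly.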
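To check the explanation above, here is a small sketch that calls pd.cut without custom labels and inspects the intervals pandas actually uses. It reuses the Age values from the question; the names ages, intervals, contiguous and n_unbinned are only illustrative. The last lines show the float formatting effect on a plain Python float rather than on the real bin edges.

```python
import numpy as np
import pandas as pd

# Age values copied from the question.
ages = pd.Series([29, 59, 44, 52, 60, 53, 45, 47, 57, 54, 35, 32, 48, 31, 49,
                  43, 67, 32, 31, 42, 37, 45, 52, 59, 56, 57, 48, 45, 56, 31])

_, bins = np.histogram(ages, 10)

# With no labels argument, pandas builds Interval categories from the bin edges.
binned = pd.cut(ages, bins=bins, include_lowest=True)
intervals = binned.cat.categories

# Each interval starts exactly where the previous one ends, so there is no gap.
contiguous = bool(
    (intervals.right[:-1].to_numpy() == intervals.left[1:].to_numpy()).all()
)
# Number of ages that did not land in any bin.
n_unbinned = int(binned.isna().sum())
print(contiguous, n_unbinned)

# '{}' prints a float in full; a format spec rounds it for display only.
x = 0.1 + 0.2
full = '{}'.format(x)   # '0.30000000000000004'
short = f'{x:.1f}'      # '0.3'
print(full, short)
```

If contiguous is True and n_unbinned is 0, every age has been assigned to a bin, and the apparent gap exists only in the hand-written label strings.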

Therefore, modify it to

labels = [f'{i:.1f}-{j:.1f}' for i, j in zip(bins[:-1], bins[1:])]

which rounds each bin edge to one decimal place in the label.

With this change, there is also no need for labels[0] = '{}-{}'.format(bins[0], bins[1])
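Putting the pieces together, the question's code with this change might look like the sketch below. The data is the question's own. Building the DataFrame from a dict, naming the loop variables lo and hi, and passing observed=False to groupby (so that any empty bins are kept) are choices made here for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Score': [1, 1, 1, 1, 0, 1, 2, 1, 0, 1, 1, 0, 2, 1, 1,
              2, 1, 0, 1, 1, -1, 1, 0, 1, 1, 0, 1, 0, -2, 1],
    'Age': [29, 59, 44, 52, 60, 53, 45, 47, 57, 54, 35, 32, 48, 31, 49,
            43, 67, 32, 31, 42, 37, 45, 52, 59, 56, 57, 48, 45, 56, 31],
})

# Ten equal-width bins spanning the observed age range.
_, bins = np.histogram(data['Age'], 10)

# Label each bin with its real edges, rounded to one decimal for display.
labels = [f'{lo:.1f}-{hi:.1f}' for lo, hi in zip(bins[:-1], bins[1:])]

# precision is omitted: it has no effect when labels are given explicitly.
binned = pd.cut(data['Age'], bins=bins, labels=labels, include_lowest=True)

# Mean score per age bin; observed=False keeps bins that contain no rows.
df = data.groupby(binned, observed=False)['Score'].mean().reset_index()
print(df)
```

The labels now run from '29.0-32.8' to '63.2-67.0', with each label's upper edge matching the next label's lower edge, so the x-axis of the plot stays readable.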



Source: https://stackoverflow.com/questions/51777825/when-using-cut-in-a-pandas-dataframe-to-bin-it-why-is-the-binning-not-properly
