pandas Series.value_counts returns inconsistent order for equal count strings

When I run the code below:

import pandas
s = pandas.Series(['c', 'a', 'b', 'a', 'b'])
print(s.value_counts())

Sometimes I get this:

a    2
b    2
c    1
dtype: int64

And sometimes I get this:

b    2
a    2
c    1
dtype: int64

i.e. the index order returned for items with equal counts is not the same between runs. I couldn't reproduce this when the Series values are integers instead of strings.

Why does this happen, and what is the most efficient way to get the same index order every time?

I still want the result sorted by count in descending order, but with a consistent order among items that have equal counts.

I'm running Python 3.7.0 and pandas 0.23.4

You have a few options to sort consistently given a series:

import numpy as np
import pandas as pd
s = pd.Series(['a', 'b', 'a', 'c', 'c'])
c = s.value_counts()

sort by index

Use pd.Series.sort_index:

res = c.sort_index()

a    2
b    1
c    2
dtype: int64

sort by count (arbitrary for ties)

For descending counts, do nothing, as this is the default. For ascending counts, you can use pd.Series.sort_values, which defaults to ascending=True. In either case, you should make no assumptions about how ties are ordered.

res = c.sort_values()

b    1
c    2
a    2
dtype: int64

More efficiently, you can use c.iloc[::-1] to reverse the order.
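
As a small sketch of the reversal just mentioned (the sample data is only illustrative): value_counts already returns counts in descending order, so reversing the result with iloc[::-1] gives ascending counts without sorting again. The order among ties is still whatever value_counts produced.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'c'])
c = s.value_counts()        # descending by count (the default)

# Reverse the already-sorted result instead of sorting a second time.
ascending = c.iloc[::-1]

# Print only the counts; the order of the tied labels is not guaranteed.
print(ascending.tolist())
```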

sort by count and then by index

You can use numpy.lexsort to sort by count and then by index. Note the order of the keys: lexsort treats the last key as the primary one, so -c.values (descending counts) is applied first and c.index only breaks ties.

res = c.iloc[np.lexsort((c.index, -c.values))]

a    2
c    2
b    1
dtype: int64
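
The lexsort approach can be wrapped in a small helper, sketched here on the question's data. The function name value_counts_stable is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

def value_counts_stable(s):
    """Hypothetical helper: counts descending, ties ordered by index label."""
    c = s.value_counts()
    # np.lexsort uses the LAST key as the primary key, so the negated
    # counts go last and the index labels only break ties.
    order = np.lexsort((c.index, -c.values))
    return c.iloc[order]

res = value_counts_stable(pd.Series(['c', 'a', 'b', 'a', 'b']))
print(res)
```

Because both keys are fully specified, the result should not depend on the order in which value_counts happens to return tied items.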

Add a reindex after value_counts, which orders the result by first appearance in the data (df here is a Series):

df.value_counts().reindex(df.unique())
Out[353]: 
a    1
b    1
dtype: int64
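
To make the difference from sort_index concrete, here is a minimal sketch on the question's data (variable names are illustrative). reindex with unique() follows the order of first appearance, while sort_index follows the labels. Both are deterministic, but neither keeps the result sorted by count.

```python
import pandas as pd

s = pd.Series(['c', 'a', 'b', 'a', 'b'])

# Order of first appearance in the data: c, a, b
by_first_seen = s.value_counts().reindex(s.unique())

# Order of the labels themselves: a, b, c
by_label = s.value_counts().sort_index()

print(by_first_seen)
print(by_label)
```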

Update

s.value_counts().sort_index().sort_values()
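
Note that this chained call sorts counts in ascending order, and sort_values defaults to kind='quicksort', which the pandas documentation does not list as a stable algorithm. A hedged variant, assuming ascending counts are acceptable, is to request a stable sort explicitly so the label order set by sort_index is kept among ties:

```python
import pandas as pd

s = pd.Series(['c', 'a', 'b', 'a', 'b'])

# sort_index fixes a deterministic starting order (a, b, c); the stable
# mergesort then preserves that order among equal counts.
res = s.value_counts().sort_index().sort_values(kind='mergesort')
print(res)
```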

You could use sort_index:

print(df.value_counts().sort_index())

Output:

a    1
b    1
dtype: int64

See the sort_index documentation if you want to use parameters such as ascending.

sort_index and reindex(df.unique()) (as suggested by @Wen) seem to perform quite similarly:

df.value_counts().sort_index():         1000 loops, best of 3: 636 µs per loop
df.value_counts().reindex(df.unique()): 1000 loops, best of 3: 880 µs per loop
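
A rough way to repeat this comparison outside IPython is sketched below with the standard timeit module. The Series size and the number of repetitions are arbitrary assumptions, and the absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Illustrative data: the question's values repeated to get a larger Series.
s = pd.Series(['c', 'a', 'b', 'a', 'b'] * 1000)
runs = 100

t_sort = timeit.timeit(lambda: s.value_counts().sort_index(), number=runs)
t_reindex = timeit.timeit(lambda: s.value_counts().reindex(s.unique()), number=runs)

print("sort_index: {:.0f} µs per call".format(t_sort / runs * 1e6))
print("reindex:    {:.0f} µs per call".format(t_reindex / runs * 1e6))
```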

Source: https://stackoverflow.com/questions/51933763/pandas-series-value-counts-returns-inconsistent-order-for-equal-count-strings

Tags

python

pandas

sorting

numpy

counter