How to get the number of the most frequent value in a column?

Question

I have a data frame and I would like to know how many times the most frequent value occurs in a given column.

I tried to do it in the following way:

items_counts = df['item'].value_counts()
max_item = items_counts.max()

As a result I get:

ValueError: cannot convert float NaN to integer

As far as I understand, the first line gives me a Series in which the values from the column are used as keys and the frequencies of these values are used as values. So I just need to find the largest value in the Series, but for some reason this does not work. Does anybody know how this problem can be solved?

Answer 1:

It looks like you may have some nulls in the column. You can drop them with df = df.dropna(subset=['item']). Then df['item'].value_counts().max() should give you the maximum count, and df['item'].value_counts().idxmax() should give you the most frequent value.
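A minimal sketch of this approach on a small made-up frame (the column contents below are assumptions for illustration only, not the asker's data):

```python
import numpy as np
import pandas as pd

# Hypothetical data: "item" contains two missing values.
df = pd.DataFrame({"item": ["a", "b", "a", np.nan, "c", "a", np.nan]})

# Drop the rows where "item" is null, then count each remaining value.
clean = df.dropna(subset=["item"])
counts = clean["item"].value_counts()

most_frequent_value = counts.idxmax()  # label with the highest count
max_count = int(counts.max())          # how many times that label occurs

print(most_frequent_value, max_count)  # prints: a 3
```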

Answer 2:

To follow up on @jonathanrocher's answer, you could use mode on a pandas DataFrame. It gives the most frequent value or values (there can be more than one in case of a tie) along the rows or columns:

import pandas as pd
import numpy as np
df = pd.DataFrame({"a": [1,2,2,4,2], "b": [np.nan, np.nan, np.nan, 3, 3]})

In [2]: df.mode()
Out[2]: 
   a    b
0  2  3.0
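Note that mode returns only the values, not how often they occur, and it returns one entry per tied value. The following is a small sketch with made-up data that shows a tie and one possible way to recover the count; the Series contents and variable names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical column in which 1 and 2 both occur twice.
s = pd.Series([1, 1, 2, 2, 3], name="item")

# All values tied for most frequent, in sorted order.
modes = s.mode()

# Count the occurrences of one of the modes (all modes share this count).
count = int((s == modes.iloc[0]).sum())

print(modes.tolist(), count)  # prints: [1, 2] 2
```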

Answer 3:

You may also consider using SciPy's mode function, which can ignore NaN. A solution using it could look like:

import pandas as pd
from numpy import nan
from scipy.stats import mode

df = pd.DataFrame({"a": [1, 2, 2, 4, 2], "b": [nan, nan, nan, 3, 3]})
# Recent SciPy versions skip NaN only when nan_policy="omit" is passed
print(mode(df.to_numpy(), nan_policy="omit"))

The output would look like this (the exact formatting depends on the SciPy version):

(array([[ 2.,  3.]]), array([[ 3.,  2.]]))

meaning that the most common values are 2 for the first column and 3 for the second, with frequencies 3 and 2 respectively.

Answer 4:

Just take the first row of your items_counts series:

top = items_counts.head(1)  # or items_counts.iloc[[0]]
value, count = top.index[0], top.iat[0]

This works because pd.Series.value_counts uses sort=True by default, so the result is already ordered by count, highest first. Extracting a value from an index by position has O(1) complexity, while pd.Series.idxmax has O(n) complexity, where n is the number of categories.

You can still specify sort=False, in which case idxmax is recommended:

items_counts = df['item'].value_counts(sort=False)
top = items_counts.loc[[items_counts.idxmax()]]
value, count = top.index[0], top.iat[0]

Notice that in this case you don't need to call max and idxmax separately; just get the label via idxmax and pass it to the loc label-based indexer.
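To see that both variants agree, here is a short sketch on made-up data (the frame contents and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"item": ["x", "y", "y", "z", "y", "x"]})

# Default sort=True: rows are ordered by count, so the first row is the top one.
sorted_counts = df["item"].value_counts()
value_a, count_a = sorted_counts.index[0], int(sorted_counts.iat[0])

# sort=False: rows are not ordered by count, so locate the maximum explicitly.
unsorted_counts = df["item"].value_counts(sort=False)
value_b = unsorted_counts.idxmax()
count_b = int(unsorted_counts.loc[value_b])

print((value_a, count_a), (value_b, count_b))  # prints: ('y', 3) ('y', 3)
```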

Answer 5:

Add this line of code to find the count of the most frequent value:

df["item"].value_counts().nlargest(n=1).values[0]
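The expression above returns the count, not the value itself. As a hedged sketch on made-up data (the frame contents are assumed for illustration), the one-row result of nlargest carries both pieces:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"item": ["x", "y", "y", "z", "y", "x"]})

# One-row Series: the index holds the value, the data holds its count.
top = df["item"].value_counts().nlargest(n=1)

count = int(top.values[0])  # count of the most frequent value
value = top.index[0]        # the most frequent value itself

print(value, count)  # prints: y 3
```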

Source: https://stackoverflow.com/questions/15138973/how-to-get-the-number-of-the-most-frequent-value-in-a-column

Tags

python

pandas

counter

frequency

series