Question
I have a large csv file, which is a log of caller data.
A short snippet of my file:
CompanyName High Priority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
I want to sort the entire list by the frequency of occurrence of customers so it will be like:
CompanyName High Priority QualityIssue
Customer3 No Equipment
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer1 Yes User
Customer1 Yes User
Customer1 No Neither
Customer2 No User
Customer4 No User
I've tried groupby, but that only prints out the CompanyName and the frequency, not the other columns. I also tried
df['Totals']= [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]
and
df = [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]
But these give me errors: ValueError: Wrong number of items passed 1, indices imply 24
I've looked at something like this:
for key, value in sorted(mydict.iteritems(), key=lambda (k,v): (v,k)):
print "%s: %s" % (key, value)
but this only prints out two columns, and I want to sort my entire csv. My output should be my entire csv sorted by the first column.
Thanks for the help in advance!
Answer 1:
This seems to do what you want: basically, add a count column by performing a groupby and transform with value_counts, and then sort on that column:
In [22]:
df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
df.sort('count', ascending=False)
Out[22]:
CompanyName HighPriority QualityIssue count
5 Customer3 No User 4
3 Customer3 No Equipment 4
7 Customer3 Yes Equipment 4
6 Customer3 Yes User 4
0 Customer1 Yes User 3
4 Customer1 No Neither 3
1 Customer1 Yes User 3
8 Customer4 No User 1
2 Customer2 No User 1
You can drop the extraneous column using df.drop:
In [24]:
df.drop('count', axis=1)
Out[24]:
CompanyName HighPriority QualityIssue
5 Customer3 No User
3 Customer3 No Equipment
7 Customer3 Yes Equipment
6 Customer3 Yes User
0 Customer1 Yes User
4 Customer1 No Neither
1 Customer1 Yes User
8 Customer4 No User
2 Customer2 No User
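On a recent pandas (1.1 or later, which added the key argument to sort_values), you can also get the same ordering without creating and dropping a helper column. This is only a sketch of that idea, not part of the original answer:
# Assumes pandas >= 1.1 (sort_values accepts a key argument).
# Map each CompanyName to its frequency and sort by that, descending.
df_sorted = df.sort_values(
    'CompanyName',
    key=lambda s: s.map(s.value_counts()),
    ascending=False,
)
Rows of companies that share the same count keep no particular order with the default quicksort; pass kind='mergesort' (a stable sort) if you want the original row order preserved within ties.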
Answer 2:
The top-voted answer needs a minor addition: sort was deprecated in favour of sort_values and sort_index. sort_values will work like this:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
df['count'] = df.groupby('a')['a'].transform(pd.Series.value_counts)
df.sort_values('count', inplace=True, ascending=False)
print('df sorted: \n{}'.format(df))
df sorted:
   a  b  count
0  1  1      2
2  1  3      2
1  2  2      1
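For reference, the same count column can be produced with the aggregation name 'count' instead of pd.Series.value_counts. A minimal sketch, assuming there are no missing values in the grouping column (with NaNs the two differ, since 'count' only counts non-null entries per group):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
# transform('count') broadcasts each group's size back onto its rows,
# giving the same 'count' column as the value_counts version above.
df['count'] = df.groupby('a')['a'].transform('count')
df = df.sort_values('count', ascending=False)
print(df)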
Answer 3:
I think there must be a better way to do it, but this should work:
Preparing the data:
data = """
CompanyName HighPriority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
"""
import io
import pandas as pd

df = pd.read_table(io.StringIO(data), sep=r"\s+")
And doing the transformation:
# create a (sorted) data frame that lists the customers with their number of occurrences
count_df = pd.DataFrame(df.CompanyName.value_counts())
# join the count data frame back with the original data frame
new_index = count_df.merge(df[["CompanyName"]], left_index=True, right_on="CompanyName")
# output the original data frame in the order of the new index.
df.reindex(new_index.index)
The output:
CompanyName HighPriority QualityIssue
3 Customer3 No Equipment
5 Customer3 No User
6 Customer3 Yes User
7 Customer3 Yes Equipment
0 Customer1 Yes User
1 Customer1 Yes User
4 Customer1 No Neither
8 Customer4 No User
2 Customer2 No User
It's probably not intuitive what happens here, but at the moment I cannot think of a better way to do it. I tried to comment as much as possible.
The tricky part here is that the index of count_df is the (unique) customer names. Therefore, I join the index of count_df (left_index=True) with the CompanyName column of df (right_on="CompanyName").
The magic here is that count_df is already sorted by the number of occurrences, which is why we need no explicit sorting. So all we have to do is reorder the rows of the original data frame by the rows of the joined data frame, and we get the expected result.
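If you prefer to avoid the merge altogether, a rough sketch of a shorter variant (not from the original answer) is to map each CompanyName to its frequency and reorder df by the sorted result:
# Map each company to its frequency, sort those frequencies with a stable
# sort (so the original row order within each company is kept), and
# reorder df by the resulting index.
freq = df['CompanyName'].map(df['CompanyName'].value_counts())
df_sorted = df.loc[freq.sort_values(ascending=False, kind='mergesort').index]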
Source: https://stackoverflow.com/questions/30787391/sorting-entire-csv-by-frequency-of-occurence-in-one-column