Question
I have a CSV file with the following contents in 3 columns and 11 rows, the first row being a header. I created this myself to have a simple file to learn from. Each line item is a single order of fruit.
OrderNo Fruit Origin
1 Apple NY
2 Orange FL
3 Banana CA
4 Pear NJ
5 Grapes VA
6 Grapes VA
7 Grapes MD
8 Grapes MA
9 Pineapple HI
10 Grapes GA
I am trying to parse this data in Python, to do the following:
(1) determine which state generates the most orders for each type of fruit, (2) determine the highest number of orders from any single state for each fruit, and (3) output the result in alphabetical order, like so:
Apple NY 1
Banana CA 1
Grapes VA 2
Orange FL 1
Pear NJ 1
Pineapple HI 1
After reading the csv file with csv.reader, I was trying to accomplish the counting with Counter and for loops:
import csv
from collections import Counter
cnt = Counter()
f = open("/test.csv")
reader = csv.reader(f, delimiter=",")
header = next(reader)  # skip the header row
for row in reader:
    cnt[row[2]] += 1  # count orders per state (third column)
But is there a better way?
Answer 1:
I'd actually use pandas, which combines features of a list, a dictionary, a spreadsheet, and a database. It is designed specifically for manipulating data in this way.
import pandas as pd
from collections import defaultdict
path_to_file = "/test.csv"
df = pd.read_csv(path_to_file)
groups = df.groupby(['Fruit', 'Origin'])
# First pass through the groups: store the maximum count for each fruit, to handle ties
max_for_fruit = defaultdict(int)
for g in groups:
    fruit, count = g[0][0], len(g[1])
    max_for_fruit[fruit] = max(max_for_fruit[fruit], count)
# Second pass: print every state whose count matches the fruit's maximum
for g in groups:
    fruit, state, count = g[0][0], g[0][1], len(g[1])
    if count == max_for_fruit[fruit]:
        print("{} {} {}".format(fruit, state, count))
And here is the output.
Apple NY 1
Banana CA 1
Grapes VA 2
Orange FL 1
Pear NJ 1
Pineapple HI 1
http://pandas.pydata.org/pandas-docs/stable/groupby.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html
http://pandas.pydata.org/pandas-docs/stable/tutorials.html
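The two-pass loop above can probably also be written without explicit iteration. The following is only a sketch of that idea, not part of the original answer: it counts each (Fruit, Origin) pair with size(), then keeps the rows whose count equals the maximum for that fruit, so ties are still reported. The names SAMPLE_CSV and top_states are made up for this illustration, and the sample data is inlined as comma-separated text so the snippet runs on its own.

```python
import io

import pandas as pd

# Inline copy of the sample data (assumed comma-separated);
# with a real file, pass its path to pd.read_csv instead.
SAMPLE_CSV = """OrderNo,Fruit,Origin
1,Apple,NY
2,Orange,FL
3,Banana,CA
4,Pear,NJ
5,Grapes,VA
6,Grapes,VA
7,Grapes,MD
8,Grapes,MA
9,Pineapple,HI
10,Grapes,GA
"""


def top_states(df):
    """Return the (Fruit, Origin, Count) rows whose count is the fruit's maximum."""
    # One row per (Fruit, Origin) pair with the number of orders.
    counts = df.groupby(["Fruit", "Origin"]).size().reset_index(name="Count")
    # For every row, the largest count seen for that row's fruit.
    best = counts.groupby("Fruit")["Count"].transform("max")
    # Keep every row that ties for the maximum, in alphabetical order.
    return counts[counts["Count"] == best].sort_values(["Fruit", "Origin"])


result = top_states(pd.read_csv(io.StringIO(SAMPLE_CSV)))
for fruit, state, count in result.itertuples(index=False):
    print(fruit, state, count)
```

On the sample data this should print the same six lines as the loop version.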
Answer 2:
Your approach is fine and will work great, but you will need two levels of nesting (by fruit, then counts by state). One other little improvement is to use named tuples for clarity:
import csv
from collections import Counter, namedtuple, defaultdict
# "data" is the open CSV file; the path is the one used in the question
with open("/test.csv", newline="") as data:
    reader = csv.reader(data)
    # Lowercase the header names so they match the attribute access below
    Order = namedtuple('Order', [name.lower() for name in next(reader)])
    state_orders = defaultdict(Counter)
    for order in map(Order._make, reader):
        state_orders[order.fruit][order.origin] += 1
for fruit, counts_by_state in sorted(state_orders.items()):
    state, cnt = counts_by_state.most_common(1)[0]
    print('%s is ordered most by %s with %s orders' % (fruit, state, cnt))
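One caveat with the loop above: most_common(1) returns a single state per fruit, so if two states tie for the most orders, only one of them is reported. A minimal sketch of a tie-aware variant follows, using only the standard library. The helper name top_states and the inlined SAMPLE_CSV are assumptions made for this illustration, and csv.DictReader is swapped in for the named tuple to keep it short.

```python
import csv
import io
from collections import Counter, defaultdict

# Inline copy of the sample data (assumed comma-separated).
SAMPLE_CSV = """OrderNo,Fruit,Origin
1,Apple,NY
2,Orange,FL
3,Banana,CA
4,Pear,NJ
5,Grapes,VA
6,Grapes,VA
7,Grapes,MD
8,Grapes,MA
9,Pineapple,HI
10,Grapes,GA
"""


def top_states(lines):
    """Return (fruit, state, count) for every state tied for a fruit's maximum."""
    # Two levels of nesting: fruit -> Counter of orders per state.
    state_orders = defaultdict(Counter)
    for row in csv.DictReader(lines):
        state_orders[row["Fruit"]][row["Origin"]] += 1

    results = []
    for fruit, counts in sorted(state_orders.items()):
        best = max(counts.values())
        # Report every state that reaches the maximum, not just the first one.
        for state in sorted(s for s, n in counts.items() if n == best):
            results.append((fruit, state, best))
    return results


for fruit, state, cnt in top_states(io.StringIO(SAMPLE_CSV)):
    print(fruit, state, cnt)
```

With a real file, an open file object can be passed in place of the io.StringIO wrapper.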
While dicts and counters handle this kind of problem easily, you're probably better off using the sqlite3 module. SQL was born to solve these kinds of problems:
import csv
import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE Orders (order_no integer, fruit text, origin text)')
# "data" is the open CSV file; the path is the one used in the question
with open("/test.csv", newline="") as data:
    reader = csv.reader(data)
    header = next(reader)  # skip the header row
    c.executemany('INSERT INTO Orders VALUES (?,?,?)', reader)
# Number of orders for each (fruit, origin) pair
c.execute('''CREATE VIEW StateOrders AS
                 SELECT fruit, origin, COUNT(*) AS cnt
                 FROM Orders GROUP BY fruit, origin''')
# Keep the rows whose count equals the maximum for that fruit (ties included)
for fruit, state, cnt in c.execute('''
        SELECT fruit, origin, cnt
        FROM StateOrders AS so
        WHERE cnt = (SELECT MAX(cnt) FROM StateOrders WHERE fruit = so.fruit)
        ORDER BY fruit'''):
    print('%s is ordered most by %s with %s orders' % (fruit, state, cnt))
Source: https://stackoverflow.com/questions/25837877/python-how-best-to-parse-csv-and-count-values-for-only-a-subset