Python: How best to parse csv and count values for only a subset

醉酒当歌 submitted on 2021-02-08 09:30:31

Question


I have a CSV file with the following contents in 3 columns and 11 rows, the first row being a header. I created this myself to have a simple file to learn from. Each line item is a single order of fruit.

OrderNo      Fruit     Origin
1           Apple        NY
2           Orange       FL      
3           Banana       CA
4           Pear         NJ
5           Grapes       VA
6           Grapes       VA
7           Grapes       MD
8           Grapes       MA
9           Pineapple    HI
10          Grapes       GA

I am trying to parse this data in Python, to do the following:

(1) find, for each type of fruit, the state that generates the most orders, (2) find how many orders that top state placed for the fruit, and (3) output the result in alphabetical order by fruit, like so:

Apple NY 1
Banana CA 1
Grapes VA 2
Orange FL 1
Pear NJ 1
Pineapple HI 1

After reading the csv file with csv.reader, I was trying to accomplish the counting with Counter and for loops:

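import csv
from collections import Counter

cnt = Counter()
with open("/test.csv", newline="") as f:
    reader = csv.reader(f, delimiter=",")
    header = next(reader)  # skip the header row via the reader, not the raw file

    for row in reader:
        cnt[row[2]] += 1  # counts orders per state only, ignoring the fruit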

But is there a better way?


Answer 1:


I'd use pandas, which combines features of lists, dictionaries, spreadsheets, and databases. It is designed for exactly this kind of data manipulation.

import pandas as pd
from collections import defaultdict

path_to_file = "/test.csv"
df = pd.read_csv(path_to_file)

groups = df.groupby(['Fruit', 'Origin'])

# First pass: store the maximum count for each fruit so that ties can be handled
max_for_fruit = defaultdict(int)
for (fruit, state), rows in groups:
    max_for_fruit[fruit] = max(max_for_fruit[fruit], len(rows))

# Second pass: print every state whose count equals that fruit's maximum
for (fruit, state), rows in groups:
    count = len(rows)
    if count == max_for_fruit[fruit]:
        print("{} {} {}".format(fruit, state, count))

And here is the output.

Apple NY 1
Banana CA 1
Grapes VA 2
Orange FL 1
Pear NJ 1
Pineapple HI 1
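
The two passes can also be written without explicit loops. The following is only a sketch, not part of the original answer: it assumes the file is comma-separated with the header shown in the question, and the names CSV_TEXT, top_states and the Orders column are made up for illustration.

```python
import io
import pandas as pd

# Sample data copied from the question, written as comma-separated text.
CSV_TEXT = """OrderNo,Fruit,Origin
1,Apple,NY
2,Orange,FL
3,Banana,CA
4,Pear,NJ
5,Grapes,VA
6,Grapes,VA
7,Grapes,MD
8,Grapes,MA
9,Pineapple,HI
10,Grapes,GA
"""

def top_states(df):
    """Return the (Fruit, Origin) rows whose count equals that fruit's maximum."""
    # One row per (Fruit, Origin) pair with its number of orders
    counts = df.groupby(["Fruit", "Origin"]).size().reset_index(name="Orders")
    # transform("max") broadcasts each fruit's largest count back onto its rows
    best = counts.groupby("Fruit")["Orders"].transform("max")
    return counts[counts["Orders"] == best].reset_index(drop=True)

df = pd.read_csv(io.StringIO(CSV_TEXT))
result = top_states(df)
for fruit, origin, orders in result.itertuples(index=False):
    print(fruit, origin, orders)
```

Because groupby sorts its keys by default, the rows come out in alphabetical order by fruit, and states tied for the maximum are all kept.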

http://pandas.pydata.org/pandas-docs/stable/groupby.html

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

http://pandas.pydata.org/pandas-docs/stable/tutorials.html




Answer 2:


Your approach is fine and will work great, but you will need two levels of nesting (by fruit, then counts by state). One other small improvement is to use named tuples for clarity:

import csv
from collections import Counter, namedtuple, defaultdict

# `data` is any iterable of CSV lines, such as an open file object
reader = csv.reader(data)
# Build the record type from the header row (OrderNo, Fruit, Origin)
Order = namedtuple('Order', next(reader))

# Outer key: fruit; inner Counter: number of orders per state
state_orders = defaultdict(Counter)
for order in map(Order._make, reader):
    state_orders[order.Fruit][order.Origin] += 1

for fruit, counts_by_state in sorted(state_orders.items()):
    state, cnt = counts_by_state.most_common(1)[0]
    print('%s is ordered most by %s with %s orders' % (fruit, state, cnt))
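
Note that most_common(1) reports only one state per fruit, even when several states share the top count. If ties matter, something along these lines might work. It is a minimal sketch rather than the answer's own method: it assumes a comma-separated file with the question's header, and SAMPLE and top_states are hypothetical names.

```python
import csv
import io
from collections import Counter, defaultdict

# Sample data copied from the question, written as comma-separated text.
SAMPLE = """OrderNo,Fruit,Origin
1,Apple,NY
2,Orange,FL
3,Banana,CA
4,Pear,NJ
5,Grapes,VA
6,Grapes,VA
7,Grapes,MD
8,Grapes,MA
9,Pineapple,HI
10,Grapes,GA
"""

def top_states(lines):
    """Return sorted (fruit, state, count) tuples, keeping every state tied for the maximum."""
    state_orders = defaultdict(Counter)
    # DictReader reads the header row itself and keys each row by column name
    for row in csv.DictReader(lines):
        state_orders[row["Fruit"]][row["Origin"]] += 1

    results = []
    for fruit, counts in sorted(state_orders.items()):
        best = max(counts.values())
        # Keep every state that reaches the maximum, in alphabetical order
        for state in sorted(s for s, n in counts.items() if n == best):
            results.append((fruit, state, best))
    return results

for fruit, state, count in top_states(io.StringIO(SAMPLE)):
    print(fruit, state, count)
```

Passing any iterable of lines (an open file, a StringIO, or a list of strings) keeps the function easy to try out without touching the disk.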

While dicts and counters handle this kind of problem easily, you're probably better off using the sqlite3 module. SQL was born to solve these kinds of problems:

import csv
import sqlite3

# `data` is any iterable of CSV lines, such as an open file object
reader = csv.reader(data)
header = next(reader)  # skip the header row

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE Orders (order_no integer, fruit text, origin text)')
c.executemany('INSERT INTO Orders VALUES (?,?,?)', reader)
# Number of orders for each (fruit, state) pair
c.execute('''CREATE VIEW StateOrders AS
             SELECT fruit, origin, COUNT(*) as cnt
             FROM Orders GROUP BY fruit, origin ''')

# Keep only the rows whose count equals the maximum for that fruit
for fruit, state, cnt in c.execute('''
    SELECT fruit, origin, cnt
    FROM StateOrders AS Outer
    WHERE cnt = (SELECT MAX(cnt) FROM StateOrders WHERE Outer.fruit = fruit)
    ORDER BY FRUIT '''):
    print('%s is ordered most by %s with %s orders' % (fruit, state, cnt))
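
For experimenting, the same idea can be wrapped in a function that takes the CSV lines directly. This is only a sketch under a few assumptions: the input is comma-separated with a header row, the SQLite build is recent enough to support WITH clauses, and the names top_states_sql, orders, state_orders and max_cnt are made up for illustration. It joins against a per-fruit maximum instead of using the correlated subquery above.

```python
import csv
import sqlite3

def top_states_sql(lines):
    """Load CSV lines into an in-memory SQLite table and return the top state(s) per fruit."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (order_no INTEGER, fruit TEXT, origin TEXT)")
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", reader)
        query = """
            WITH state_orders AS (
                SELECT fruit, origin, COUNT(*) AS cnt
                FROM orders GROUP BY fruit, origin
            )
            SELECT s.fruit, s.origin, s.cnt
            FROM state_orders AS s
            JOIN (SELECT fruit, MAX(cnt) AS max_cnt
                  FROM state_orders GROUP BY fruit) AS m
              ON m.fruit = s.fruit AND m.max_cnt = s.cnt
            ORDER BY s.fruit, s.origin
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()

# A few rows from the question's data, as comma-separated lines
lines = ["OrderNo,Fruit,Origin", "5,Grapes,VA", "6,Grapes,VA", "7,Grapes,MD", "1,Apple,NY"]
for fruit, state, cnt in top_states_sql(lines):
    print(fruit, state, cnt)
```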


Source: https://stackoverflow.com/questions/25837877/python-how-best-to-parse-csv-and-count-values-for-only-a-subset
