Pandas read csv out of memory

迷失自我 2020-12-10 15:45

I am trying to manipulate a large CSV file using Pandas. When I write this

df = pd.read_csv(strFileName, sep='\t', delimiter='\t')

it raises an out-of-memory error.

3 Answers
  • 2020-12-10 16:15

    Based on your snippet in "out of memory error when reading csv file in chunk", here is how to do the same aggregation when reading the file line by line.

    I assume that kb_2 is the error indicator:

    groups = {}
    with open("data/petaJoined.csv", "r") as large_file:
        for line in large_file:
            arr = line.split('\t')
            # assuming this structure: ka,kb_1,kb_2,timeofEvent,timeInterval
            k = arr[0] + ',' + arr[1]  # group key: ka,kb_1
            if k not in groups:
                groups[k] = {'record_count': 0, 'error_sum': 0}
            groups[k]['record_count'] += 1
            groups[k]['error_sum'] += float(arr[2])

    # compute the error rate per group once the whole file has been read
    for k, v in groups.items():
        print('{group}: {error_rate}'.format(group=k, error_rate=v['error_sum'] / v['record_count']))
    

    This code snippet stores all the groups in a dictionary, and calculates the error rate after reading the entire file.

    It will still run into an out-of-memory error if there are too many distinct combinations of groups.

  • 2020-12-10 16:18

    You haven't stated what your intended aggregation would be, but if it's just sum and count, then you could aggregate in chunks:

    import pandas as pd

    partials = []
    reader = pd.read_table(strFileName, chunksize=16*1024)  # choose a chunk size as appropriate
    for chunk in reader:
        temp = chunk.agg(...)   # your aggregation logic here (e.g. sum/count)
        partials.append(temp)   # collect the per-chunk results
    df = pd.concat(partials).agg(...)  # redo your logic over the partial results
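    As a concrete illustration, here is a minimal sketch of what the placeholders could look like. It is an assumption, not part of the original answer: it takes the tab-separated schema ka, kb_1, kb_2, timeofEvent, timeInterval from the other answer, assumes no header row, and computes the sum and count of kb_2 per (ka, kb_1) group.

    import pandas as pd

    # assumed schema, taken from the line-by-line answer above
    cols = ['ka', 'kb_1', 'kb_2', 'timeofEvent', 'timeInterval']
    reader = pd.read_table(strFileName, names=cols, chunksize=16*1024)

    partials = []
    for chunk in reader:
        # per-chunk sum and count of the error column, grouped by (ka, kb_1)
        partials.append(chunk.groupby(['ka', 'kb_1'])['kb_2'].agg(['sum', 'count']))

    # sums and counts combine correctly across chunks, unlike averages
    totals = pd.concat(partials).groupby(level=['ka', 'kb_1']).sum()
    totals['error_rate'] = totals['sum'] / totals['count']
    print(totals)

    Only one chunk plus the running per-group totals are in memory at a time, so memory stays bounded as long as the number of distinct groups is manageable.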
    
  • 2020-12-10 16:34

    What @chrisaycock suggested is the preferred method if you need to sum or count.

    If you need to average, it won't work, because avg(a,b,c,d) does not equal avg(avg(a,b), avg(c,d)) unless every chunk has the same number of rows.
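    A small illustration of that point (not part of the original answer): to combine per-chunk results for a mean correctly, you have to carry the sum and the count, or equivalently weight each chunk mean by its size.

    # naive average of chunk averages vs. weighting by chunk size
    chunks = [[1, 2], [3, 4, 5]]

    naive = sum(sum(c) / len(c) for c in chunks) / len(chunks)            # 2.75 -- wrong
    weighted = sum(sum(c) for c in chunks) / sum(len(c) for c in chunks)  # 3.0  -- correct
    print(naive, weighted)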

    I suggest using a map-reduce-like approach with streaming.

    Create a file called map-col.py:

    import sys

    col = 2  # index of the column to average; 2 is kb_2 in the schema assumed above
    for line in sys.stdin:
        print(line.split('\t')[col])
    

    And a file named reduce-avg.py:

    import sys

    # accumulate a running sum and count, then print the overall average
    s = 0.0
    n = 0
    for line in sys.stdin:
        s += float(line)
        n += 1
    print(s / n)
    

    And in order to run the whole thing:

    cat strFileName | python map-col.py | python reduce-avg.py > output.txt
    

    This method will work regardless of the size of the file and will not run out of memory, since only a running sum and count are kept in memory.
