Pandas read csv out of memory


迷失自我 2020-12-10 15:45

I am trying to manipulate a large CSV file using Pandas. When I run this line:

df = pd.read_csv(strFileName, sep='\t', delimiter='\t')

it raises

3 answers

被撕碎了的回忆 (original poster)

2020-12-10 16:34
What @chrisaycock suggested is the preferred method if you only need a sum or a count.

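If you need an average, combining per-chunk averages won't work in general, because the average of averages only equals the overall average when every chunk has the same size: avg(a,b,c) does not equal avg(avg(a,b),avg(c)).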
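
To make the averaging pitfall concrete, here is a small sketch. The helper names `mean` and `combine_mean` and the sample chunks are made up for illustration. It compares the naive average of per-chunk averages with a version that carries a running sum and count across chunks, which stays exact even when chunk sizes differ.

```python
def mean(values):
    """Plain arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def combine_mean(chunks):
    """Exact overall mean: accumulate sum and count per chunk, divide once."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    return total / count


# Two chunks of unequal size (hypothetical data).
chunks = [[1.0, 2.0, 3.0], [10.0]]

naive = mean([mean(c) for c in chunks])  # average of the chunk averages
exact = combine_mean(chunks)             # sum of all values / number of values

print(naive, exact)  # 6.0 4.0
```

The same sum-and-count bookkeeping is what lets a chunked or streaming approach compute an average without holding the whole file in memory.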

I suggest a map-reduce-like approach with streaming.

Create a file called map-col.py:
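```
import sys

# Zero-based column index, taken from the first argument (defaults to 0).
col = int(sys.argv[1]) if len(sys.argv) > 1 else 0
for line in sys.stdin:
    print(line.rstrip('\n').split('\t')[col])
```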
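And a file named reduce-avg.py:
```
import sys

total = 0.0  # running sum of the values seen so far
count = 0    # number of values seen so far
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    total += float(line)
    count += 1
print(total / count if count else float('nan'))
```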
To run the whole pipeline:
```
cat strFileName | python map-col.py | python reduce-avg.py > output.txt
```
This method works regardless of the size of the file and will not run out of memory, because each script holds only one line in memory at a time.
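
The two scripts can also be folded into a single streaming pass. The sketch below is one way to do that with the standard `csv` module. The function name `stream_column_mean`, its parameters, and the sample data are assumptions for illustration, not part of the original answer. It reads one row at a time, so memory use does not grow with file size.

```python
import csv
import io


def stream_column_mean(lines, col, delimiter="\t", skip_header=False):
    """Average one column of delimited text, reading a row at a time.

    `lines` is any iterable of text lines (an open file, sys.stdin, ...).
    Rows that are too short or have an empty cell in `col` are skipped.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    if skip_header:
        next(reader, None)  # discard the first row if present
    total = 0.0
    count = 0
    for row in reader:
        if len(row) <= col or not row[col].strip():
            continue
        total += float(row[col])
        count += 1
    return total / count if count else float("nan")


# Hypothetical in-memory sample standing in for a large TSV file.
sample = io.StringIO("id\tprice\n1\t2.0\n2\t4.0\n3\t6.0\n")
print(stream_column_mean(sample, col=1, skip_header=True))  # 4.0
```

For a real file, pass an open handle instead of the sample, for example `with open(path, newline="") as f: stream_column_mean(f, col=1)`, where `path` is whatever file you are processing.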