Pandas read_csv out of memory

Backend · Unresolved · 3 replies · 680 views
迷失自我 2020-12-10 15:45

I'm trying to manipulate a large CSV file using Pandas. When I write

df = pd.read_csv(strFileName, sep='\t')

it raises a MemoryError.

3 replies
  •  被撕碎了的回忆
    2020-12-10 16:34

    What @chrisaycock suggested is the preferred method if you need to sum or count.

    If you need an average, it won't work, because avg(a,b,c,d) does not equal avg(avg(a,b), avg(c,d)).
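A tiny sketch makes the pitfall concrete (the numbers are made up for illustration): averaging per-chunk averages gives each chunk equal weight regardless of its size, while carrying (sum, count) pairs per chunk composes exactly.

```python
chunk1 = [1.0, 2.0, 3.0]
chunk2 = [10.0]

# Naive average-of-averages: each chunk's mean gets equal weight,
# which is wrong when the chunks have different sizes.
avg_of_avgs = (sum(chunk1) / len(chunk1) + sum(chunk2) / len(chunk2)) / 2
print(avg_of_avgs)  # 6.0

# Carrying (sum, count) per chunk and combining at the end is exact.
s = sum(chunk1) + sum(chunk2)
n = len(chunk1) + len(chunk2)
print(s / n)  # 4.0, the true mean of all four values
```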

    I suggest using a map-reduce-like approach, with streaming.

    Create a file called map-col.py:

    import sys

    # Index of the tab-separated column to extract (0-based).
    col = 0
    for line in sys.stdin:
        print(line.rstrip('\n').split('\t')[col])

    And a file named reduce-avg.py:

    import sys

    # Running sum and count of the values read from stdin.
    s = 0.0
    n = 0
    for line in sys.stdin:
        s += float(line)
        n += 1
    print(s / n)

    And in order to run the whole thing:

    cat strFileName | python map-col.py | python reduce-avg.py > output.txt

    This method works regardless of the size of the file and will not run out of memory, since only one line is held in memory at a time.
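For completeness, the same streaming average can stay inside pandas by using read_csv's chunksize parameter, which yields DataFrames one piece at a time instead of loading the whole file. A minimal sketch (the column name `value` and the in-memory sample data are assumptions for illustration; pass strFileName in practice):

```python
import io
import pandas as pd

# Stand-in for a large tab-separated file; replace with strFileName.
data = io.StringIO("value\n1.0\n2.0\n3.0\n10.0\n")

s, n = 0.0, 0
# chunksize makes read_csv yield DataFrames of at most that many rows,
# so only one chunk is in memory at a time.
for chunk in pd.read_csv(data, sep='\t', chunksize=2):
    s += chunk['value'].sum()
    n += len(chunk)
print(s / n)  # 4.0
```

This is the same (sum, count) trick as the pipeline above, just expressed with pandas chunks instead of shell streams.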
