Pandas read csv out of memory

迷失自我 2020-12-10 15:45

I am trying to manipulate a large CSV file using Pandas. When I write this

df = pd.read_csv(strFileName, sep='\t', delimiter='\t')

it raises an out-of-memory error.

3 Answers
  • 2020-12-10 16:15

    Based on your snippet in "out of memory error when reading csv file in chunk", here is how to do the same aggregation when reading the file line by line.

    I assume that kb_2 is the error indicator:

    groups = {}
    with open("data/petaJoined.csv", "r") as large_file:
        for line in large_file:
            arr = line.split('\t')
            # assuming this structure: ka,kb_1,kb_2,timeofEvent,timeInterval
            k = arr[0] + ',' + arr[1]  # group key: ka,kb_1
            if k not in groups:
                groups[k] = {'record_count': 0, 'error_sum': 0}
            groups[k]['record_count'] += 1
            groups[k]['error_sum'] += float(arr[2])

    # compute the error rate per group once the whole file has been read
    for k, v in groups.items():
        print('{group}: {error_rate}'.format(group=k, error_rate=v['error_sum'] / v['record_count']))
    

    This code snippet stores all the groups in a dictionary, and calculates the error rate after reading the entire file.

    It will still run into an out-of-memory error if there are too many distinct combinations of groups.

  • 2020-12-10 16:18

    You haven't stated what your intended aggregation would be, but if it's just sum and count, then you could aggregate in chunks:

    import pandas as pd

    partials = []
    reader = pd.read_table(strFileName, chunksize=16*1024)  # choose a chunk size as appropriate
    for chunk in reader:
        temp = chunk.agg(...)   # your aggregation logic here (e.g. sum/count)
        partials.append(temp)   # collect the per-chunk results
    df = pd.concat(partials).agg(...)  # redo your logic over the partial results
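    As a concrete illustration, here is a minimal sketch of what the placeholders could look like. It is an assumption, not part of the original answer: it takes the tab-separated schema ka, kb_1, kb_2, timeofEvent, timeInterval from the other answer, assumes no header row, and computes the sum and count of kb_2 per (ka, kb_1) group.

    import pandas as pd

    # assumed schema, taken from the line-by-line answer above
    cols = ['ka', 'kb_1', 'kb_2', 'timeofEvent', 'timeInterval']
    reader = pd.read_table(strFileName, names=cols, chunksize=16*1024)

    partials = []
    for chunk in reader:
        # per-chunk sum and count of the error column, grouped by (ka, kb_1)
        partials.append(chunk.groupby(['ka', 'kb_1'])['kb_2'].agg(['sum', 'count']))

    # sums and counts combine correctly across chunks, unlike averages
    totals = pd.concat(partials).groupby(level=['ka', 'kb_1']).sum()
    totals['error_rate'] = totals['sum'] / totals['count']
    print(totals)

    Only one chunk plus the running per-group totals are in memory at a time, so memory stays bounded as long as the number of distinct groups is manageable.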
    
  • 2020-12-10 16:34

    What @chrisaycock suggested is the preferred method if you need to sum or count.

    If you need to average, it won't work, because avg(a,b,c,d) does not equal avg(avg(a,b), avg(c,d)) unless every chunk has the same number of rows.
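    A small illustration of that point (not part of the original answer): to combine per-chunk results for a mean correctly, you have to carry the sum and the count, or equivalently weight each chunk mean by its size.

    # naive average of chunk averages vs. weighting by chunk size
    chunks = [[1, 2], [3, 4, 5]]

    naive = sum(sum(c) / len(c) for c in chunks) / len(chunks)            # 2.75 -- wrong
    weighted = sum(sum(c) for c in chunks) / sum(len(c) for c in chunks)  # 3.0  -- correct
    print(naive, weighted)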

    I suggest using a map-reduce-like approach with streaming.

    Create a file called map-col.py:

    import sys

    col = 2  # index of the column to average; 2 is kb_2 in the schema assumed above
    for line in sys.stdin:
        print(line.split('\t')[col])
    

    And a file named reduce-avg.py:

    import sys

    # accumulate a running sum and count, then print the overall average
    s = 0.0
    n = 0
    for line in sys.stdin:
        s += float(line)
        n += 1
    print(s / n)
    

    And in order to run the whole thing:

    cat strFileName | python map-col.py | python reduce-avg.py > output.txt
    

    This method will work regardless of the size of the file and will not run out of memory, since only a running sum and count are kept in memory.
