Efficient way to read 15M-line CSV files in Python

花落未央 2021-02-01 07:39

For my application, I need to read multiple files with 15M lines each, store them in a DataFrame, and save the DataFrame in HDF5 format.

I've already tried different …

2 Answers
  •  灰色年华
    2021-02-01 08:11

    First, let's answer the title of the question:

    1- How to efficiently read 15M lines of a CSV containing floats

    I suggest you use modin:

    Generating sample data:

    import modin.pandas as mpd
    import pandas as pd
    import numpy as np
    
    frame_data = np.random.randint(0, 10_000_000, size=(15_000_000, 2)) 
    pd.DataFrame(frame_data*0.0001).to_csv('15mil.csv', header=False)
    
    !wc 15mil*.csv ; du -h 15mil*.csv
    
        15000000   15000000  480696661 15mil.csv
        459M    15mil.csv
    

    Now to the benchmarks:

    %%timeit -r 3 -n 1 -t
    global df1
    df1 = pd.read_csv('15mil.csv', header=None)
        9.7 s ± 95.1 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
    
    %%timeit -r 3 -n 1 -t
    global df2
    df2 = mpd.read_csv('15mil.csv', header=None)
        3.07 s ± 685 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
    
    (df2.values == df1.values).all()
        True
    

    So, as we can see, modin was approximately 3 times faster on my setup.
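
    As an aside, once the DataFrame is in memory, the HDF5 step from the question is a one-liner with pandas. A minimal sketch, assuming the sample file generated above; the output name 'out.h5' and the key 'data' are hypothetical, and to_hdf requires the PyTables package:

    import pandas as pd
    
    # Read the CSV, then persist it as HDF5.
    # 'out.h5' and key='data' are hypothetical names; requires PyTables
    # (pip install tables).
    df = pd.read_csv('15mil.csv', header=None)
    df.to_hdf('out.h5', key='data', mode='w')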


    Now to answer your specific problem

    2- Cleaning a CSV file that contains non-numeric characters, and then reading it

    As others have noted, your bottleneck is probably the converters: you are calling those lambdas 30 million times, and at that scale even the function-call overhead becomes non-trivial.
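
    To get a feel for that overhead, here is a rough, hypothetical micro-benchmark (numbers will vary by machine); it only times the Python-level call, not the regexp work:

    import timeit
    
    # Time 1 million calls of a trivial lambda, then extrapolate to the
    # ~30 million converter calls made while parsing the file.
    f = lambda x: x
    t = timeit.timeit(lambda: f("1.23"), number=1_000_000)
    print(f"{t:.2f} s per million calls -> ~{t * 30:.0f} s for 30 million")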

    Let's attack this problem.

    Generating a dirty dataset (the sed command inserts a ')' after every 4 characters):

    !sed 's/.\{4\}/&)/g' 15mil.csv > 15mil_dirty.csv
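
    If sed is not available, a pure-Python sketch of the same corruption could look like this:

    import re
    
    # Same effect as the sed command above: append ')' after every
    # 4 characters on each line ('.' does not match the newline).
    with open('15mil.csv') as src, open('15mil_dirty.csv', 'w') as dst:
        for line in src:
            dst.write(re.sub(r'(.{4})', r'\1)', line))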
    

    Approaches

    First, I tried using modin with the converters argument. Then, I tried a different approach that calls the regexp fewer times:

    First, I will create a file-like object that filters everything it reads through your regexp:

    import re
    
    class FilterFile():
        def __init__(self, file):
            self.file = file
        def read(self, n):
            # Strip every character that is not a digit, '.', ',' or newline
            return re.sub(r"[^\d.,\n]", "", self.file.read(n))
        def write(self, *a): return self.file.write(*a)  # needed to trick pandas
        def __iter__(self, *a): return self.file.__iter__(*a)  # needed
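
    As a quick sanity check, a sketch that feeds the FilterFile class above a small in-memory sample via io.StringIO instead of the real file:

    import io
    
    # Confirm that the ')' characters are stripped from a dirty sample
    sample = io.StringIO("0,12.34),56.78)\n1,90.12),34.56)\n")
    print(FilterFile(sample).read(100))
    # -> 0,12.34,56.78
    #    1,90.12,34.56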
    

    Then we pass it to pandas as the first argument of read_csv. This is safe here because the regexp matches single characters, so a match can never be split across two read() chunks:

    with open('15mil_dirty.csv') as file:
        df2 = pd.read_csv(FilterFile(file))
    

    Benchmarks:

    %%timeit -r 1 -n 1 -t
    global df1
    df1 = pd.read_csv('15mil_dirty.csv', header=None,
            converters={0: lambda x: np.float32(re.sub(r"[^\d.]", "", x)),
                        1: lambda x: np.float32(re.sub(r"[^\d.]", "", x))}
               )
        2min 28s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
    
    %%timeit -r 1 -n 1 -t
    global df2
    df2 = mpd.read_csv('15mil_dirty.csv', header=None,
            converters={0: lambda x: np.float32(re.sub(r"[^\d.]", "", x)),
                        1: lambda x: np.float32(re.sub(r"[^\d.]", "", x))}
               )
        38.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
    
    %%timeit -r 1 -n 1 -t
    global df3
    df3 = pd.read_csv(FilterFile(open('15mil_dirty.csv')), header=None,)
        1min ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
    

    Seems like modin wins again! Unfortunately, modin has not implemented reading from buffers yet, so I devised the ultimate approach.

    The Ultimate Approach:

    %%timeit -r 1 -n 1 -t
    with open('15mil_dirty.csv') as f, open('/dev/shm/tmp_file', 'w') as tmp:
        tmp.write(f.read().translate({ord(i):None for i in '()'}))
    df4 = mpd.read_csv('/dev/shm/tmp_file', header=None)
        5.68 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
    

    This uses translate, which is considerably faster than re.sub, and also uses /dev/shm, an in-memory filesystem that Ubuntu (and other Linux distributions) usually provide. Any file written there never touches the disk, so it is fast. Finally, it uses modin to read the file, working around modin's buffer limitation. This approach is about 30 times faster than yours, and it is also quite simple.
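
    For reference, a tiny sketch showing that the translate call strips the parentheses exactly like a re.sub would, just without the regexp machinery:

    import re
    
    # The translation table maps '(' and ')' to None, i.e. deletes them
    s = "12.34),56.78)\n"
    assert s.translate({ord(c): None for c in '()'}) == re.sub(r"[()]", "", s)

    The same table can also be built with str.maketrans('', '', '()').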
