Efficient cython file reading, string parsing, and array building

醉话见心 2020-12-28 10:54

So I have some data files that look like this:

      47
   425   425  -3 15000 15000 900   385   315   3   370   330   2   340   330   2
   325   315   2   3         


        
2 Answers
  •  眼角桃花
    2020-12-28 11:09

    Files that are fixed-format and well-behaved can be read efficiently with NumPy. The idea is to read the file into an array of single-character strings and then convert to integers in one go. The tricky bits are the fields of differing widths (6, 6 and 4 characters here) and the placement of newline characters. One way to do it for your file is:

    def read_chunk_numpy(fh, n_points):
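        # 16 chars per point (6 + 6 + 4), plus one newline per line of up to
        # 5 points; the last line may be partial, so round up.
        n_bytes = n_points * 16 + (n_points + 4) // 5
    
        # Read the raw bytes one character at a time and drop the newlines.
        txt_arr = np.fromfile(fh, 'S1', n_bytes)
        txt_arr = txt_arr[txt_arr != b'\n']
        # Reinterpret every 16 characters as three fixed-width text fields,
        # then convert all fields to integers in one vectorized cast.
        xyz = txt_arr.view('S6,S6,S4').astype('i,i,i')
        xyz.dtype.names = 'x', 'y', 'z'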
        return xyz
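
    To see the reinterpretation step in isolation, here is a small in-memory sketch. The sample bytes and the name raw are made up for illustration, assuming the same 6 + 6 + 4 character layout with up to five points per line:

```python
import numpy as np

# Hypothetical sample: three points in the 6 + 6 + 4 layout, spread over two lines
raw = b"   425   425  -3 15000 15000 900\n   385   315   3\n"

chars = np.frombuffer(raw, dtype='S1')   # one array element per byte
chars = chars[chars != b'\n']            # drop newlines; indexing also makes a copy
records = chars.view('S6,S6,S4')         # every 16 bytes become one text record
xyz = records.astype('i,i,i')            # convert all three text fields to integers
xyz.dtype.names = 'x', 'y', 'z'

print(xyz['x'], xyz['y'], xyz['z'])
```

    The view only works if the number of remaining characters is an exact multiple of 16, which is why the byte count passed to the reader has to include every newline in the chunk.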
    

    Note that \n line endings are assumed, so files with \r\n line endings need some extra handling for portability. In my tests this was much faster than the plain Python loop. Test code:

    import numpy as np
    
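    def write_testfile(fname, n_points):
        # The file is opened in binary mode, so every write must be bytes
        with open(fname, 'wb') as fh:
            for _ in range(n_points // 1000):
                n_chunk = np.random.randint(900, 1100)
                # Header line: number of points in the chunk
                fh.write((str(n_chunk).rjust(8) + '\n').encode('ascii'))
                xyz = np.random.randint(10**4, size=(n_chunk, 3))
                # Up to 5 points per line, 16 characters per point
                for i in range(0, n_chunk, 5):
                    for row in xyz[i:i+5]:
                        fh.write(('%6i%6i%4i' % tuple(row)).encode('ascii'))
                    fh.write(b'\n')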
    
    def read_chunk_plain(fh, n_points):
        points = []
        count = 0
        # Use while-loop because `for line in fh` would mess with file pointer
        while True:
            line = fh.readline()
            if not line:
                # End of file reached before n_points were read
                return points
            # Each point takes 16 characters; the trailing newline is ignored
            n_chunks = len(line) // 16
            for i in range(n_chunks):
                chunk = line[16*i:16*(i+1)]
                x = int(chunk[0:6])
                y = int(chunk[6:12])
                z = int(chunk[12:16])
                points.append((x, y, z))
    
                count += 1
                if count == n_points:
                    return points
    
    def test(fname, read_chunk):
        with open(fname, 'rb') as fh:
            line = fh.readline().strip()
            while line:
                n = int(line)
                read_chunk(fh, n)
                line = fh.readline().strip()
    
    fname = 'test.txt'
    write_testfile(fname, 10**5)
    # %timeit is an IPython magic; run these two lines in IPython or Jupyter
    %timeit test(fname, read_chunk_numpy)
    %timeit test(fname, read_chunk_plain)
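
    On the portability note: a minimal sketch of a reader that tolerates \r\n as well as \n line endings. It is not the function above; the name read_chunk_portable and the newline parameter are hypothetical, and it uses fh.read instead of np.fromfile so that it also works on in-memory streams:

```python
import io
import numpy as np

def read_chunk_portable(fh, n_points, newline=b'\n'):
    """Read n_points fixed-width (6, 6, 4) records; `newline` is the file's line ending."""
    n_lines = (n_points + 4) // 5                 # up to 5 points per line, rounded up
    n_bytes = n_points * 16 + n_lines * len(newline)
    chars = np.frombuffer(fh.read(n_bytes), dtype='S1')
    # Drop both characters that can make up a line ending
    chars = chars[(chars != b'\n') & (chars != b'\r')]
    xyz = chars.view('S6,S6,S4').astype('i,i,i')
    xyz.dtype.names = 'x', 'y', 'z'
    return xyz

# Usage with a made-up Windows-style chunk of six points
points = [(i, 10 * i, -i) for i in range(1, 7)]
lines = [points[i:i + 5] for i in range(0, len(points), 5)]
data = b''.join(
    b''.join(b'%6d%6d%4d' % p for p in line) + b'\r\n' for line in lines
)
xyz = read_chunk_portable(io.BytesIO(data), len(points), newline=b'\r\n')
print(xyz['x'], xyz['y'], xyz['z'])
```

    The caller still has to know which line ending the file uses, since that determines how many bytes belong to the chunk; detecting it from the first header line would be one way to do that.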
    
