Load a certain number of rows from a CSV with NumPy

梦毁少年i 2021-01-14 15:43

I have a very long file and I only need parts, a slice, of it. There is new data coming in so the file will potentially get longer.

To load the data from the CSV I u

3 answers
  •  情深已故
    2021-01-14 16:19

    You could get the slice using itertools, taking the column using itemgetter:

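    import csv
    from itertools import islice, imap  # Python 2 only: imap does not exist in Python 3
    from operator import itemgetter
    import numpy as np

    with open(filename) as f:
        r = csv.reader(f)
        # Take column 1 from rows start..end (0-based, end inclusive)
        arr = np.genfromtxt(imap(itemgetter(1), islice(r, start, end + 1)))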
    

    For Python 3, you can use fromiter with the code above, but you need to specify the dtype:

    import csv
    from itertools import islice
    from operator import itemgetter
    import numpy as np

    with open("sample.txt") as f:
        r = csv.reader(f)
        # fromiter has no way to infer the element type from an iterator, so dtype is required
        print(np.fromiter(map(itemgetter(0), islice(r, start, end + 1)), dtype=float))
    

    As in the other answer, you can also pass the islice object directly to genfromtxt, but for Python 3 you will need to open the file in binary mode:

    from itertools import islice

    with open("sample.txt", "rb") as f:
        print(np.genfromtxt(islice(f, start, end + 1), delimiter=",", usecols=cols))
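
To try the islice plus genfromtxt approach without a file on disk, here is a minimal sketch. The helper name `load_rows` and the in-memory sample data are made up for illustration, and it assumes a recent NumPy (1.14 or later), which as far as I know also accepts text-mode lines, so no binary mode is used here:

```python
import io
from itertools import islice

import numpy as np


def load_rows(lines, start, stop, cols):
    """Parse rows start..stop-1 (0-based) of an iterable of CSV lines.

    Hypothetical helper, not part of NumPy.
    """
    # islice skips the first `start` lines lazily and stops before `stop`,
    # so nothing past `stop` is ever read or parsed.
    return np.genfromtxt(islice(lines, start, stop), delimiter=",", usecols=cols)


# Made-up stand-in for the real file: 10 rows, 3 numeric columns.
sample = io.StringIO("".join(f"{i},{i * 10},{i * 100}\n" for i in range(10)))
block = load_rows(sample, 4, 7, (0, 2))
print(block)
```

Because the slice is taken before parsing, the cost should depend on where the slice ends rather than on how long the file has grown.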
    

    Interestingly, for multiple columns, using itertools.chain and reshaping is a little over twice as fast, provided all your columns have the same dtype:

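    import csv
    import numpy as np
    from itertools import chain, islice
    from operator import itemgetter

    with open("sample.txt") as f:
        r = csv.reader(f)
        # Rows 4..9 (6 rows), columns 0, 4 and 10: flatten, then reshape to 6 x 3
        arr = np.fromiter(chain.from_iterable(map(itemgetter(0, 4, 10),
                                                  islice(r, 4, 10))), dtype=float).reshape(6, -1)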
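
Since the file keeps growing and a slice may come back shorter than requested, hard-coding the row count in reshape can fail. A possible variation, sketched here with a made-up helper name `load_columns` and made-up sample data, is to fix the column count and let NumPy work out the number of rows:

```python
import csv
import io
from itertools import chain, islice
from operator import itemgetter

import numpy as np


def load_columns(lines, start, stop, cols):
    """Read columns `cols` of rows start..stop-1 into a 2-D float array.

    Hypothetical helper. Needs at least two column indices, because
    itemgetter with a single index returns a bare string, not a tuple.
    """
    reader = csv.reader(lines)
    picked = map(itemgetter(*cols), islice(reader, start, stop))
    # Explicit float() conversion, so the sketch does not depend on how
    # fromiter coerces strings.
    flat = np.fromiter(map(float, chain.from_iterable(picked)), dtype=float)
    # One row per CSV line read; -1 lets NumPy infer how many that was.
    return flat.reshape(-1, len(cols))


# Made-up sample: only 6 rows exist, although rows 4..9 are requested.
sample = io.StringIO("".join(f"{i},{i + 0.5},{i * 2},{i * 3}\n" for i in range(6)))
arr = load_columns(sample, 4, 10, (0, 2, 3))
print(arr)
```

Here the request runs past the end of the data, so only two rows come back instead of six.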
    

    On your sample file:

    In [27]: %%timeit
       ....: with open("sample.txt", "rb") as f:
       ....:     np.genfromtxt(islice(f, 4, 10), delimiter=",", usecols=(0, 4, 10), dtype=float)
       ....:
    10000 loops, best of 3: 179 µs per loop

    In [28]: %%timeit
       ....: with open("sample.txt") as f:
       ....:     r = csv.reader(f)
       ....:     np.fromiter(chain.from_iterable(map(itemgetter(0, 4, 10), islice(r, 4, 10))), dtype=float).reshape(6, -1)
       ....:
    10000 loops, best of 3: 86 µs per loop
    
