load a certain number of rows from csv with numpy


Question


I have a very long file and I only need part of it, a slice. New data keeps coming in, so the file will keep getting longer.

To load the data from the CSV I use numpy.genfromtxt:

    np.genfromtxt(filename, usecols={col}, delimiter=",", skip_header=skip_head)

This cuts off a certain part of the file at the beginning, which already speeds up loading substantially. But I can't use skip_footer to cut off the part after the slice I want, because the file keeps growing.

What I want is to load only a certain number of rows, e.g. skip the first 100 rows, then load the next 50 rows and skip the rest.

edit: I am using Python 3.4
edit: sample file: http://www.file-upload.net/download-10819938/sample.txt.html


Answer 1:


You could get the slice using itertools.islice, taking the column using operator.itemgetter. In Python 2:

import numpy as np
from operator import itemgetter
from itertools import islice, imap  # imap is Python 2 only
import csv

with open(filename) as f:
    r = csv.reader(f)
    # take rows start..end and keep only column 1; genfromtxt parses the values
    arr = np.genfromtxt(imap(itemgetter(1), islice(r, start, end + 1)))

For Python 3, you can use np.fromiter with the same approach; note that you need to specify the dtype:

import numpy as np
from operator import itemgetter
from itertools import islice
import csv

with open("sample.txt") as f:
    r = csv.reader(f)
    # fromiter needs an explicit dtype since it only sees a flat stream of values
    print(np.fromiter(map(itemgetter(0), islice(r, start, end + 1)), dtype=float))

As in the other answer, you can also pass the islice object directly to genfromtxt, but for Python 3 you will need to open the file in binary mode:

with open("sample.txt", "rb") as f:
    from itertools import islice
    print(np.genfromtxt(islice(f, start, end+1), delimiter=",", usecols=cols))

Interestingly, for multiple columns, using itertools.chain and reshaping is over twice as fast, provided all your dtypes are the same:

from itertools import islice, chain

with open("sample.txt") as f:
    r = csv.reader(f)
    # columns 0, 4 and 10 of rows 4..9, flattened into one stream, then reshaped to 6 rows
    arr = np.fromiter(chain.from_iterable(map(itemgetter(0, 4, 10), islice(r, 4, 10))),
                      dtype=float).reshape(6, -1)

On your sample file:

In [27]: %%timeit
with open("sample.txt", "rb") as f:
    np.genfromtxt(islice(f, 4, 10), delimiter=",", usecols=(0, 4, 10), dtype=float)
   ....:

10000 loops, best of 3: 179 µs per loop

In [28]: %%timeit
with open("sample.txt") as f:
    r = csv.reader(f)
    np.fromiter(chain.from_iterable(map(itemgetter(0, 4, 10), islice(r, 4, 10))),
                dtype=float).reshape(6, -1)
   ....:

10000 loops, best of 3: 86 µs per loop
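The fromiter version presumably wins because the csv module has already split each line into fields, so fromiter only has to convert them to float, whereas genfromtxt repeats its own per-line splitting and missing-value handling.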



Answer 2:


Following this example, you should be able to use itertools.islice directly, without needing imap, map, or csv.reader:

import numpy as np
import itertools

with open('sample.txt') as f:
    # this will skip 100 lines, then read the next 50
    d = np.genfromtxt(itertools.islice(f, 100, 150), delimiter=',', usecols=cols)
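Note that 100 and 150 here are zero-based line offsets into the raw file, so if the CSV has a header line, it counts as one of the 100 skipped lines.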



Answer 3:


Starting with NumPy 1.10, np.genfromtxt takes an optional parameter max_rows, which limits the number of lines to read.

Combined with the other optional parameter skip_header, you can select a slice of your file (for instance, lines 100 to 150):

import numpy as np

np.genfromtxt('file.txt', skip_header=100, max_rows=50)
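
Putting it together for the scenario in the question (a comma-separated file, skip the first 100 rows, then read 50), a minimal sketch might look like this; filename and col stand in for the placeholders used in the question's own code:

import numpy as np

# skip 100 data rows, then read exactly 50 rows of a single column from the CSV
# (filename and col are placeholders, matching the question's code)
data = np.genfromtxt(filename, delimiter=",", usecols=(col,),
                     skip_header=100, max_rows=50)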


Source: https://stackoverflow.com/questions/31833004/load-a-certain-number-of-rows-from-csv-with-numpy
