NumPy reading file with filtering lines on the fly

南方客 2021-01-02 08:19

I have a large array of numbers written in a CSV file and need to load only a slice of that array. Conceptually I want to call np.genfromtxt() and then row-slice the resulting array, but that would mean reading the whole file into memory first.

3 Answers
  •  無奈伤痛
    2021-01-02 08:58

    I can think of two approaches that provide some of the functionality you are asking for:

    1. To read a file in chunks, in strides of n lines, etc.:
      You can pass a generator to numpy.genfromtxt as well as to numpy.loadtxt. This way you can load a large dataset from a text file memory-efficiently while retaining all the convenient parsing features of the two functions.

    2. To read data only from lines that match a criterion that can be expressed as a regex:
      You can use numpy.fromregex with a regular expression that precisely defines which tokens from a given line of the input file should be loaded. Lines not matching the pattern are ignored.

    To illustrate the two approaches, I'm going to use an example from my research context.
    I often need to load files with the following structure:

    6
     generated by VMD
      CM         5.420501        3.880814        6.988216
      HM1        5.645992        2.839786        7.044024
      HM2        5.707437        4.336298        7.926170
      HM3        4.279596        4.059821        7.029471
      OD1        3.587806        6.069084        8.018103
      OD2        4.504519        4.977242        9.709150
    6
     generated by VMD
      CM         5.421396        3.878586        6.989128
      HM1        5.639769        2.841884        7.045364
      HM2        5.707584        4.343513        7.928119
      HM3        4.277448        4.057222        7.022429
      OD1        3.588119        6.069086        8.017814
    

    These files can be huge (GBs) and I'm only interested in the numerical data. All data blocks have the same size -- 6 in this example -- and each block is preceded by two header lines (an atom count and a comment). So the stride of the blocks is 8.

    Using the first approach:

    First I'm going to define a generator that filters out the undesired lines:

    def filter_lines(f, stride):
        # Within each block of `stride` lines, skip line 0 (the count line)
        # and line 1 (the comment line); yield only the data lines.
        for i, line in enumerate(f):
            if i % stride and (i - 1) % stride:
                yield line
    

    Then I open the file, create a filter_lines-generator (here I need to know the stride), and pass that generator to genfromtxt:

    import numpy as np

    with open(fname) as f:
        data = np.genfromtxt(filter_lines(f, 8),
                             dtype='f',
                             usecols=(1, 2, 3))
    

    This works like a charm. Note that I'm able to use usecols to get rid of the first column of the data. In the same way, you could use all the other features of genfromtxt -- detecting the types, varying types from column to column, missing values, converters, etc.

    In this example data.shape was (204000, 3) while the original file consisted of 272000 lines.
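    The first approach can be reproduced end to end on a toy file. The sketch below uses an in-memory file (io.StringIO) standing in for fname, with blocks of 3 atoms each, so the stride is 3 + 2 = 5; the block sizes and coordinate values are made up for illustration:

```python
import io
import numpy as np

def filter_lines(f, stride):
    # Within each block of `stride` lines, skip line 0 (the count line)
    # and line 1 (the comment line); yield only the data lines.
    for i, line in enumerate(f):
        if i % stride and (i - 1) % stride:
            yield line

# Toy file: two blocks of 3 atoms each, so the stride is 5.
text = """\
3
 generated by VMD
  CM   1.0  2.0  3.0
  HM1  4.0  5.0  6.0
  HM2  7.0  8.0  9.0
3
 generated by VMD
  CM   1.1  2.1  3.1
  HM1  4.1  5.1  6.1
  HM2  7.1  8.1  9.1
"""

with io.StringIO(text) as f:
    data = np.genfromtxt(filter_lines(f, 5), dtype='f', usecols=(1, 2, 3))

print(data.shape)  # (6, 3): 2 blocks x 3 atoms, 3 coordinate columns
```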

    Here the generator is used to filter homogeneously strided lines, but one can likewise imagine it filtering out inhomogeneous blocks of lines based on (simple) criteria.
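    To make the content-based variant concrete, here is one possible sketch (the helper names filter_matching and is_data_line are my own, not from the answer): instead of counting lines, the generator applies a predicate to each line, here "does the second token parse as a number":

```python
import io
import numpy as np

def filter_matching(f, predicate):
    # Yield only lines for which `predicate` is True; the criterion is
    # computed from the line's content rather than from its index.
    for line in f:
        if predicate(line):
            yield line

def is_data_line(line):
    # A data line has at least two tokens and its second token is numeric.
    parts = line.split()
    if len(parts) < 2:
        return False
    try:
        float(parts[1])
        return True
    except ValueError:
        return False

text = """\
3
 generated by VMD
  CM   1.0  2.0  3.0
  HM1  4.0  5.0  6.0
  HM2  7.0  8.0  9.0
"""

with io.StringIO(text) as f:
    data = np.genfromtxt(filter_matching(f, is_data_line),
                         dtype='f', usecols=(1, 2, 3))

print(data.shape)  # (3, 3): the count and comment lines were dropped
```

    This version needs no prior knowledge of the stride, at the cost of calling the predicate on every line.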

    Using the second approach:

    Here's the regexp I'm going to use:

    regexp = r'\s+\w+' + r'\s+([-.0-9]+)' * 3 + r'\s*\n'
    

    Groups -- i.e. inside () -- define the tokens to be extracted from a given line. Next, fromregex does the job and ignores lines not matching the pattern:

    data = np.fromregex(fname, regexp, dtype='f')
    

    The result is exactly the same as in the first approach.
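    A self-contained sketch of the second approach on the same toy data is below. One caveat I'm adding myself: newer NumPy versions may complain unless the dtype passed to fromregex is structured, so the sketch plays it safe with one (arbitrarily named) field per regex group and then stacks the fields into a plain (N, 3) array:

```python
import io
import numpy as np

# One capture group per coordinate; the atom-name token is matched
# but not captured, so it is not loaded.
regexp = r'\s+\w+' + r'\s+([-.0-9]+)' * 3 + r'\s*\n'

text = """\
3
 generated by VMD
  CM   1.0  2.0  3.0
  HM1  4.0  5.0  6.0
  HM2  7.0  8.0  9.0
"""

# Structured dtype: one field per group (field names are arbitrary).
dtype = [('x', 'f8'), ('y', 'f8'), ('z', 'f8')]

with io.StringIO(text) as f:
    data = np.fromregex(f, regexp, dtype)

# Convert the structured records into a plain (N, 3) float array.
coords = np.stack([data['x'], data['y'], data['z']], axis=1)
print(coords.shape)  # (3, 3): only the data lines matched the pattern
```

    The count and comment lines never match the pattern, so they are skipped without any knowledge of the file's stride.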
