Create a pandas DataFrame from generator?

野的像风 2020-12-08 07:07

I've created a tuple generator that extracts information from a file, keeping only the records of interest and converting each one to a tuple that the generator yields.

I've

5 Answers
  • 2020-12-08 07:22

    If the generator yields DataFrames, much as a list of DataFrames would hold them, you just need to concatenate its elements into a new DataFrame:

    result = pd.concat(frames)  # frames: the list or generator of DataFrames

    Recently I've faced the same problem.
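
A minimal sketch of that case, assuming each item the generator yields is itself a small DataFrame. The `frames` generator and its column names are made up for illustration.

```python
import pandas as pd

def frames():
    # Hypothetical generator: yields one small DataFrame per batch.
    for start in range(0, 6, 2):
        yield pd.DataFrame({
            "n": [start, start + 1],
            "square": [start ** 2, (start + 1) ** 2],
        })

# pd.concat accepts an iterable of DataFrames, so the generator
# can be passed directly; ignore_index renumbers the rows 0..n-1.
result = pd.concat(frames(), ignore_index=True)
```

Note that `pd.concat` still has to collect every yielded frame before building the result, so this saves typing rather than memory.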

  • 2020-12-08 07:26

    You can also use something like this (tested in Python 2.7.5):

    from itertools import izip
    import pandas as pd
    
    def dataframe_from_row_iterator(row_iterator, colnames):
        col_iterator = izip(*row_iterator)
        return pd.DataFrame({cn: cv for (cn, cv) in izip(colnames, col_iterator)})
    

    You can also adapt this to append rows to a DataFrame.

    -- Edit, Dec 4th: s/row/rows in last line
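
On Python 3, `itertools.izip` no longer exists and the built-in `zip` is already lazy. The following is a sketch of the same idea under that assumption; the sample rows and column names are hypothetical.

```python
import pandas as pd

def dataframe_from_row_iterator(row_iterator, colnames):
    # zip(*rows) transposes rows into columns. Unpacking with *
    # reads the whole iterator into memory before zipping.
    col_iterator = zip(*row_iterator)
    return pd.DataFrame({cn: cv for cn, cv in zip(colnames, col_iterator)})

# Hypothetical sample rows: (0, 'a'), (1, 'b'), (2, 'c')
rows = ((i, chr(97 + i)) for i in range(3))
df = dataframe_from_row_iterator(rows, ["num", "letter"])
```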

  • You cannot create a DataFrame from a generator with pandas 0.12. You can upgrade to the development version (get it from GitHub and compile it, which is a bit painful on Windows, but this is the option I would prefer).

    Or, since you said you are filtering the lines, you can filter them first, write them to a file, and then load them using read_csv or something similar.

    If you want to get really elaborate, you can create a file-like object that returns the lines:

    def gen():
        lines = [
            'col1,col2\n',
            'foo,bar\n',
            'foo,baz\n',
            'bar,baz\n'
        ]
        for line in lines:
            yield line
    
    class Reader(object):
        def __init__(self, g):
            self.g = g
        def read(self, n=0):
            try:
                return next(self.g)
            except StopIteration:
                return ''
    

    Then pass it to read_csv:

    >>> pd.read_csv(Reader(gen()))
      col1 col2
    0  foo  bar
    1  foo  baz
    2  bar  baz
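
If the filtered text fits in memory, the "filter first, then parse" route may not need a temporary file at all. The sketch below joins the kept lines into an `io.StringIO` buffer and hands that to `read_csv`; the `keep` predicate is a hypothetical filter, and `gen` reuses the sample lines from this answer.

```python
import io
import pandas as pd

def gen():
    # Same sample lines as in the answer above.
    yield "col1,col2\n"
    yield "foo,bar\n"
    yield "foo,baz\n"
    yield "bar,baz\n"

def keep(line):
    # Hypothetical filter: keep the header and rows whose first field is "foo".
    return line.startswith("col1") or line.startswith("foo,")

# Join only the kept lines into an in-memory text buffer and parse it.
buffer = io.StringIO("".join(line for line in gen() if keep(line)))
df = pd.read_csv(buffer)
```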
    
  • 2020-12-08 07:36

    To make this memory efficient, read in chunks. Something like this, using Viktor's Reader class from above:

    df = pd.concat(list(pd.read_csv(Reader(gen()), chunksize=10000)), axis=0)
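
Chunked reading only reduces peak memory if each chunk is shrunk before the pieces are combined. A minimal sketch of that pattern, assuming the data is filtered on a column value; it uses an in-memory `io.StringIO` buffer with made-up rows instead of the Reader class so that it runs on its own.

```python
import io
import pandas as pd

# Hypothetical CSV text: ten rows alternating between "bar" and "foo".
csv_text = "col1,col2\n" + "".join(
    f"{'foo' if i % 2 else 'bar'},{i}\n" for i in range(10)
)

# With chunksize set, read_csv returns an iterator of DataFrames.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=4)

# Filter each chunk as it arrives so only the matching rows are retained.
filtered = (chunk[chunk["col1"] == "foo"] for chunk in chunks)
df = pd.concat(filtered, ignore_index=True)
```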
    
  • 2020-12-08 07:38

    You certainly can construct a pandas.DataFrame() from a generator of tuples as of version 0.19 (and probably earlier). Don't use .from_records(); just use the constructor, for example:

    import pandas as pd
    someGenerator = ( (x, chr(x)) for x in range(48,127) )
    someDf = pd.DataFrame(someGenerator)
    

    Produces:

    type(someDf) #pandas.core.frame.DataFrame
    
    someDf.dtypes
    #0     int64
    #1    object
    #dtype: object
    
    someDf.tail(10)
    #      0  1
    #69  117  u
    #70  118  v
    #71  119  w
    #72  120  x
    #73  121  y
    #74  122  z
    #75  123  {
    #76  124  |
    #77  125  }
    #78  126  ~
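
Applied to the situation in the question, the same constructor call can take a generator that filters records and yields tuples, with `columns=` naming the fields. This is only a sketch: the `records` parser, the line format, and the column names are assumptions for illustration.

```python
import pandas as pd

def records(lines):
    # Hypothetical parser: keep only lines with a positive amount and
    # convert each kept line to a (name, amount) tuple.
    for line in lines:
        name, amount = line.strip().split(",")
        if int(amount) > 0:
            yield name, int(amount)

lines = ["a,1", "b,0", "c,5"]  # stand-in for lines read from a file
df = pd.DataFrame(records(lines), columns=["name", "amount"])
```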
    