I've created a tuple generator that extracts information from a file, filtering only the records of interest and converting each one to a tuple that the generator returns.
If the generator simply yields DataFrames, like a list of DataFrames, you just need to create a new DataFrame by concatenating the elements of the list:
result = pd.concat(frames)  # frames: your list of DataFrames
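A minimal sketch of that idea, assuming a hypothetical generator `frames()` that yields small DataFrames sharing the same columns. The generator is materialized into a list first, and `ignore_index=True` renumbers the rows of the combined frame.

```python
import pandas as pd

def frames():
    # Hypothetical generator: yields one small DataFrame per batch.
    for start in (0, 2, 4):
        yield pd.DataFrame({"n": [start, start + 1]})

# Collect the yielded frames into a list, then stack them row-wise.
# ignore_index=True gives the result a fresh 0..n-1 index.
result = pd.concat(list(frames()), ignore_index=True)
```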
Recently I've faced the same problem.
You can also use something like this (tested in Python 2.7.5):
from itertools import izip
import pandas as pd

def dataframe_from_row_iterator(row_iterator, colnames):
    # Transpose the rows into columns, then pair each column with its name.
    col_iterator = izip(*row_iterator)
    return pd.DataFrame({cn: cv for (cn, cv) in izip(colnames, col_iterator)})
You can also adapt this to append rows to a DataFrame.
-- Edit, Dec 4th: s/row/rows in last line
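For readers on Python 3, where `itertools.izip` no longer exists, the same approach might look like the sketch below. It assumes the built-in `zip` as the replacement, and the sample rows and column names are made up for illustration.

```python
import pandas as pd

def dataframe_from_row_iterator(row_iterator, colnames):
    # Transpose rows into columns. Note that the * unpacking pulls every
    # row into memory before zip starts producing columns.
    col_iterator = zip(*row_iterator)
    return pd.DataFrame(dict(zip(colnames, col_iterator)))

# Hypothetical row generator: (n, n squared) tuples.
rows = ((i, i * i) for i in range(4))
df = dataframe_from_row_iterator(rows, ["n", "square"])
```

Because the rows are unpacked all at once, this is convenient rather than memory-efficient.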
You cannot create a DataFrame from a generator with pandas 0.12. You can either update to the development version (get it from GitHub and compile it, which is a little painful on Windows, but it is the option I would prefer).
Or, since you said you are filtering the lines, you can first filter them, write them to a file, and then load them using read_csv or something else...
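The filter-then-write route could be sketched roughly as follows. The `records()` generator, its filter condition, and the column names are hypothetical stand-ins for the generator described in the question; the temporary file is only there to keep the example self-contained.

```python
import csv
import os
import tempfile

import pandas as pd

def records():
    # Hypothetical stand-in for the filtering tuple generator.
    for i in range(10):
        if i % 2 == 0:              # keep only the records of interest
            yield (i, "row%d" % i)

# Write the filtered rows to a temporary CSV file, one row at a time.
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["id", "label"])    # header row
    writer.writerows(records())
    path = tmp.name

df = pd.read_csv(path)   # load the file back with pandas
os.remove(path)          # clean up the temporary file
```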
If you want to get super complicated, you can create a file-like object that will return the lines:
def gen():
    lines = [
        'col1,col2\n',
        'foo,bar\n',
        'foo,baz\n',
        'bar,baz\n'
    ]
    for line in lines:
        yield line

class Reader(object):
    def __init__(self, g):
        self.g = g

    def read(self, n=0):
        try:
            return next(self.g)
        except StopIteration:
            return ''
And then use read_csv:
>>> pd.read_csv(Reader(gen()))
  col1 col2
0  foo  bar
1  foo  baz
2  bar  baz
To make this memory efficient, read in chunks. Something like this, using Viktor's Reader class from above (the chunks are blocks of rows, so they are concatenated along the default row axis):
df = pd.concat(list(pd.read_csv(Reader(gen()), chunksize=10000)))
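A small sketch of chunked reading, assuming the data is available as a file-like object (an in-memory `io.StringIO` stands in for a real file here, and the sample rows are made up). With `chunksize`, `read_csv` returns an iterator of DataFrames, so each chunk can be filtered before the pieces are stacked row-wise, which keeps peak memory lower than loading everything first.

```python
import io

import pandas as pd

csv_text = "col1,col2\nfoo,bar\nbar,baz\nfoo,baz\n"

# chunksize makes read_csv return an iterator of DataFrames (2 rows each here).
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)

# Filter each chunk as it arrives, then stack the surviving rows.
kept = [chunk[chunk["col1"] == "foo"] for chunk in chunks]
df = pd.concat(kept, ignore_index=True)
```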
You certainly can construct a pandas.DataFrame() from a generator of tuples, as of version 0.19 (and probably earlier). Don't use .from_records(); just use the constructor, for example:
import pandas as pd
someGenerator = ( (x, chr(x)) for x in range(48,127) )
someDf = pd.DataFrame(someGenerator)
Produces:
type(someDf) #pandas.core.frame.DataFrame
someDf.dtypes
#0     int64
#1    object
#dtype: object
someDf.tail(10)
#      0  1
#69  117  u
#70  118  v
#71  119  w
#72  120  x
#73  121  y
#74  122  z
#75  123  {
#76  124  |
#77  125  }
#78  126  ~
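If named columns are wanted instead of the default 0 and 1, the constructor's `columns` argument can be combined with the generator. This is a sketch assuming a reasonably recent pandas; the names `code` and `char` are chosen only for illustration.

```python
import pandas as pd

# Generator of (code point, character) tuples for the digits '0'..'9'.
pairs = ((x, chr(x)) for x in range(48, 58))

# The constructor consumes the generator; columns= labels the tuple fields.
df = pd.DataFrame(pairs, columns=["code", "char"])
```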