Create Pandas Dataframe from List of Generators

只愿长相守 posted on 2020-03-26 15:13:21

Question


I have the following question: is there a way to build a DataFrame from a list of Python generator objects? I used a list comprehension to create the list with the data for the DataFrame:

data_list.append([record.Timestamp,record.Value, record.Name, record.desc] for record in records)

I did it this way because a normal list append in a for loop takes about 20 times longer:

for record in records:
    data_list.append([record.Timestamp, record.Value, record.Name, record.desc])

I tried to create the dataframe but it doesn't work:

This:

dataframe = pd.DataFrame(data_list, columns=['timestamp', 'value', 'name', 'desc'])

Throws exception:

ValueError: 4 columns passed, passed data had 142538 columns.

I also tried to use itertools like this:

dataframe = pd.DataFrame(data=([list(elem) for elem in itt.chain.from_iterable(data_list)]), columns=['timestamp', 'value', 'name', 'desc'])

This results in an empty DataFrame:

Empty DataFrame
Columns: [timestamp, value, name, desc]
Index: []

data_list looks like this:

[<generator object St...51DB0>, <generator object St...56EB8>,<generator object St...51F10>, <generator object St...51F68>]

Code for generating the list looks like this:

for events in events_list:
    for record in events:
        data_list.append([record.Timestamp,record.Value, record.Name, record.desc] for record in records)

This is required because of the structure of the events list.

Is there a way to create a DataFrame out of a list of generators? If there is, will it be time-efficient? I save a lot of time by replacing the normal for loop with a list comprehension, but if creating the DataFrame then takes longer, the change is pointless.


Answer 1:


Just turn your data_list into a generator expression as well. For example:

from collections import namedtuple

import pandas as pd

MyData = namedtuple("MyData", ["a"])
# A generator expression wrapping another generator expression
data = (d.a for d in (MyData(i) for i in range(100)))
df = pd.DataFrame(data)

will work just fine. So what you should do is have:

data = ((record.Timestamp,record.Value, record.Name, record.desc) for record in records)
df = pd.DataFrame(data, columns=["Timestamp", "Value", "Name", "Desc"])

The actual reason why your approach does not work is that you have a single entry in your data_list, which is a generator over (I suppose) 142538 records. Pandas tries to cram that single entry into a single row, so all 142538 items, each a list of four elements, become the cells of that one row. It then fails because only 4 column names were passed.
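To make that explanation concrete, here is a minimal sketch using a hypothetical Record namedtuple and five made-up records in place of the asker's data. It only inspects data_list and does not call pandas. It also suggests why the append looked so fast: appending a generator expression stores the generator without touching any record.

```python
from collections import namedtuple

# Hypothetical stand-in for the asker's record type and data
Record = namedtuple("Record", ["Timestamp", "Value", "Name", "desc"])
records = [Record(i, i * 1.5, f"name{i}", f"desc{i}") for i in range(5)]

data_list = []
# Same shape as the question: the argument is a generator expression,
# so exactly one object (the generator) is appended and nothing is evaluated yet
data_list.append([r.Timestamp, r.Value, r.Name, r.desc] for r in records)

n_entries = len(data_list)        # one entry, not one per record

# Materializing that entry shows what pandas would see as a single row:
# one cell per record, each cell a four-element list
first_entry = list(data_list[0])
cells_in_row = len(first_entry)   # equals the number of records
```

With 142538 records, that single row would have 142538 cells, which matches the count in the error message.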

Edit: you can of course make the generator expression more complex, here's an example along the lines of your additional loop over events:

from collections import namedtuple
MyData = namedtuple("MyData", ["a", "b"])
data = ((d.a, d.b) for j in range(100) for d in (MyData(j, j+i) for i in range(100)))
pd.DataFrame(data, columns=["a", "b"])

edit: here's also an example using data structures like you are using:

Record = namedtuple("Record", ["Timestamp", "Value", "Name", "desc"])

event_list = [[Record(Timestamp=1, Value=1, Name=1, desc=1),
               Record(Timestamp=2, Value=2, Name=2, desc=2)],
              [Record(Timestamp=3, Value=3, Name=3, desc=3)]]

data = ((r.Timestamp, r.Value, r.Name, r.desc) for events in event_list for r in events)
pd.DataFrame(data, columns=["timestamp", "value", "name", "desc"])

Output:

   timestamp  value  name  desc
0          1      1     1     1
1          2      2     2     2
2          3      3     3     3
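If you already hold a list of generators, as in the question, one option is to flatten it lazily with itertools.chain.from_iterable. The sketch below reuses a Record namedtuple with made-up values. It also illustrates one plausible cause of the empty DataFrame in the question (an assumption, since the full code is not shown): generators are single-use, so a list whose generators were already consumed by an earlier pd.DataFrame call yields nothing on a second pass.

```python
from collections import namedtuple
from itertools import chain

import pandas as pd

# Hypothetical record type and sample data for illustration
Record = namedtuple("Record", ["Timestamp", "Value", "Name", "desc"])
events_list = [
    [Record(1, 10.0, "a", "first"), Record(2, 20.0, "b", "second")],
    [Record(3, 30.0, "c", "third")],
]

# One generator of row tuples per group of events
data_list = [
    ((r.Timestamp, r.Value, r.Name, r.desc) for r in events)
    for events in events_list
]

# Flatten the generators into a single lazy stream of rows
rows = chain.from_iterable(data_list)
df = pd.DataFrame(rows, columns=["timestamp", "value", "name", "desc"])

# The generators are now exhausted: a second pass produces no rows
leftover = list(chain.from_iterable(data_list))
```

Here df has one row per record, while leftover is empty because every generator in data_list has already been consumed.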



Answer 2:


pd.concat(some_generator_yielding_dfs) will work (this is actually one of the tricks to ease the load of big tables). For example, one may do this:

pd.concat((pd.read_csv(x) for x in files))
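A self-contained variant of the same idea, as a sketch: the files from the snippet above are replaced by two in-memory CSV strings (made-up data) so it runs without touching disk, and ignore_index=True is an addition here to get a fresh 0..n-1 index.

```python
import io

import pandas as pd

# Made-up CSV contents standing in for files on disk
csv_sources = [
    "a,b\n1,2\n3,4\n",
    "a,b\n5,6\n",
]

# Generator of DataFrames: each source is parsed only when pd.concat asks for it
frames = (pd.read_csv(io.StringIO(text)) for text in csv_sources)

# pd.concat accepts any iterable of DataFrames, including a generator
combined = pd.concat(frames, ignore_index=True)
```

Note that pd.concat still has to hold every parsed frame before joining them, so this saves intermediate variables more than it saves peak memory.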



Answer 3:


Solution

  • Make a dict with the columns you need as shown below.
  • Feed the dict to pandas.DataFrame

Note: The use of list(generator) produces all the data as a list.

import pandas as pd

# Materialize the generator once so it can be traversed for every column
records = list(records)

# Method-1: create a dict by direct declaration
d = {
    'timestamp': [record.Timestamp for record in records],
    'value': [record.Value for record in records],
    'name': [record.Name for record in records],
    'desc': [record.desc for record in records],
}

# Method-2: create the same dict using a dict-comprehension
keys = ['Timestamp', 'Value', 'Name', 'desc']
d = {key.lower(): [getattr(record, key) for record in records] for key in keys}

# Finally create the dataframe using the dictionary (one column per key)
dataframe = pd.DataFrame(d)

See Also:

  • Is there any shorthand for 'yield all the output from a generator'?


Source: https://stackoverflow.com/questions/60490980/create-pandas-dataframe-from-list-of-generators
