Python Pandas Create Multiple dataframes from list

回眸只為那壹抹淺笑 submitted on 2019-12-31 04:04:07

Question


Using this as a quick starting point:

http://pandas.pydata.org/pandas-docs/stable/reshaping.html

In [1]: df
Out[1]: 
         date variable     value
0  2000-01-03        A  0.469112
1  2000-01-04        A -0.282863
2  2000-01-05        A -1.509059
3  2000-01-03        B -1.135632
4  2000-01-04        B  1.212112
5  2000-01-05        B -0.173215
6  2000-01-03        C  0.119209
7  2000-01-04        C -1.044236
8  2000-01-05        C -0.861849
9  2000-01-03        D -2.104569
10 2000-01-04        D -0.494929
11 2000-01-05        D  1.071804

Then isolating 'A' gives this:

In [2]: df[df['variable'] == 'A']
Out[2]: 
        date variable     value
0 2000-01-03        A  0.469112
1 2000-01-04        A -0.282863
2 2000-01-05        A -1.509059

Creating a new DataFrame from that would be:

dfA = df[df['variable'] == 'A'] 

And for B it would be:

dfB = df[df['variable'] == 'B'] 

So, to isolate the DataFrames into dfA, dfB, dfC, ... I tried:

dfList  = list(set(df['variable']))
dfNames = ["df" + row for row in dfList]  

for i, row in enumerate(dfList):
    dfName = dfNames[i]
    dfNew = df[df['variable'] == row]
    dfNames[i] = dfNew      

It runs, but when I try to use dfA I get the error that "dfA" is not defined.


Answer 1:


To answer your question literally, globals()['dfA'] = dfNew would define dfA in the global namespace:

for i, row in enumerate(dfList):
    dfName = dfNames[i]
    dfNew = df[df['variable'] == row]
    globals()[dfName] = dfNew   

However, there is never a good reason to define dynamically-named variables.

  • If the names are not known until runtime -- that is, if the names are truly dynamic -- then you can't use the names in your code, since your code has to be written before runtime. So what's the point of creating a variable named dfA if you can't refer to it in your code?

  • If, on the other hand, you know beforehand that you will have a variable named dfA, then your code isn't really dynamic. You have static variable names. The only reason to use the loop is to cut down on boilerplate code. However, even in this case, there is a better alternative: use a dict (see below) or a list [1].

  • Adding dynamically-named variables pollutes the global namespace.

  • It does not generalize well. If you had 100 dynamically named variables, how would you access them? How would you loop over them?

  • To "manage" dynamically named variables you would need to keep a list of their names as strings, e.g. ['dfA', 'dfB', 'dfC', ...], and then access the newly minted global variables via the globals() dict, e.g. globals()['dfA']. That is awkward.
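As a rough illustration of the last two points, the sketch below creates the globals and then has to carry a list of name strings around just to loop over them. The small six-row frame and the names and row_counts variables are made up for this example and are not from the question.

```python
import pandas as pd

# Hypothetical, shortened sample data (not the full frame from the question).
df = pd.DataFrame({
    "variable": ["A", "A", "B", "B", "C", "C"],
    "value": [0.469112, -0.282863, -1.135632, 1.212112, 0.119209, -1.044236],
})

# Bookkeeping required for dynamically named globals: a list of name strings.
names = []
for key in sorted(set(df["variable"])):
    name = "df" + key
    globals()[name] = df[df["variable"] == key]
    names.append(name)

# Every later access has to go back through globals() with a string key.
row_counts = {name: len(globals()[name]) for name in names}
print(row_counts)
```

The names list and the globals() lookups are exactly the work a dict would do for you, without touching the module namespace.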

So the conclusion programmers reach through bitter experience is that dynamically-named variables are somewhere between awkward and useless, and that it is much more pleasant, powerful, and practical to store key/value pairs in a dict. The name of the variable becomes a key in the dict, and the value of the variable becomes the value associated with that key. So, instead of having a bare name dfA, you would have a dict dfs and access the dfA DataFrame via dfs['dfA']:

dfs = dict()
for i, row in enumerate(dfList):
    dfName = dfNames[i]
    dfNew = df[df['variable'] == row]
    dfs[dfName] = dfNew   

or, as Jianxun Li shows,

dfs = {k: g for k, g in df.groupby('variable')}
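Here is a self-contained sketch of the dict approach, assuming a frame rebuilt by hand from the sample data in the question (dates are kept as plain strings for brevity). Note that groupby uses the raw column values ('A', 'B', ...) as keys, whereas the loop version above uses 'dfA', 'dfB', .... The dfs_named variant is a hypothetical name that adds the 'df' prefix in case you want the same keys.

```python
import pandas as pd

# Sample data from the question, typed in by hand; dates left as strings.
df = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-04", "2000-01-05"] * 4,
    "variable": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "value": [0.469112, -0.282863, -1.509059,
              -1.135632, 1.212112, -0.173215,
              0.119209, -1.044236, -0.861849,
              -2.104569, -0.494929, 1.071804],
})

# Iterating a groupby yields (key, sub-DataFrame) pairs.
dfs = {key: group for key, group in df.groupby("variable")}

# Hypothetical variant: prefix the keys to mirror the names in the question.
dfs_named = {f"df{key}": group for key, group in df.groupby("variable")}

print(sorted(dfs))
print(dfs_named["dfA"])
```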

This is why Jon Clements and Jianxun Li answered your question by showing alternatives to defining dynamically-named variables. It's because we all believe it is a terrible idea.


Using Jianxun Li's solution, to loop over a dict's key/value pairs you could then use:

dfs = {k: g for k, g in df.groupby('variable')}
for key, group in dfs.items():  # use a new name so the original df is not rebound
    ...

or using Jon Clements' solution, to iterate through groups you could use:

grouped = df.groupby('variable')
for key, group in grouped:  # use a new name so the original df is not rebound
    ...
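To make the loop body concrete, here is a minimal sketch that computes one number per group. The choice of the mean of the value column is only an example of per-group work, and the means dict is a name introduced here. The loop variable is called group so that the original df stays intact.

```python
import pandas as pd

# Sample data from the question, typed in by hand; dates left as strings.
df = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-04", "2000-01-05"] * 4,
    "variable": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "value": [0.469112, -0.282863, -1.509059,
              -1.135632, 1.212112, -0.173215,
              0.119209, -1.044236, -0.861849,
              -2.104569, -0.494929, 1.071804],
})

# Do some work on each sub-DataFrame; here, the mean of its "value" column.
means = {}
for key, group in df.groupby("variable"):
    means[key] = group["value"].mean()

for key, mean_value in means.items():
    print(key, round(mean_value, 4))
```

For a simple aggregation like this, df.groupby('variable')['value'].mean() gives the same numbers without an explicit loop; the loop form is useful when each group needs more involved handling.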

[1] If the names are numbered or ordered, you could use a list instead of a dict.
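A minimal sketch of the list variant, assuming position is all you need to tell the groups apart. The frames name and the shortened sample data are made up for this example. It relies on groupby sorting the group keys by default, so index 0 is the 'A' group.

```python
import pandas as pd

# Hypothetical, shortened sample data (not the full frame from the question).
df = pd.DataFrame({
    "variable": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "value": [0.469112, -0.282863, -1.135632, 1.212112,
              0.119209, -1.044236, -2.104569, -0.494929],
})

# groupby sorts keys by default, so the list order is A, B, C, D.
frames = [group for _, group in df.groupby("variable")]

print(len(frames))
print(frames[1])  # the 'B' rows
```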




Answer 2:


Use groupby and get_group, e.g.:

grouped = df.groupby('variable')

Then when you want to do something with each group, access it as such:

my_group = grouped.get_group('A')

Gives you:

         date variable     value
0  2000-01-03        A  0.469112
1  2000-01-04        A -0.282863
2  2000-01-05        A -1.509059
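A short, self-contained sketch of the same idea, assuming a hand-built frame with a subset of the question's data. It also shows how to list the available group labels and what happens when a label does not exist; the label 'Z' and the missing flag are hypothetical and used only to trigger and record the error.

```python
import pandas as pd

# Hypothetical, shortened sample data (not the full frame from the question).
df = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-04", "2000-01-03", "2000-01-04"],
    "variable": ["A", "A", "B", "B"],
    "value": [0.469112, -0.282863, -1.135632, 1.212112],
})

grouped = df.groupby("variable")

# The labels that get_group will accept.
labels = sorted(grouped.groups)
print(labels)

my_group = grouped.get_group("A")
print(my_group)

# Asking for a label that is not present raises KeyError.
try:
    grouped.get_group("Z")
    missing = False
except KeyError:
    missing = True
print(missing)
```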



Answer 3:


df.groupby('variable') returns a GroupBy object that you can iterate over to get (key, DataFrame) pairs. So to get a dict of subgroups:

result = {k: g for k, g in df.groupby('variable')}

from pprint import pprint
pprint(result)

{'A':          date variable   value
0  2000-01-03        A  0.4691
1  2000-01-04        A -0.2829
2  2000-01-05        A -1.5091,
 'B':          date variable   value
3  2000-01-03        B -1.1356
4  2000-01-04        B  1.2121
5  2000-01-05        B -0.1732,
 'C':          date variable   value
6  2000-01-03        C  0.1192
7  2000-01-04        C -1.0442
8  2000-01-05        C -0.8618,
 'D':           date variable   value
9   2000-01-03        D -2.1046
10  2000-01-04        D -0.4949
11  2000-01-05        D  1.0718}


result['A']

         date variable   value
0  2000-01-03        A  0.4691
1  2000-01-04        A -0.2829
2  2000-01-05        A -1.5091
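As the output above shows, each sub-DataFrame keeps its original row labels (the 'B' rows are still labelled 3, 4, 5). If you would rather have every piece start at 0, one option is to reset the index while building the dict. This is a sketch under that assumption, not part of the original answer; the data is the question's sample typed in by hand.

```python
import pandas as pd

# Sample data from the question, typed in by hand; dates left as strings.
df = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-04", "2000-01-05"] * 4,
    "variable": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "value": [0.469112, -0.282863, -1.509059,
              -1.135632, 1.212112, -0.173215,
              0.119209, -1.044236, -0.861849,
              -2.104569, -0.494929, 1.071804],
})

# drop=True discards the old row labels instead of keeping them as a column.
result = {k: g.reset_index(drop=True) for k, g in df.groupby("variable")}

print(result["B"])
```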


Source: https://stackoverflow.com/questions/31927309/python-pandas-create-multiple-dataframes-from-list
