Python Pandas Create Multiple dataframes from list

问题

Using this as a quick starting point;

http://pandas.pydata.org/pandas-docs/stable/reshaping.html

In [1]: df
Out[1]: 
         date variable     value
0  2000-01-03        A  0.469112
1  2000-01-04        A -0.282863
2  2000-01-05        A -1.509059
3  2000-01-03        B -1.135632
4  2000-01-04        B  1.212112
5  2000-01-05        B -0.173215
6  2000-01-03        C  0.119209
7  2000-01-04        C -1.044236
8  2000-01-05        C -0.861849
9  2000-01-03        D -2.104569
10 2000-01-04        D -0.494929
11 2000-01-05        D  1.071804

Then isolating 'A' gives this:

In [2]: df[df['variable'] == 'A']
Out[2]: 
        date variable     value
0 2000-01-03        A  0.469112
1 2000-01-04        A -0.282863
2 2000-01-05        A -1.509059

Now creating new dataframe would be:

dfA = df[df['variable'] == 'A']

Lets say B's would be:

dfB = df[df['variable'] == 'B']

So, Isolating the dataframes into dfA, dfB, dfC......

dfList  = list(set(df['variable']))
dfNames = ["df" + row for row in dfList]  

for i, row in enumerate(dfList):
    dfName = dfNames[i]
    dfNew = df[df['variable'] == row]
    dfNames[i] = dfNew

It runs... But when try dfA I get output "dfA" is not defined

回答1:

To answer your question literally, globals()['dfA'] = dfNew would define dfA in the global namespace:

for i, row in enumerate(dfList):
    dfName = dfNames[i]
    dfNew = df[df['variable'] == row]
    globals()[dfName] = dfNew

However, there is never a good reason to define dynamically-named variables.

If the names are not known until runtime -- that is, if the names are truly dynamic -- then you you can't use the names in your code since your code has to be written before runtime. So what's the point of creating a variable named dfA if you can't refer to it in your code?
If, on the other hand, you know before hand that you will have a variable named dfA, then your code isn't really dynamic. You have static variable names. The only reason to use the loop is to cut down on boiler-plate code. However, even in this case, there is a better alternative. The solution is to use a dict (see below) or list¹.
Adding dynamically-named variables pollutes the global namespace.
It does not generalize well. If you had 100 dynamically named variables, how would you access them? How would you loop over them?
To "manage" dynamically named variables you would need to keep a list of their names as strings: e.g. ['dfA', 'dfB', 'dfC',...] and then accessed the newly minted global variables via the globals() dict: e.g. globals()['dfA']. That is awkward.

So the conclusion programmers reach through bitter experience is that dynamically-named variables are somewhere between awkward and useless and it is much more pleasant, powerful, practical to store key/value pairs in a dict. The name of the variable becomes a key in the dict, and the value of the variable becomes the value associated with the key. So, instead of having a bare name dfA you would have a dict dfs and you would access the dfA DataFrame via dfs['dfA']:

dfs = dict()
for i, row in enumerate(dfList):
    dfName = dfNames[i]
    dfNew = df[df['variable'] == row]
    dfs[dfName] = dfNew

or, as Jianxun Li shows,

dfs = {k: g for k, g in df.groupby('variable')}

This is why Jon Clements and Jianxun Li answered your question by showing alternatives to defining dynamically-named variables. It's because we all believe it is a terrible idea.

Using Jianxun Li's solution, to loop over a dict's key/value pairs you could then use:

dfs = {k: g for k, g in df.groupby('variable')}
for key, df in dfs.items():
    ...

or using Jon Clements' solution, to iterate through groups you could use:

grouped = df.groupby('variable')
for key, df in grouped:
    ...

¹If the names are numbered or ordered you could use a list instead of a dict.

回答2:

Use groupby and get_group, eg:

grouped = df.groupby('variable')

Then when you want to do something with each group, access it as such:

my_group = grouped.get_group('A')

Gives you:

    date    variable    value
0   2000-01-03  A   0.469112
1   2000-01-04  A   -0.282863
2   2000-01-05  A   -1.509059

回答3:

df.groupby('variable') returns an iterator with key/df pairs. So to get a list/dict of subgroups,

result = {k: g for k, g in df.groupby('variable')}

from pprint import pprint
pprint(result)

{'A':          date variable   value
0  2000-01-03        A  0.4691
1  2000-01-04        A -0.2829
2  2000-01-05        A -1.5091,
 'B':          date variable   value
3  2000-01-03        B -1.1356
4  2000-01-04        B  1.2121
5  2000-01-05        B -0.1732,
 'C':          date variable   value
6  2000-01-03        C  0.1192
7  2000-01-04        C -1.0442
8  2000-01-05        C -0.8618,
 'D':           date variable   value
9   2000-01-03        D -2.1046
10  2000-01-04        D -0.4949
11  2000-01-05        D  1.0718}


result['A']

         date variable   value
0  2000-01-03        A  0.4691
1  2000-01-04        A -0.2829
2  2000-01-05        A -1.5091

来源：https://stackoverflow.com/questions/31927309/python-pandas-create-multiple-dataframes-from-list

标签

python

pandas

ipython