Split a dataframe into correspondingly named arrays or series (then recombine)

Question

Let's say I have a dataframe with columns x and y. I'd like to automatically split it into arrays (or series) that have the same names as the columns, process the data, and then later rejoin them. It's pretty straightforward to do this manually:

x, y = df.x, df.y
z = x + y   # in actual use case, there are hundreds of lines like this
df = pd.concat([x,y,z],axis=1)

But I'd like to automate this. It's easy to get a list of strings with df.columns, but I really want [x,y] rather than ['x','y']. The best I can do so far is to work around that with exec:

import numpy as np
from pandas import DataFrame

df_orig = DataFrame({ 'x':range(1000), 'y':range(1000,2000),  'z':np.zeros(1000) })

def method1( df ):

   for col in df.columns:
      exec( col + ' = df.' + col + '.values')

   z = x + y   # in actual use case, there are hundreds of lines like this

   for col in df.columns:   
      exec( 'df.' + col + '=' + col )

df = df_orig.copy() 
method1( df )         # df is passed by reference and modified in place, so there is no need to return it
df1 = df

So there are 2 issues:

1) Using exec like this is generally considered a bad idea, and it has already caused me a problem when I tried to combine it with numba. Or is it actually acceptable here? It seems to work fine for series and arrays.

2) I'm not sure of the best way to take advantage of views here. Ideally, all I really want is to use x as a view of df.x. I assume that is not possible when x is an array, but maybe it is if x is a series?

The example above is for arrays, but ideally I'm looking for a solution that also applies to series. In lieu of that, solutions that work with one or the other are welcome of course.

Motivation:

1) Readability, which can partially be achieved with eval, but I don't believe eval can be used over multiple lines?

2) With multiple lines like z=x+y, this method is a little faster with series (2x or 3x in examples I've tried) and even faster with arrays (over 10x). See the related question "Fastest way to numerically process 2d-array: dataframe vs series vs array vs numba".
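On the multi-line eval point in the motivation: to my knowledge, DataFrame.eval in reasonably recent pandas versions accepts a multi-line string with one assignment per line, and later lines may refer to columns assigned on earlier lines. The sketch below uses a small made-up frame; it addresses readability only and is not a claim about speed.

```python
import pandas as pd

# Small illustrative frame (values are made up for the example).
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# Each line of the string is one assignment; 'w' uses 'z' from the line above.
# With the default inplace=False, eval returns a new DataFrame.
out = df.eval(
    """
    z = x + y
    w = z * 2
    """
)
```

The original df is left unchanged here; pass inplace=True if you want the new columns written back to it.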

Answer 1:

This doesn't do exactly what you want, but it is another path to consider.

There's a gist that defines a context manager, DataFrameContextManager, which lets you reference columns as if they were local variables. I didn't write it, and it's a little old, but it still seems to work with the current version of pandas.

In [45]: df = pd.DataFrame({'x': np.random.randn(100000), 'y': np.random.randn(100000)})

In [46]: with DataFrameContextManager(df):
    ...:     z = x + y
    ...:     

In [47]: z.head()
Out[47]: 
0   -0.821079
1    0.035018
2    1.180576
3   -0.155916
4   -2.253515
dtype: float64
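Since the gist itself is not reproduced here, the following is only a rough sketch of how such a context manager could work; it is not the gist's implementation. The name columns_as_names and its namespace parameter are hypothetical. It temporarily binds each column in a namespace dict (for example globals()) and restores the previous state on exit, so it works for module-level code but not for local variables inside a function.

```python
from contextlib import contextmanager

import numpy as np
import pandas as pd


@contextmanager
def columns_as_names(df, namespace):
    """Hypothetical helper: temporarily bind each column of df to a
    same-named entry in namespace, then restore the old entries."""
    missing = object()  # sentinel for "name did not exist before"
    saved = {col: namespace.get(col, missing) for col in df.columns}
    namespace.update({col: df[col] for col in df.columns})
    try:
        yield df
    finally:
        for col, old in saved.items():
            if old is missing:
                namespace.pop(col, None)  # remove names we introduced
            else:
                namespace[col] = old      # restore shadowed names


df = pd.DataFrame({"x": np.arange(5), "y": np.arange(5, 10)})

with columns_as_names(df, globals()):
    z = x + y  # x and y resolve to df["x"] and df["y"] inside the block
```

Column names must be valid Python identifiers for this to be useful, and anything assigned inside the block (such as z) is an ordinary variable that still has to be written back to the dataframe explicitly.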

Answer 2:

Just use indexing notation and a dictionary, instead of attribute notation.

import numpy as np
from pandas import DataFrame

df_orig = DataFrame({ 'x':range(1000), 'y':range(1000,2000),  'z':np.zeros(1000) })

def method1( df ):

   series = {}
   for col in df.columns:
      series[col] = df[col]

   series['z'] = series['x'] + series['y']   # in actual use case, there are hundreds of lines like this

   for col in df.columns:   
      df[col] = series[col]

df = df_orig.copy() 
method1( df )         # df is passed by reference and modified in place, so there is no need to return it
df1 = df
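If the goal is to keep bare names like x and y in the processing code without exec, one possible variation on the dictionary idea is to unpack the dictionary into keyword arguments of a function. This is a sketch under assumptions, not part of the answer above: the function name compute and the catch-all parameter unused are hypothetical, and the arrays are obtained with to_numpy().

```python
import numpy as np
import pandas as pd


def compute(x, y, **unused):
    """Hypothetical processing function: columns arrive as real local names.
    Columns it does not need (here the old 'z') land in 'unused'."""
    z = x + y  # in the real use case there would be many lines like this
    return {"z": z}


df = pd.DataFrame(
    {"x": np.arange(1000), "y": np.arange(1000, 2000), "z": np.zeros(1000)}
)

# Split: one NumPy array per column, keyed by column name.
arrays = {col: df[col].to_numpy() for col in df.columns}

# Process and recombine: assign() overwrites 'z' and returns a new DataFrame.
df = df.assign(**compute(**arrays))
```

Because x and y are ordinary function parameters, the body of compute contains no string evaluation, which may also make it easier to combine with tools such as numba.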

Source: https://stackoverflow.com/questions/25896654/split-a-dataframe-into-correspondingly-named-arrays-or-series-then-recombine

Tags: python, pandas, numpy, numba