Split a dataframe into correspondingly named arrays or series (then recombine)

走远了吗. Submitted on 2019-12-08 11:25:59

Question


Let's say I have a dataframe with columns x and y. I'd like to automatically split it into arrays (or series) that have the same names as the columns, process the data, and then later rejoin them. It's pretty straightforward to do this manually:

x, y = df.x, df.y
z = x + y   # in actual use case, there are hundreds of lines like this
df = pd.concat([x,y,z],axis=1)

But I'd like to automate this. It's easy to get a list of strings with df.columns, but I really want [x,y] rather than ['x','y']. The best I can do so far is to work around that with exec:

import numpy as np
import pandas as pd
from pandas import DataFrame

df_orig = DataFrame({ 'x':range(1000), 'y':range(1000,2000),  'z':np.zeros(1000) })

def method1( df ):

   for col in df.columns:
      exec( col + ' = df.' + col + '.values')

   z = x + y   # in actual use case, there are hundreds of lines like this

   for col in df.columns:   
      exec( 'df.' + col + '=' + col )

df = df_orig.copy() 
method1( df )         # df appears to be view of global df, no need to return it
df1 = df

So there are 2 issues:

1) Using exec like this is generally considered bad practice, and it has already caused me a problem when I tried to combine it with numba. Or is it actually acceptable here? It seems to work fine for series and arrays.

2) I'm not sure of the best way to take advantage of views here. Ideally, all I really want is to use x as a view of df.x. I assume that is not possible when x is an array, but maybe it is if x is a series?

The example above is for arrays, but ideally I'm looking for a solution that also applies to series. In lieu of that, solutions that work with one or the other are welcome of course.

Motivation:

1) Readability, which can partially be achieved with eval, but I don't believe eval can be used over multiple lines?

2) With multiple lines like z=x+y, this method is a little faster with series (2x or 3x in examples I've tried) and even faster with arrays (over 10x). See here: Fastest way to numerically process 2d-array: dataframe vs series vs array vs numba
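On the eval point in motivation 1: pandas documents that DataFrame.eval accepts a multi-line string, provided every line is an assignment. The following is a minimal sketch on a small made-up frame, not a measurement of speed. The column w is a hypothetical extra column added only to show a second line.

```python
import pandas as pd

# Small illustrative frame; the real data would have many more rows.
df = pd.DataFrame({"x": range(5), "y": range(5, 10)})

# Each line of the string must be an assignment to a column name.
# With the default inplace=False, eval returns a new DataFrame.
df = df.eval(
    """
z = x + y
w = x - y
"""
)

print(df)
```

This keeps the expressions readable, but everything lives inside a string, so it may not suit hundreds of lines of ordinary Python or anything that has to be compiled with numba.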


Answer 1:


This doesn't do exactly what you want, but it is another path to think about.

There's a gist here that defines a context manager that allows you to reference columns as if they were locals. I didn't write this, and it's a little old, but still seems to work with the current version of pandas.

In [45]: df = pd.DataFrame({'x': np.random.randn(100000), 'y': np.random.randn(100000)})

In [46]: with DataFrameContextManager(df):
    ...:     z = x + y
    ...:     

In [47]: z.head()
Out[47]: 
0   -0.821079
1    0.035018
2    1.180576
3   -0.155916
4   -2.253515
dtype: float64
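The gist itself is not reproduced in this thread, so the following is only a hypothetical sketch of how a context manager of this kind could be written. It is not the gist's code. The helper name columns_in_namespace and its namespace argument are assumptions made for illustration. It also assumes the column names are valid Python identifiers and that the with block runs at module level, where bare names resolve through globals().

```python
from contextlib import contextmanager

import numpy as np
import pandas as pd


@contextmanager
def columns_in_namespace(df, namespace):
    """Temporarily bind each column of df as a name in `namespace`.

    Hypothetical helper for illustration only.
    """
    missing = object()
    # Remember anything the column names would shadow.
    saved = {col: namespace.get(col, missing) for col in df.columns}
    namespace.update({col: df[col] for col in df.columns})
    try:
        yield
    finally:
        # Restore the previous bindings, or drop the temporary names.
        for col, old in saved.items():
            if old is missing:
                namespace.pop(col, None)
            else:
                namespace[col] = old


df = pd.DataFrame({"x": np.arange(5), "y": np.arange(5, 10)})

with columns_in_namespace(df, globals()):
    z = x + y  # x and y resolve to df["x"] and df["y"] via module globals

df["z"] = z
```

Writing into globals() is fragile (it can shadow existing names and confuses linters), so this is better suited to interactive sessions than to library code.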



Answer 2:


Just use indexing notation and a dictionary, instead of attribute notation.

import numpy as np
from pandas import DataFrame

df_orig = DataFrame({ 'x':range(1000), 'y':range(1000,2000),  'z':np.zeros(1000) })

def method1( df ):

   # Collect every column in a dict keyed by column name
   series = {}
   for col in df.columns:
      series[col] = df[col]

   series['z'] = series['x'] + series['y']   # in actual use case, there are hundreds of lines like this

   # Write every entry back, including any newly created columns
   for col in series:
      df[col] = series[col]

df = df_orig.copy() 
method1( df )         # method1 modifies df in place, so there is no need to return it
df1 = df
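If bare names such as x and y are still wanted inside the processing code, one variation on the dictionary idea is to unpack the dictionary into the keyword arguments of a function. The sketch below assumes the column names are valid Python identifiers. The function name process and the catch-all parameter _unused are hypothetical, chosen only for this example.

```python
import numpy as np
import pandas as pd


def process(x, y, **_unused):
    """Stand-in for the long block of arithmetic (hypothetical).

    Columns not named in the signature land in _unused and are ignored.
    """
    z = x + y
    return {"z": z}


df = pd.DataFrame({"x": range(1000), "y": range(1000, 2000), "z": np.zeros(1000)})

# Unpack every column into a keyword argument named after it.
# Use df[col] instead of df[col].to_numpy() to pass series rather than arrays.
arrays = {col: df[col].to_numpy() for col in df.columns}
results = process(**arrays)

# Write the computed arrays back; DataFrame.assign returns a new frame.
df = df.assign(**results)
```

Because x and y are ordinary function parameters here, no exec is involved, and the body of process is plain array code.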


Source: https://stackoverflow.com/questions/25896654/split-a-dataframe-into-correspondingly-named-arrays-or-series-then-recombine
