pandas query with a column consisting of array entries

Submitted by 一个人想着一个人 on 2019-12-07 12:30:21

Question


ykp.data
Out[182]: 
    state  action  reward  
0    [41]       5      59  
1     [5]      52      48  
2    [46]      35      59  
3    [42]      16      12  
4    [43]      37      48   
5    [36]       5      59   
6    [49]      52      48 
7    [39]      11      23 

I would like to find the row whose state entry is [42], so I ran

ykp.data.query('state == [42]')

but I get

Empty DataFrame
Columns: [state, action, reward]
Index: []

when I should be seeing [42], 16, 12.

Can someone please tell me how I can work around this? I need my state values to be stored as arrays.


Answer 1:


It is best to avoid pd.Series.apply here. Instead, you can use itertools.chain to flatten the lists into a regular NumPy array, then compare that array to an integer to form a Boolean array for indexing:

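from itertools import chain

import numpy as np
import pandas as pd

# Build a test frame whose 'state' column holds one-element lists
df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 1)), columns=['state'])
df = df.assign(state=df.state.apply(lambda x: [x]))

def wen(df):
    # Compare string representations without mutating the caller's frame
    return df.assign(state=df.state.astype(str)).query("state == '[42]'")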

%timeit df[np.array(list(chain.from_iterable(df['state'].values))) == 42]  # 14.2 ms
%timeit df[df.state.apply(tuple) == (42,)]                                 # 41.9 ms
%timeit df.loc[df.state.apply(lambda x: x==[42])]                          # 33.9 ms
%timeit wen(df)                                                            # 19.9 ms
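
As a minimal sketch of the itertools.chain approach on the data from the question (the frame name `data` and the variable names are illustrative, and the code assumes every state list holds exactly one element, since otherwise the flattened array would not line up with the rows):

```python
from itertools import chain

import numpy as np
import pandas as pd

# Sample data mirroring the question; each state is a one-element list.
data = pd.DataFrame({
    "state": [[41], [5], [46], [42], [43], [36], [49], [39]],
    "action": [5, 52, 35, 16, 37, 5, 52, 11],
    "reward": [59, 48, 59, 12, 48, 59, 48, 23],
})

# Flatten the one-element lists into a flat integer array
# (assumption: exactly one element per list, so lengths match the row count).
flat_states = np.array(list(chain.from_iterable(data["state"].values)))

# Compare against the integer and use the Boolean array as a row mask.
result = data[flat_states == 42]
print(result)  # selects the row with index 3 (action 16, reward 12)
```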

Better still, don't store lists in your dataframe. Use a regular integer series instead; it is more efficient in both memory and performance.
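
If the lists can be dropped, one possible way to do the conversion is sketched below. It assumes each list contains exactly one integer; the names `data`, `flat` and `match` are illustrative only.

```python
import pandas as pd

# Sample data mirroring the question; each state is a one-element list.
data = pd.DataFrame({
    "state": [[41], [5], [46], [42], [43], [36], [49], [39]],
    "action": [5, 52, 35, 16, 37, 5, 52, 11],
    "reward": [59, 48, 59, 12, 48, 59, 48, 23],
})

# Unwrap each one-element list so 'state' becomes a plain integer column
# (assumption: every list has exactly one element).
flat = data.assign(state=[s[0] for s in data["state"]])

# With an integer column, query works with an ordinary comparison.
match = flat.query("state == 42")
print(match)
```

With an integer column, the query from the question works once `[42]` is replaced by `42`.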




Answer 2:


You can convert the column to strings with astype(str) and then query against the string form:

df.state=df.state.astype(str)
df.query("state == '[42]'")
Out[290]: 
  state  action  reward
3  [42]      16      12



Answer 3:


print(df[df.state.apply(tuple) == (42,)])
  state  action  reward
3  [42]      16      12

Another solution (from the @user3483203 comment below):

df.loc[df.state.apply(lambda x: x==[42])]

But the tuple-based version was about 14% faster in this timing:

df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 1)), columns=['state'])
df = df.assign(state=df.state.apply(lambda x: [x]))

%timeit df[df.state.apply(tuple) == (42,)]
10 loops, best of 3: 24.8 ms per loop

%timeit df.loc[df.state.apply(lambda x: x==[42])]
10 loops, best of 3: 28.8 ms per loop


Source: https://stackoverflow.com/questions/51488681/pandas-query-with-a-column-consisting-of-array-entries
