Select a multiple-key cross section from a DataFrame

青春惊慌失措 2020-12-05 18:38

I have a DataFrame "df" with a (time, ticker) MultiIndex and bid/ask/etc. data columns:


                          tod    last     bid      ask      volume
    tim

4 answers

失恋的感觉 (original poster)

2020-12-05 19:18

For what it is worth, I did the following:

foo = pd.DataFrame(np.random.rand(12,3), 
                   index=pd.MultiIndex.from_product([['A','B','C','D'],['Green','Red','Blue']], 
                                                    names=['Letter','Color']),
                   columns=['X','Y','Z']).sort_index()

foo.reset_index()\
   .loc[foo.reset_index().Color.isin({'Green','Red'})]\
   .set_index(foo.index.names)

This approach is similar to select, but avoids iterating over all rows with a lambda.
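
As a side sketch, not part of the timings in this answer: the same rows can be picked with a boolean mask built directly from the 'Color' level, which skips the reset_index/set_index round trip. The names mask, subset, flat and round_trip are only illustrative.

```python
import numpy as np
import pandas as pd

# Same example frame as above, rebuilt so the snippet is self-contained.
foo = pd.DataFrame(
    np.random.rand(12, 3),
    index=pd.MultiIndex.from_product(
        [['A', 'B', 'C', 'D'], ['Green', 'Red', 'Blue']],
        names=['Letter', 'Color'],
    ),
    columns=['X', 'Y', 'Z'],
).sort_index()

# Boolean mask computed from the 'Color' level only; the index stays in place.
mask = foo.index.get_level_values('Color').isin(['Green', 'Red'])
subset = foo[mask]

# The reset_index round trip from the text, kept here for comparison.
flat = foo.reset_index()
round_trip = flat.loc[flat['Color'].isin(['Green', 'Red'])].set_index(foo.index.names)
```

Both results should contain the same eight rows in the same order; no timing claim is made for the mask variant.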

However, I compared the reset_index/loc approach to the Panel approach, and the Panel solution appears to be faster (2.91 ms for index/loc vs 1.18 ms for to_panel/to_frame):

foo.to_panel()[:,:,['Green','Red']].to_frame()

Times:

In [56]:
%%timeit
foo.reset_index().loc[foo.reset_index().Color.isin({'Green','Red'})].set_index(foo.index.names)
100 loops, best of 3: 2.91 ms per loop

In [57]:
%%timeit
foo2 = foo.reset_index()
foo2.loc[foo2.Color.eq('Green') | foo2.Color.eq('Red')].set_index(foo.index.names)
100 loops, best of 3: 2.85 ms per loop

In [58]:
%%timeit
foo2 = foo.reset_index()
foo2.loc[foo2.Color.ne('Blue')].set_index(foo.index.names)
100 loops, best of 3: 2.37 ms per loop

In [54]:
%%timeit
foo.to_panel()[:,:,['Green','Red']].to_frame()
1000 loops, best of 3: 1.18 ms per loop

UPDATE

After revisiting this topic (again), I observed the following:

In [100]:
%%timeit
foo2 = pd.DataFrame({k: foo.loc[k] for k in foo.index if k[1] in ['Green','Red']}).transpose()
foo2.index.names = foo.index.names
foo2.columns.names = foo.columns.names
100 loops, best of 3: 1.97 ms per loop

In [101]:
%%timeit
foo2 = pd.DataFrame.from_dict({k: foo.loc[k] for k in foo.index if k[1] in ['Green','Red']}, orient='index')
foo2.index.names = foo.index.names
foo2.columns.names = foo.columns.names
100 loops, best of 3: 1.82 ms per loop

If you don't care about preserving the original order and naming of the levels, you can use:

%%timeit
pd.concat({key: foo.xs(key, axis=0, level=1) for key in ['Green','Red']}, axis=0)
1000 loops, best of 3: 1.31 ms per loop
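
If the original level order and names do matter, one possible variant (a sketch, not timed here) is to call xs with drop_level=False so that each piece keeps both index levels, then concatenate without keys. The names parts and restored are illustrative only.

```python
import numpy as np
import pandas as pd

# Same example frame as above, rebuilt so the snippet is self-contained.
foo = pd.DataFrame(
    np.random.rand(12, 3),
    index=pd.MultiIndex.from_product(
        [['A', 'B', 'C', 'D'], ['Green', 'Red', 'Blue']],
        names=['Letter', 'Color'],
    ),
    columns=['X', 'Y', 'Z'],
).sort_index()

# drop_level=False keeps the 'Color' level in each cross section,
# so every piece still carries the full (Letter, Color) index.
parts = [foo.xs(color, level='Color', drop_level=False) for color in ['Green', 'Red']]

# Concatenate the pieces and sort to get back the (Letter, Color) ordering.
restored = pd.concat(parts).sort_index()
```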

And if you are just selecting on the first level:

%%timeit
pd.concat({key: foo.loc[key] for key in ['A','B']}, axis=0, names=foo.index.names)
1000 loops, best of 3: 1.12 ms per loop

versus:

%%timeit
foo.to_panel()[:,['A','B'],:].to_frame()
1000 loops, best of 3: 1.16 ms per loop
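
For selection on the first level only, a plain list passed to .loc should also work and keeps both levels and their names. This is a sketch that was not part of the timings above; first_level is an illustrative name.

```python
import numpy as np
import pandas as pd

# Same example frame as above, rebuilt so the snippet is self-contained.
foo = pd.DataFrame(
    np.random.rand(12, 3),
    index=pd.MultiIndex.from_product(
        [['A', 'B', 'C', 'D'], ['Green', 'Red', 'Blue']],
        names=['Letter', 'Color'],
    ),
    columns=['X', 'Y', 'Z'],
).sort_index()

# A list of labels given to .loc is matched against the first index level.
first_level = foo.loc[['A', 'B']]
```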

Another Update

If you sort the index of the example foo, many of the times above improve (the times have been updated to reflect a pre-sorted index). Moreover, once the index is sorted, you can use the solution described by user674155:

%%timeit
foo.loc[(slice(None), ['Blue','Red']),:]
1000 loops, best of 3: 582 µs per loop

This is the most efficient and intuitive in my opinion (the user doesn't need to understand panels and how they are created from frames).
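
A self-contained sketch of that selection follows, together with the equivalent pd.IndexSlice spelling, which some readers may find easier to read than slice(None). The names by_slice and by_indexslice are illustrative.

```python
import numpy as np
import pandas as pd

# Same example frame as above, with a sorted index.
foo = pd.DataFrame(
    np.random.rand(12, 3),
    index=pd.MultiIndex.from_product(
        [['A', 'B', 'C', 'D'], ['Green', 'Red', 'Blue']],
        names=['Letter', 'Color'],
    ),
    columns=['X', 'Y', 'Z'],
).sort_index()

# slice(None) means "everything" on the first level;
# the list picks labels on the second level.
by_slice = foo.loc[(slice(None), ['Blue', 'Red']), :]

# pd.IndexSlice lets the same selection be written with ordinary slice syntax.
idx = pd.IndexSlice
by_indexslice = foo.loc[idx[:, ['Blue', 'Red']], :]
```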

Note: even if the index has not yet been sorted, sorting the index of foo on the fly is comparable in performance to the to_panel option.
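
The note about sorting on the fly can be sketched as follows, assuming a frame whose index was built without the final sort_index() call. The names unsorted and selected are hypothetical, and no timing is claimed here.

```python
import numpy as np
import pandas as pd

# Same data layout as foo, but WITHOUT the trailing sort_index():
# from_product keeps the given label order (Green, Red, Blue).
unsorted = pd.DataFrame(
    np.random.rand(12, 3),
    index=pd.MultiIndex.from_product(
        [['A', 'B', 'C', 'D'], ['Green', 'Red', 'Blue']],
        names=['Letter', 'Color'],
    ),
    columns=['X', 'Y', 'Z'],
)

# Sort only when needed, then apply the slice-based selection.
frame = unsorted if unsorted.index.is_monotonic_increasing else unsorted.sort_index()
selected = frame.loc[(slice(None), ['Blue', 'Red']), :]
```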
