Equality in Pandas DataFrames - Column Order Matters?

说谎 2020-12-13 19:24

As part of a unit test, I need to test two DataFrames for equality. The order of the columns in the DataFrames is not important to me. However, it seems to matter to pandas.

8 Answers
  •  无人及你
    2020-12-13 19:50

    Usually you're going to want speedy tests and the sorting method can be brutally inefficient for larger indices (like if you were using rows instead of columns for this problem). The sort method is also susceptible to false negatives on non-unique indices.
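
    For reference, the sorting method mentioned above might look like the sketch below for the column-order case. The frame names and values are made up for illustration; only sort_index and assert_frame_equal come from pandas.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Two frames holding the same data with the columns in a different order
# (illustrative values, not taken from the question).
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
right = pd.DataFrame({"b": [3, 4], "a": [1, 2]})

# A plain comparison fails because column order is checked by default.
try:
    assert_frame_equal(left, right)
    order_matters = False
except AssertionError:
    order_matters = True

# Sorting both frames by column label first makes the comparison pass.
assert_frame_equal(left.sort_index(axis=1), right.sort_index(axis=1))
```

    This works for small frames with unique labels, but it pays for a sort on both sides before the comparison even starts.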

    Fortunately, assert_frame_equal (importable from pandas.testing; the older pandas.util.testing path is deprecated) has since been updated with a check_like option. Set this to True and the ordering of the index and columns will not be considered in the test.
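
    A minimal sketch of check_like in a test, assuming small made-up frames (the names expected, actual and changed are illustrative only):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

# Same data with both the rows and the columns reordered.
actual = expected.loc[[1, 0], ["b", "a"]]

# Passes: check_like=True ignores the order of the index and the columns.
assert_frame_equal(actual, expected, check_like=True)

# Real differences in the values are still reported.
changed = actual.assign(a=[99, 1])
try:
    assert_frame_equal(changed, expected, check_like=True)
    detected = False
except AssertionError:
    detected = True
```

    Only the ordering is relaxed here; labels, dtypes and values are still compared.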

    With non-unique indices, you'll get the cryptic ValueError: cannot reindex from a duplicate axis. This is raised by the under-the-hood reindex_like operation that rearranges one of the DataFrames to match the other's order. Reindexing is much faster than sorting as evidenced below.

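    import numpy as np
    import pandas as pd
    from pandas.testing import assert_frame_equal  # pandas.util.testing is deprecated
    
    # One million rows, shuffled two different ways (same data, different row order)
    df  = pd.DataFrame(np.arange(1e6))
    df1 = df.sample(frac=1, random_state=42)
    df2 = df.sample(frac=1, random_state=43)
    
    # Sort both frames first, then compare (%timeit is an IPython magic)
    %timeit -n 1 -r 5 assert_frame_equal(df1.sort_index(), df2.sort_index())
    ## 5.73 s ± 329 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
    
    # Let assert_frame_equal reindex one frame to match the other
    %timeit -n 1 -r 5 assert_frame_equal(df1, df2, check_like=True)
    ## 1.04 s ± 237 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)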
    

    For those who enjoy a good performance comparison plot:

    [Plot: reindexing vs. sorting on int and str indices; the gap is even more drastic for str]
