Pandas DataFrames with NaNs equality comparison

没有蜡笔的小新 2020-11-29 07:26

In the context of unit testing some functions, I'm trying to establish the equality of two DataFrames using Python pandas:

ipdb> expect
                            


        
5 answers
  • 2020-11-29 07:41

    You can use assert_frame_equal with check_names=False (so as not to check the index/column names), which raises an AssertionError if the frames are not equal:

    In [11]: from pandas.testing import assert_frame_equal
    
    In [12]: assert_frame_equal(df, expected, check_names=False)
    

    You can wrap this in a function with something like:

    try:
        assert_frame_equal(df, expected, check_names=False)
        return True
    except AssertionError:
        return False
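
For reference, the fragment above could be wrapped into a complete function roughly as follows. This is only a sketch: the function name frames_equal and the sample frames are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_equal(df, expected):
    """Return True if assert_frame_equal passes, False if it raises."""
    try:
        # check_names=False skips the comparison of index/column names
        assert_frame_equal(df, expected, check_names=False)
        return True
    except AssertionError:
        return False


# Hypothetical sample data: NaNs sit in the same cells in left and same
left = pd.DataFrame([[np.nan, 1.0], [2.0, np.nan]])
same = left.copy()
different = pd.DataFrame([[np.nan, 1.0], [np.nan, np.nan]])

same_result = frames_equal(left, same)            # NaNs in matching cells
different_result = frames_equal(left, different)  # extra NaN in one frame
```

In a unit test it is usually simpler to call assert_frame_equal directly, since the raised AssertionError describes where the frames differ.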
    

    In more recent versions of pandas this functionality is available as the .equals method, which treats NaNs in the same location as equal:

    df.equals(expected)
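
A small sketch of how .equals behaves with NaNs, compared with elementwise ==. The frames and variable names are invented for illustration; note that, as far as the pandas documentation describes it, .equals also requires corresponding columns to have the same dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frames with NaNs in identical positions
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})
expected = df.copy()

# Elementwise == yields False wherever a NaN is involved,
# so reducing it with all() reports the frames as unequal
elementwise_all_equal = bool((df == expected).all().all())

# equals() treats NaNs in matching positions as equal
frames_equal = df.equals(expected)

# equals() is dtype-sensitive: same values, int64 vs float64 columns
ints = pd.DataFrame({"a": [1, 2]})
floats = ints.astype(float)
dtype_sensitive_result = ints.equals(floats)
```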
    
  • 2020-11-29 07:43

    Like @PhillipCloud's answer, but written out in more detail.

    In [26]: df1 = DataFrame([[np.nan,1],[2,np.nan]])
    
    In [27]: df2 = df1.copy()
    

    They really are equivalent

    In [28]: result = df1 == df2
    
    In [29]: result[pd.isnull(df1) & pd.isnull(df2)] = True
    
    In [30]: result
    Out[30]: 
          0     1
    0  True  True
    1  True  True
    

    Now with a NaN in df2 that doesn't exist in df1:

    In [31]: df2 = DataFrame([[np.nan,1],[np.nan,np.nan]])
    
    In [32]: result = df1 == df2
    
    In [33]: result[pd.isnull(df1) & pd.isnull(df2)] = True
    
    In [34]: result
    Out[34]: 
           0     1
    0   True  True
    1  False  True
    

    You can also fill the NaNs with a value you know is not in the frame:

    In [38]: df1.fillna(-999) == df1.fillna(-999)
    Out[38]: 
          0     1
    0  True  True
    1  True  True
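
The sentinel approach only works if the fill value truly never occurs in the data. The sketch below, using invented frames and a hypothetical SENTINEL constant, shows both the intended behaviour and the false positive you get when the sentinel collides with a real value.

```python
import numpy as np
import pandas as pd

SENTINEL = -999  # hypothetical fill value; must not occur in the real data

df1 = pd.DataFrame([[np.nan, 1.0], [2.0, np.nan]])
df2 = df1.copy()

# NaNs in matching positions compare equal once both sides are filled
filled_equal = bool(
    (df1.fillna(SENTINEL) == df2.fillna(SENTINEL)).all().all()
)

# df3 holds the sentinel as a genuine value where df1 has NaN,
# so the filled comparison wrongly reports the frames as equal
df3 = pd.DataFrame([[float(SENTINEL), 1.0], [2.0, np.nan]])
collision_equal = bool(
    (df1.fillna(SENTINEL) == df3.fillna(SENTINEL)).all().all()
)
```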
    
  • 2020-11-29 07:46
    df.fillna(0) == df2.fillna(0)
    

    You can use fillna(); see the pandas documentation for details.

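    from pandas import DataFrame

    # create a DataFrame with NaNs (the first row has no 'c' value)
    df = DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
    df2 = df.copy()

    # elementwise comparison yields False wherever both frames hold NaN
    print(df == df2)

    # after filling NaNs with the same value, every cell compares equal
    # (the fill value must not occur in the real data)
    print(df.fillna(0) == df2.fillna(0))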
    
  • 2020-11-29 07:54

    Any equality comparison using == with np.nan is False; even np.nan == np.nan is False.

    The simple approach is df1.fillna('NULL') == df2.fillna('NULL'), provided 'NULL' is not a value in the original data.

    To be safe, do the following:

    Example a) Compare two dataframes with NaN values

    bools = (df1 == df2)
    bools[pd.isnull(df1) & pd.isnull(df2)] = True
    assert bools.all().all()
    

    Example b) Filter rows in df1 that do not match with df2

    bools = (df1 != df2)
    bools[pd.isnull(df1) & pd.isnull(df2)] = False
    df_outlier = df1[bools.any(axis=1)]
    

    (Note: bools[pd.isnull(df1) == pd.isnull(df2)] = False is wrong, because that mask is also True wherever both values are non-null.)
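
To make the note above concrete, here is a minimal sketch comparing the two masks on invented data. The frames and the names good and bad are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: x differs in row 0, y has NaN in row 0 on both sides
df1 = pd.DataFrame({"x": [1.0, 3.0], "y": [np.nan, 4.0]})
df2 = pd.DataFrame({"x": [2.0, 3.0], "y": [np.nan, 4.0]})

# Correct: clear only the cells where both frames hold NaN
good = (df1 != df2)
good[pd.isnull(df1) & pd.isnull(df2)] = False

# Wrong: this mask is True wherever the null status matches, which
# includes every pair of non-null cells, so the real mismatch is erased
bad = (df1 != df2)
bad[pd.isnull(df1) == pd.isnull(df2)] = False

good_mismatches = int(good.values.sum())  # counts the x mismatch in row 0
bad_mismatches = int(bad.values.sum())    # the mismatch has been lost
```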

  • 2020-11-29 07:56

    One of the properties of NaN is that NaN != NaN is True.

    Check out this answer for a nice way to do this using numexpr.

    (a == b) | ((a != a) & (b != b))
    

    says this (in pseudocode):

    a == b or (isnan(a) and isnan(b))
    

    So, either a equals b, or both a and b are NaN.
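
The same expression can be tried on plain NumPy arrays. This sketch uses NumPy only rather than numexpr, and the helper name nan_equal and the sample arrays are assumptions made for illustration.

```python
import numpy as np


def nan_equal(a, b):
    """Elementwise equality that treats NaN as equal to NaN."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # (a != a) is True only where a is NaN, and likewise for b
    return (a == b) | ((a != a) & (b != b))


a = np.array([1.0, np.nan, 3.0, np.nan])
b = np.array([1.0, np.nan, 4.0, 5.0])

mask = nan_equal(a, b)        # per-element result
all_equal = bool(mask.all())  # single bool for the whole comparison
```

Positions 0 and 1 compare equal (identical values, then NaN on both sides), while positions 2 and 3 do not.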

    If you have small frames, assert_frame_equal will be okay. For large frames (10M rows), however, I found assert_frame_equal pretty much useless: it was taking so long that I had to interrupt it.

    In [1]: df = DataFrame(np.random.rand(10**7, 15))
    
    In [2]: df = df[df > 0.5]
    
    In [3]: df2 = df.copy()
    
    In [4]: df
    Out[4]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 10000000 entries, 0 to 9999999
    Columns: 15 entries, 0 to 14
    dtypes: float64(15)
    
    In [5]: timeit (df == df2) | ((df != df) & (df2 != df2))
    1 loops, best of 3: 598 ms per loop
    

    Timing the (presumably) desired single bool that indicates whether the two DataFrames are equal:

    In [9]: timeit ((df == df2) | ((df != df) & (df2 != df2))).values.all()
    1 loops, best of 3: 687 ms per loop
    