I've got a script updating 5-10 columns' worth of data, but sometimes the start CSV will be identical to the end CSV, so instead of writing an identical CSV file I want it to
To pull out the symmetric differences:
df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)
For example:
df1 = pd.DataFrame({
    'num': [1, 4, 3],
    'name': ['a', 'b', 'c'],
})
df2 = pd.DataFrame({
    'num': [1, 2, 3],
    'name': ['a', 'b', 'd'],
})
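This yields the four rows that appear in only one of the two frames; the shared row (1, 'a') is dropped from both.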
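For reference, here is a self-contained sketch of the example above together with the output it prints. It uses the same df1 and df2 and assumes nothing beyond a standard pandas install.

```python
import pandas as pd

df1 = pd.DataFrame({'num': [1, 4, 3], 'name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'num': [1, 2, 3], 'name': ['a', 'b', 'd']})

# Stack both frames, then drop every row that occurs more than once.
# Rows present in both frames vanish; rows unique to either one remain.
df_diff = pd.concat([df1, df2]).drop_duplicates(keep=False)

print(df_diff)
#    num name
# 1    4    b
# 2    3    c
# 1    2    b
# 2    3    d
```

Keep in mind that keep=False drops every repeated row, so a row duplicated inside df1 alone would also disappear from the result.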
Note: on pandas versions that warn about how the default of the sort argument will change in the future, pass sort=False explicitly to avoid the warning, as below:
df_diff = pd.concat([df1,df2], sort=False).drop_duplicates(keep=False)
In my case, I had a weird error whereby, even though the indices, column names
and values were the same, the DataFrames didn't match. I tracked it down to the
data types: it seems pandas can sometimes use different dtypes for the same
values, resulting in such problems.
For example:
param2 = pd.DataFrame({'a': [1]})
param1 = pd.DataFrame({'a': [1], 'b': [2], 'c': [2], 'step': ['alpha']})
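If you check param1.dtypes and param2.dtypes, both report int64 for 'a', but
param1 also holds an object column ('step'). Once you do some manipulation
using a combination of param1 and param2 (for example, pulling a row out of
param1, which yields an object Series), 'a' can end up as object in one
result and int64 in the other, and other internal parameters of the
dataframe will deviate from the default ones.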
So after the final dataframe is generated, even though the actual values that
are printed out are the same, final_df1.equals(final_df2) may turn out to be
False, because those small internal parameters like Axis 1, ObjectBlock and
IntBlock may not be the same.
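Here is one way such a mismatch can arise, as a rough sketch rather than the exact steps from the case described above: pulling a row out of a mixed-dtype frame gives an object Series, so a frame rebuilt from it stores 'a' as object. The names row, final_df1 and final_df2 are only illustrative.

```python
import pandas as pd

param1 = pd.DataFrame({'a': [1], 'b': [2], 'c': [2], 'step': ['alpha']})
param2 = pd.DataFrame({'a': [1]})

# A row of a frame that mixes integers and strings comes back as an
# object-dtype Series, so the integer 1 is now stored as a Python object.
row = param1.iloc[0]

# Rebuild a one-column frame from that row; 'a' keeps the object dtype.
final_df1 = row[['a']].to_frame().T.reset_index(drop=True)
final_df2 = param2

print(final_df1.dtypes['a'])                   # object
print(final_df2.dtypes['a'])                   # int64
print(final_df1.equals(final_df2))             # False
print((final_df1 == final_df2).all().all())    # True
```

The printed frames look identical, yet equals() reports False because it also requires matching dtypes.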
An easy way to get around this and compare the values is to use
final_df1 == final_df2.
However, this does an element-by-element comparison and returns a DataFrame of
booleans, so it won't work if you are using it in an assert statement, for example in pytest.
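What works well is
(final_df1 == final_df2).all().all().
This does an element-by-element comparison and reduces it to a single boolean, while neglecting the parameters not important for comparison. Note that the built-in all(final_df1 == final_df2) only iterates over the column labels, so it does not actually check the values.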
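For use in a test, a minimal sketch along these lines may help. frames_match is a hypothetical helper name, not a pandas function; pandas.testing.assert_frame_equal with check_dtype=False is the library's own way to compare values while ignoring dtype.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_match(left, right):
    """Hypothetical helper: True when labels and values match, ignoring dtype."""
    # == raises on differently labelled frames, so check the labels first.
    if not (left.index.equals(right.index) and left.columns.equals(right.columns)):
        return False
    # Reduce the boolean frame over rows, then over columns.
    return bool((left == right).all().all())


# Same values, different dtypes (object vs int64).
final_df1 = pd.DataFrame({'a': [1, 2]}).astype(object)
final_df2 = pd.DataFrame({'a': [1, 2]})

assert frames_match(final_df1, final_df2)

# Raises AssertionError with a readable diff if the values differ.
assert_frame_equal(final_df1, final_df2, check_dtype=False)
```

Note that == treats NaN as unequal to NaN, so for frames containing missing values assert_frame_equal is the safer of the two.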
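If your values and indices are the same but final_df1.equals(final_df2) returns False, you can inspect final_df1._data and final_df2._data to check the rest of the elements of the dataframes. Note that _data is a private attribute (exposed as _mgr in newer pandas releases), so it may change between versions.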
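A way to find the culprit without touching private attributes is to compare the public dtypes column by column. A small sketch, with made-up frames standing in for final_df1 and final_df2:

```python
import pandas as pd

# Illustrative frames: identical values, but 'a' is object in the first one.
final_df1 = pd.DataFrame({'a': [1], 'step': ['alpha']}).astype({'a': object})
final_df2 = pd.DataFrame({'a': [1], 'step': ['alpha']})

# Collect the columns whose dtype differs between the two frames.
mismatched = [
    col for col in final_df1.columns
    if final_df1[col].dtype != final_df2[col].dtype
]

print(mismatched)  # ['a']
```

This assumes both frames share the same column labels; casting the listed columns to a common dtype with astype should then make equals() agree.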