Customize large dataset comparison in PySpark
Question

I'm using the code below to compare two DataFrames and identify differences. However, I'm noticing that I'm simply overwriting my values (combine_df). My goal is to flag rows whose values are different, but I'm not sure what I'm doing wrong.

```python
# Find the overlapping columns in order to compare their values
cols = set(module_df.columns) & set(expected_df.columns)

# Create filter dataframes only with the overlapping columns
filter_module = expected_df.select(list(cols))
filter_expected = expected_df
```
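For reference, here is a minimal sketch of one way to get that kind of per-row flag: join the two DataFrames on a shared key and compare every overlapping column with a null-safe equality check. This is not the code from the question; the "id" key column, the sample data, and the helper column names are assumptions made purely for illustration.

```python
# A minimal sketch, not the original code: flag rows whose values differ by
# joining the two DataFrames on an assumed "id" key and comparing each
# overlapping column. Sample data below is made up for illustration.
from functools import reduce

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

module_df = spark.createDataFrame(
    [(1, "a", 10), (2, "b", 20), (3, "c", 30)], ["id", "col1", "col2"]
)
expected_df = spark.createDataFrame(
    [(1, "a", 10), (2, "b", 99), (3, "x", 30)], ["id", "col1", "col2"]
)

# Overlapping columns, excluding the assumed join key
common_cols = sorted((set(module_df.columns) & set(expected_df.columns)) - {"id"})

# Full outer join on the key so each output row carries both versions of a record
joined = module_df.alias("m").join(expected_df.alias("e"), on="id", how="full_outer")

# One boolean flag per compared column; eqNullSafe treats two NULLs as equal,
# so a value missing on both sides is not reported as a difference.
diff_flags = [
    (~F.col(f"m.{c}").eqNullSafe(F.col(f"e.{c}"))).alias(f"{c}_differs")
    for c in common_cols
]

result = joined.select(
    "id",
    *[F.col(f"m.{c}").alias(f"{c}_module") for c in common_cols],
    *[F.col(f"e.{c}").alias(f"{c}_expected") for c in common_cols],
    *diff_flags,
).withColumn(
    # Overall flag: True if any individual column differs on this row
    "row_differs",
    reduce(lambda a, b: a | b, [F.col(f"{c}_differs") for c in common_cols]),
)

result.show()
```

The full outer join keeps rows that exist on only one side, and OR-ing the per-column flags yields a single row_differs column, so differing rows can be filtered or inspected without overwriting either source DataFrame.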