Customize large datasets comparison in pySpark


Question


I'm using the code below to compare two dataframes and identify differences. However, I'm noticing that I'm simply overwriting my values (combine_df). My goal is to flag rows whose values differ, but I'm not sure what I'm doing wrong.

#imports used by the snippet below
import sys
from pyspark.sql import Window
from pyspark.sql.functions import col, count, lit, when

#find the overlapping columns in order to compare their values
cols = set(module_df.columns) & set(expected_df.columns)
#create filtered dataframes with only the overlapping columns
filter_module = expected_df.select(list(cols))
filter_expected = expected_df.select(list(cols))
#create FLAG columns to serve as identifiers
filter_module = filter_module.withColumn('FLAG', lit('module'))
filter_expected = filter_expected.withColumn('FLAG', lit('expected'))

#union the dataframes
combine_df = filter_module.union(filter_expected)

#get column names to partition by
combine_cols = combine_df.columns
combine_cols.remove('FLAG')
#leverage a window function
my_window = Window.partitionBy(combine_cols).rowsBetween(-sys.maxsize, sys.maxsize)

#dataframe with validation flag
combine_df = combine_df.withColumn('FLAG', when((count('*').over(my_window) > 1), 'SAME').otherwise(col('FLAG'))).dropDuplicates()

Answer 1:


Have you used the correct dataframe? Both filtered frames are built from expected_df, so you end up comparing expected_df with itself:

#instead of this
filter_module = expected_df.select(list(cols))
filter_expected = expected_df.select(list(cols))
#use this
filter_module = module_df.select(list(cols))
filter_expected = expected_df.select(list(cols))
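With that fix applied, a minimal self-contained sketch could look like the following. The sample data and the column names id and value are invented purely for illustration; the logic is otherwise the same as in the question. Rows that appear in both dataframes come out flagged as SAME, while rows unique to one side keep their module or expected flag.

import sys
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, count, lit, when

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data: id 1 matches, id 2 differs between the two sources
module_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
expected_df = spark.createDataFrame([(1, "a"), (2, "c")], ["id", "value"])

# overlapping columns, each source tagged with its own FLAG
cols = set(module_df.columns) & set(expected_df.columns)
filter_module = module_df.select(list(cols)).withColumn("FLAG", lit("module"))
filter_expected = expected_df.select(list(cols)).withColumn("FLAG", lit("expected"))

combine_df = filter_module.union(filter_expected)
combine_cols = [c for c in combine_df.columns if c != "FLAG"]

# rows whose overlapping values occur more than once exist in both sources
my_window = Window.partitionBy(combine_cols).rowsBetween(-sys.maxsize, sys.maxsize)
result = combine_df.withColumn(
    "FLAG",
    when(count("*").over(my_window) > 1, "SAME").otherwise(col("FLAG"))
).dropDuplicates()

result.show()
# expected output (row order may vary):
# id=1, value=a -> SAME
# id=2, value=b -> module
# id=2, value=c -> expected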


Source: https://stackoverflow.com/questions/62272046/customize-large-datasets-comparison-in-pyspark
