Pyspark drop_duplicates(keep=False)
Question: I need a PySpark solution for Pandas `drop_duplicates(keep=False)`. Unfortunately, the `keep=False` option is not available in PySpark.

Pandas example:

```python
import pandas as pd

df_data = {'A': ['foo', 'foo', 'bar'],
           'B': [3, 3, 5],
           'C': ['one', 'two', 'three']}
df = pd.DataFrame(data=df_data)
df = df.drop_duplicates(subset=['A', 'B'], keep=False)
print(df)
```

Expected output:

```
     A  B      C
2  bar  5  three
```

A conversion with `.to_pandas()` and back to PySpark is not an option. Thanks!

Answer 1: Use a window function to count the rows in each `(A, B)` group, then keep only the rows whose count is 1.
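A minimal sketch of that window-count approach, assuming the same column names and sample data as the question (the helper column name `_cnt` is arbitrary):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Same sample data as in the Pandas example above.
df = spark.createDataFrame(
    [("foo", 3, "one"), ("foo", 3, "two"), ("bar", 5, "three")],
    ["A", "B", "C"],
)

# Count how many rows share the same (A, B) combination,
# then keep only the rows that appear exactly once.
w = Window.partitionBy("A", "B")
result = (
    df.withColumn("_cnt", F.count("*").over(w))
      .filter(F.col("_cnt") == 1)
      .drop("_cnt")
)

result.show()
# +---+---+-----+
# |  A|  B|    C|
# +---+---+-----+
# |bar|  5|three|
# +---+---+-----+
```

This mirrors `keep=False`: rows that have any duplicate on the subset columns are dropped entirely, rather than keeping a first or last occurrence.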