How to write a PySpark UDAF on multiple columns?
Question

I have the following data in a PySpark DataFrame called end_stats_df:

values  start  end  cat1  cat2
10      1      2    A     B
11      1      2    C     B
12      1      2    D     B
510     1      2    D     C
550     1      2    C     B
500     1      2    A     B
80      1      3    A     B

I want to aggregate it in the following way:

- Use the "start" and "end" columns as the aggregation keys.
- For each group of rows, compute the number of unique values across both cat1 and cat2 for that group. For example, for the group with start=1 and end=2, this number would be 4, because the distinct values appearing in cat1 and cat2 are A, B, C, and D.
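For reference, here is a minimal sketch of one way to express this aggregation without a custom UDAF, using only built-in functions (collect_set, array_union, size). It assumes a running SparkSession named spark and recreates the sample data above; the column and alias names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the question.
data = [
    (10, 1, 2, "A", "B"),
    (11, 1, 2, "C", "B"),
    (12, 1, 2, "D", "B"),
    (510, 1, 2, "D", "C"),
    (550, 1, 2, "C", "B"),
    (500, 1, 2, "A", "B"),
    (80, 1, 3, "A", "B"),
]
end_stats_df = spark.createDataFrame(data, ["values", "start", "end", "cat1", "cat2"])

# Group by (start, end), collect the distinct values of each category
# column, union the two sets (array_union drops duplicates), and take
# the size of the result as the distinct count across both columns.
result = (
    end_stats_df
    .groupBy("start", "end")
    .agg(
        F.size(
            F.array_union(F.collect_set("cat1"), F.collect_set("cat2"))
        ).alias("n_unique_cats")
    )
)
result.show()
# For start=1, end=2 this yields 4 (distinct values: A, B, C, D).
```

array_union requires Spark 2.4 or later; on older versions, a pandas_udf-based grouped aggregation would be one alternative route.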