In Spark, how to do One Hot Encoding for top N frequent values only?
问题 Let, in my dataframe df, I have a column my_category in which I have different values, and I can view the value counts using: df.groupBy("my_category").count().show() value count a 197 b 166 c 210 d 5 e 2 f 9 g 3 Now, I'd like to apply One Hot Encoding (OHE) on this column, but for the top N frequent values only (say N = 3 ), and put all the rest infrequent values in a dummy column (say, "default"). E.g., the output should be something like: a b c default 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 ... 0