Why does Spark's OneHotEncoder drop the last category by default?
I would like to understand the rationale behind Spark's OneHotEncoder dropping the last category by default.

For example:

    >>> from pyspark.ml.feature import OneHotEncoder, StringIndexer
    >>> fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
    >>> ss = StringIndexer(inputCol="c", outputCol="c_idx")
    >>> ff = ss.fit(fd).transform(fd)
    >>> ff.show()
    +----+---+-----+
    |   x|  c|c_idx|
    +----+---+-----+
    | 1.0|  a|  0.0|
    | 1.5|  a|  0.0|
    |10.0|  b|  1.0|
    | 3.2|  c|  2.0|
    +----+---+-----+

By default, the OneHotEncoder will drop the last category:

    >>> oe = OneHotEncoder(inputCol="c_idx", outputCol="c_idx_vec")
    >>> fe = oe.transform(ff)
    >>>