What hashing function does Spark use for HashingTF and how do I duplicate it?
Question: Spark MLlib has a HashingTF() function that computes document term frequencies based on a hashed value of each of the terms.

1) What function does it use to do the hashing?
2) How can I reproduce the same hashed value from Python?
3) If I want to compute the hashed output for a single given input, without computing the term frequency, how can I do this?

Answer 1: If you're in doubt, it is usually good to check the source. In the Spark 1.x PySpark source (pyspark/mllib/feature.py), the bucket for a given term is determined as follows:

    def indexOf(self, term):
        """ Returns the index of the input term. """
        return hash(term) % self.numFeatures

In other words, it is simply the built-in hash of the term modulo the number of features.
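To answer questions 2) and 3) together, here is a minimal sketch in plain Python, assuming the Spark 1.x PySpark behavior above (bucket = hash(term) % numFeatures) and HashingTF's default feature count of 2**20. The helper names index_of and hashing_tf are illustrative, not Spark API:

```python
from collections import defaultdict

def index_of(term, num_features=1 << 20):  # 2**20 is HashingTF's default size
    # Bucket for a single term, without computing any term frequency.
    # Python's built-in hash() stands in for what PySpark uses; note that
    # string hashing is randomized per interpreter run unless PYTHONHASHSEED
    # is fixed, which is why matching a running PySpark job requires the
    # same seed setting.
    return hash(term) % num_features

def hashing_tf(terms, num_features=1 << 20):
    # Term frequencies keyed by bucket, mirroring HashingTF.transform
    # on a single document (a list of terms).
    freq = defaultdict(int)
    for term in terms:
        freq[index_of(term, num_features)] += 1
    return dict(freq)
```

Calling index_of("spark") alone gives the hashed bucket for one term; hashing_tf(["a", "b", "a"]) gives the sparse term-frequency map over those buckets. Note that Spark 2.0+ switched HashingTF to MurmurHash3, so this sketch only matches the older hash-based behavior.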