What hashing function does Spark use for HashingTF and how do I duplicate it?

挽巷 2020-12-21 02:12

Spark MLlib has a HashingTF() function that computes document term frequencies based on a hashed value of each term.

1) What function does it use to do the hashing, and 2) how can I duplicate it?

2 Answers
  • 2020-12-21 02:19

    If you're in doubt, it's usually good to check the source. The bucket for a given term is determined as follows:

    def indexOf(self, term):
        """ Returns the index of the input term. """
        return hash(term) % self.numFeatures
    

    As you can see, it is just a plain old hash modulo the number of buckets.
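
    You can sanity-check this against a live Spark install by comparing indexOf with a hand-computed bucket (a minimal sketch, assuming pyspark is importable; note that on Python 3 string hashes are randomized per process unless PYTHONHASHSEED is fixed):

    from pyspark.mllib.feature import HashingTF

    tf = HashingTF(numFeatures=1 << 20)
    term = "spark"
    # Both sides call the same Python hash(), so the indices must agree:
    assert tf.indexOf(term) == hash(term) % (1 << 20)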

    The final hash is just a vector of counts per bucket (I've omitted the docstring and the RDD case for brevity):

    def transform(self, document):
        freq = {}
        for term in document:
            i = self.indexOf(term)          # bucket for this term
            freq[i] = freq.get(i, 0) + 1.0  # accumulate a count per bucket
        return Vectors.sparse(self.numFeatures, freq.items())
    

    If you want to ignore frequencies you can pass set(document) as the input, but I doubt there is much to gain here: to build the set you have to compute a hash for each element anyway.
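
    For reference, the whole thing is easy to replicate without Spark at all (a hypothetical helper that just restates the logic quoted above):

    def hashing_tf(document, num_features=1 << 20):
        # Plain-Python replica of the mllib code above:
        # bucket each term with hash() and count occurrences per bucket.
        freq = {}
        for term in document:
            i = hash(term) % num_features
            freq[i] = freq.get(i, 0) + 1.0
        return freq

    print(hashing_tf(["a", "b", "a"], num_features=16))
    # counts per bucket, e.g. 2.0 for 'a' and 1.0 for 'b'; the bucket ids
    # (and any collisions) depend on your Python's hash()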

  • 2020-12-21 02:26

    It seems to me that there is something else going on under the hood beyond what the source zero323 linked shows. I found that hashing and then taking the modulus, as that source code does, wouldn't give me the same indices HashingTF generates. At least for single characters, what I had to do was convert the char to its ASCII code, like so (Python 2.7):

    index = ord('a') # 97
    

    This corresponds to what HashingTF outputs for the index. If instead I did the same thing the quoted source appears to do, namely:

    index = hash('a') % (1 << 20)  # 897504 on 64-bit CPython 2.7
                                   # (note the parentheses: % binds tighter than <<)
    

    I would very clearly get the wrong index.
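
    A plausible explanation (an assumption on my part, not confirmed in this thread): the JVM side of Spark buckets on the Java/Scala hashCode rather than on Python's hash(), and Java's String.hashCode() of a one-character string is exactly its code point, which matches the ord('a') == 97 observation. A sketch of reproducing that hash from Python:

    def java_string_hashcode(s):
        # Replicate Java's String.hashCode():
        #   h = s[0]*31**(n-1) + s[1]*31**(n-2) + ... + s[n-1]
        # with the JVM's signed 32-bit overflow semantics.
        h = 0
        for ch in s:
            h = (31 * h + ord(ch)) & 0xFFFFFFFF
        return h - (1 << 32) if h >= (1 << 31) else h

    print(java_string_hashcode('a'))              # 97, same as ord('a')
    # Bucket it the way HashingTF would; Python's % already yields a
    # non-negative result for a positive modulus, so no extra fixup needed:
    print(java_string_hashcode('a') % (1 << 20))  # 97

    For multi-character terms this diverges from Python's hash(), which would explain the mismatch described above.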
