What hashing function does Spark use for HashingTF and how do I duplicate it?


Question


Spark MLlib has a HashingTF() function that computes document term frequencies based on a hashed value of each of the terms.

1) What function does it use to do the hashing?

2) How can I achieve the same hashed value from Python?

3) If I want to compute the hashed output for a given single input, without computing the term frequency, how can I do this?


Answer 1:


If you're in doubt, it is usually a good idea to check the source. The bucket for a given term is determined as follows:

def indexOf(self, term):
    """ Returns the index of the input term. """
    return hash(term) % self.numFeatures

As you can see, it is just a plain old hash modulo the number of buckets.
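If you want to reproduce the index outside of Spark, here is a minimal sketch, assuming numFeatures is 1 << 20 (the HashingTF default) and that your interpreter's built-in hash matches the one used by the PySpark workers (for Python 3 that also means a fixed PYTHONHASHSEED):

num_features = 1 << 20  # assumed: the HashingTF default

def index_of(term, num_features=num_features):
    # Same scheme as above: built-in hash, then modulo the number of buckets.
    return hash(term) % num_features

# Should match HashingTF(numFeatures=num_features).indexOf("spark")
# under the assumptions stated above.
print(index_of("spark"))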

The final hash is just a vector of counts per bucket (I've omitted the docstring and the RDD case for brevity):

def transform(self, document):
    freq = {}
    for term in document:
        i = self.indexOf(term)
        freq[i] = freq.get(i, 0) + 1.0
    return Vectors.sparse(self.numFeatures, freq.items())

If you want to ignore frequencies, you can use set(document) as the input, but I doubt there is much to gain here: to create the set you'll have to compute the hash of each element anyway.
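For context, a short usage sketch of the pyspark.mllib API discussed above (the document and the numFeatures value are only illustrative):

from pyspark.mllib.feature import HashingTF

tf = HashingTF(numFeatures=1 << 20)  # 1 << 20 is also the default
doc = ["spark", "hashing", "spark"]

# Term-frequency vector: the bucket for "spark" gets a count of 2.0.
print(tf.transform(doc))

# Frequency-insensitive variant: hash the distinct terms only.
print(tf.transform(set(doc)))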




Answer 2:


It seems to me that there is something else going on under the hood beyond what the source zero323 linked shows. I found that hashing and then taking the modulus, as the source code does, wouldn't give me the same indices that HashingTF generates. At least for single characters, what I had to do was convert the character to its ASCII code, like so (Python 2.7):

index = ord('a') # 97

This corresponds to what HashingTF outputs for the index. If I did the same thing that HashingTF appears to do, which is:

index = hash('a') % (1 << 20) # 897504

I would get what is very clearly the wrong index.
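A possible explanation, offered here only as an assumption and not confirmed in this thread, is that the observed index was produced on the JVM side, where (in older Spark versions) HashingTF hashes each term with Java's String.hashCode rather than Python's hash. For a single ASCII character, Java's hashCode is simply its code point, which would explain why ord('a') matches. A sketch of replicating that hash in Python:

def java_string_hashcode(s):
    # Replicates Java's String.hashCode with 32-bit signed overflow.
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h > 0x7FFFFFFF else h

num_features = 1 << 20  # assumed HashingTF default

# For "a", the Java hashCode is 97, so the bucket is 97 % (1 << 20) == 97,
# matching the ord('a') observation above.
print(java_string_hashcode('a') % num_features)  # 97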



Source: https://stackoverflow.com/questions/31540164/what-hashing-function-does-spark-use-for-hashingtf-and-how-do-i-duplicate-it
