Spark MLlib has a HashingTF() function that computes document term frequencies based on a hashed value of each of the terms.

1) What function does it use to do the hashing?
If you're in doubt, it is usually good to check the source. The bucket for a given term is determined as follows:
def indexOf(self, term):
    """ Returns the index of the input term. """
    return hash(term) % self.numFeatures
As you can see, it is just a plain old hash modulo the number of buckets.
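You can reproduce the bucketing logic in plain Python to get a feel for it. This is just a sketch with a toy numFeatures of 10 (the Spark default is much larger, 1 << 20); note that in Python 3, hash() of strings is randomized per process unless PYTHONHASHSEED is fixed, so the exact buckets will vary between runs:

num_features = 10  # toy value for illustration only

def bucket(term):
    # Same idea as HashingTF.indexOf: plain hash modulo number of buckets
    return hash(term) % num_features

for term in ["spark", "hashing", "tf"]:
    print(term, "->", bucket(term))

With a small number of buckets like this you can also easily observe collisions, where two distinct terms land in the same bucket and their counts get merged.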
The final hash is just a vector of counts per bucket (I've omitted the docstring and the RDD case for brevity):
def transform(self, document):
    freq = {}
    for term in document:
        i = self.indexOf(term)
        freq[i] = freq.get(i, 0) + 1.0
    return Vectors.sparse(self.numFeatures, freq.items())
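For completeness, a quick usage sketch, assuming pyspark is on your path; numFeatures=16 is a toy size rather than the default, and the indices shown in the comment are purely illustrative since they depend on Python's string hashing:

from pyspark.mllib.feature import HashingTF

htf = HashingTF(numFeatures=16)
doc = ["a", "b", "a", "c"]
print(htf.transform(doc))
# Something like: SparseVector(16, {3: 1.0, 7: 2.0, 11: 1.0})
# where "a" contributes the 2.0 because it appears twice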
If you want to ignore frequencies, you can pass set(document) as the input, but I doubt there is much to gain here. To create the set, you have to compute the hash of each element anyway.
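For illustration, here is a sketch of that binary variant side by side with the default behavior (again with a toy numFeatures; the actual bucket indices will differ):

from pyspark.mllib.feature import HashingTF

htf = HashingTF(numFeatures=16)
doc = ["a", "b", "a"]
print(htf.transform(doc))       # "a" contributes 2.0 to its bucket
print(htf.transform(set(doc)))  # every distinct term contributes exactly 1.0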