Question
I'm looking to use scikit-learn's HashingVectorizer because it's a great fit for online learning problems (new tokens in text are guaranteed to map to a "bucket"). Unfortunately the implementation included in scikit-learn doesn't seem to include support for tf-idf features. Is passing the vectorizer output through a TfidfTransformer the only way to make online updates work with tf-idf features, or is there a more elegant solution out there?
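For context, here is a minimal sketch of the two-step approach mentioned in the question: hash the raw text into a fixed-size count space, then layer TF-IDF on top. Note that scikit-learn's TfidfTransformer only offers fit/fit_transform, not partial_fit, which is exactly where the online-learning problem arises.

# A minimal sketch of the pipeline described in the question: hash raw text
# into a fixed-size count space, then layer TF-IDF on top.  TfidfTransformer
# has no partial_fit, so the IDF statistics are estimated from whatever batch
# it is fit on, not updated incrementally.
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

docs = ["the cat sat on the mat", "the dog ate my homework"]

# alternate_sign=False keeps non-negative counts so TF-IDF weighting makes sense
hasher = HashingVectorizer(n_features=2**18, alternate_sign=False, norm=None)
counts = hasher.transform(docs)   # stateless: unseen tokens still map to a bucket

tfidf = TfidfTransformer()
X = tfidf.fit_transform(counts)   # IDF comes from this batch only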
Answer 1:
Intrinsically, you cannot use TF-IDF in a truly online fashion, because the IDF of every past feature changes with each new document. That would mean revisiting and re-training on all previous documents, which would no longer be online.
There may be some approximations, but you would have to implement them yourself.
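As an illustration, one possible approximation (hand-rolled, not part of scikit-learn; the class name and details below are my own) is to keep running document-frequency counts per hashed feature and recompute the IDF weights from them on demand, accepting that documents transformed earlier are never re-weighted.

# One possible approximation: running document frequencies per hash bucket,
# with IDF recomputed from them at transform time.  Previously transformed
# documents are never re-weighted - that is the approximation being made.
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer

class StreamingTfidf:
    def __init__(self, n_features=2**18):
        self.hasher = HashingVectorizer(n_features=n_features,
                                        alternate_sign=False, norm=None)
        self.df = np.zeros(n_features)   # document frequency per hash bucket
        self.n_docs = 0

    def partial_fit(self, docs):
        counts = self.hasher.transform(docs)   # CSR matrix of term counts
        self.df += np.bincount(counts.indices, minlength=self.df.shape[0])
        self.n_docs += counts.shape[0]
        return self

    def transform(self, docs):
        counts = self.hasher.transform(docs)
        # smoothed IDF, the same formula scikit-learn uses with smooth_idf=True
        idf = np.log((1 + self.n_docs) / (1 + self.df)) + 1
        return counts @ sparse.diags(idf)      # TF times the *current* IDF estimate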
Answer 2:
You can do "online" TF-IDF, contrary to what was said in the accepted answer.
In fact, every search engine (e.g. Lucene) does.
What does not work is assuming you have TF-IDF vectors in memory.
Search engines such as Lucene naturally avoid keeping all data in memory. Instead, they load one column at a time (which, due to sparsity, is not much data). The IDF arises trivially from the length of the inverted list.
The point is that you don't transform your data into TF-IDF and then do a standard cosine similarity.
Instead, you apply the current IDF weights when computing similarities, i.e. a weighted cosine similarity (often modified further with additional weighting, boosting terms, penalizing terms, etc.).
This approach will work with essentially any algorithm that allows attribute weighting at evaluation time. Many algorithms qualify, but unfortunately very few implementations are flexible enough; most expect you to multiply the weights into your data matrix before training.
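To make that concrete, here is a rough, hand-rolled sketch of the idea (the function and data layout are my own, not from any particular library): store raw term frequencies, keep document frequencies up to date as documents arrive, and apply the current IDF weights only at scoring time.

# Rough sketch: raw TF dicts are stored as-is; IDF is applied only when
# scoring, so it always reflects every document seen so far.
import numpy as np

def weighted_cosine(query_tf, doc_tf, df, n_docs):
    """Cosine similarity between raw term-frequency dicts, weighted by current IDF."""
    idf = lambda t: np.log((1 + n_docs) / (1 + df.get(t, 0))) + 1
    shared = set(query_tf) & set(doc_tf)
    dot = sum(query_tf[t] * doc_tf[t] * idf(t) ** 2 for t in shared)
    q_norm = np.sqrt(sum((query_tf[t] * idf(t)) ** 2 for t in query_tf))
    d_norm = np.sqrt(sum((doc_tf[t] * idf(t)) ** 2 for t in doc_tf))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

# df and n_docs come from whatever index state is current at query time,
# so old documents never need to be re-transformed.
df = {"cat": 3, "dog": 5, "mat": 1}
n_docs = 10
print(weighted_cosine({"cat": 2, "mat": 1}, {"cat": 1, "dog": 2}, df, n_docs))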
Source: https://stackoverflow.com/questions/24517793/online-version-of-scikit-learns-tfidfvectorizer