Question
I have a query and a set of documents, and I need to rank the documents by their tf-idf cosine similarity to the query. What support does Lucene offer for this? Which values can I get directly from Lucene (for example, is there a method that returns tf and idf?), and how do I compute the cosine similarity (is there a function that returns it directly if I pass in the query vector and the document vector?)
Thanks in advance.
Answer 1:
Lucene's default scoring is already a tuned variant of cosine similarity, so getting the raw cosine similarity out of it is probably doable. I recommend the official documentation page that discusses Lucene scoring.
If you want to extract that info on your own, this would be an outline of the steps for tf:
- index the corpus;
- open an IndexReader;
- iterate over all doc ids, 0 to maxDoc();
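- call getTermFreqVector(doc, fieldName) for each doc (the field must have been indexed with term vectors stored, otherwise this returns null);
- iterate over the parallel arrays tfv.getTerms() and tfv.getTermFrequencies().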
As for the docFreq, call IndexReader.terms() and iterate over the returned TermEnum, calling termEnum.docFreq() for each term.
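Once you have the term frequencies and document frequencies, the cosine part is plain arithmetic. Here is a minimal sketch in plain Java with no Lucene calls in it; the class name TfIdfCosine, the map-based vectors, and the idf formula ln(numDocs / docFreq) are my own assumptions for illustration, and Lucene's built-in scoring uses different weighting and normalization.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: tf-idf weighting plus cosine similarity
// over sparse term -> weight maps.
public class TfIdfCosine {

    // Weights each term as tf * idf, assuming idf = ln(numDocs / docFreq).
    public static Map<String, Double> tfIdf(Map<String, Integer> termFreqs,
                                            Map<String, Integer> docFreqs,
                                            int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer df = docFreqs.get(e.getKey());
            if (df == null || df == 0) {
                continue; // term unknown to the corpus: no idf, so skip it
            }
            double idf = Math.log((double) numDocs / df);
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    // Cosine of the angle between two sparse vectors; 0.0 if either has zero length.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up numbers: a corpus of 10 docs with these document frequencies.
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("lucene", 2);
        docFreqs.put("search", 5);
        docFreqs.put("java", 4);

        Map<String, Integer> queryTf = new HashMap<>();
        queryTf.put("lucene", 1);
        queryTf.put("java", 1);

        Map<String, Integer> docTf = new HashMap<>();
        docTf.put("lucene", 3);
        docTf.put("search", 1);

        double score = cosine(tfIdf(queryTf, docFreqs, 10), tfIdf(docTf, docFreqs, 10));
        System.out.println("cosine similarity = " + score);
    }
}
```

In practice you would fill docTf from getTerms()/getTermFrequencies(), docFreqs from docFreq(), compute the score for every document, and sort by it.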
Source: https://stackoverflow.com/questions/10173202/how-to-calculate-cosine-similarity-with-tf-idf-using-lucene-and-java