I have a query and a set of documents, and I need to rank the documents by their tf-idf cosine similarity to the query. What support does Lucene offer for this? Which quantities can I get from Lucene directly (is there a method that returns tf and idf?), and how do I compute cosine similarity with Lucene (is there a function that returns the cosine similarity if I pass in the query vector and the document vector?)
Thanks in advance.
Lucene's default scoring is already a modified version of cosine similarity, so if you need the raw cosine similarity itself, it's probably doable. I recommend the official page that discusses Lucene scoring.
If you want to extract that info on your own, this would be an outline of the steps for tf:
- index the corpus;
- open an IndexReader;
- iterate over all doc ids, 0 to maxDoc();
- getTermFreqVector(doc, fieldName);
- iterate over the parallel arrays tfv.getTerms() and tfv.getTermFrequencies().
As for the docFreq, use IndexReader.terms() and iterate over the returned TermEnum, calling termEnum.docFreq() for each term.
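Once you have the terms/frequencies arrays and the docFreq values out of the index, the cosine part itself is short. Here is a minimal sketch in plain Java with no Lucene calls in it. The class and method names (TfIdfCosine, tfIdfVector, cosine) are made up for illustration, and the idf used is the textbook log(N/df), which is an assumption on my part and not the formula Lucene's own Similarity applies.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: turns term/frequency arrays
// into tf-idf vectors and compares two such vectors.
final class TfIdfCosine {

    // terms/freqs are parallel arrays, like the ones read from a term vector.
    // docFreq maps term -> number of documents containing it; numDocs is the
    // corpus size. idf = log(numDocs / df) is an assumed textbook variant.
    static Map<String, Double> tfIdfVector(String[] terms, int[] freqs,
                                           Map<String, Integer> docFreq, int numDocs) {
        Map<String, Double> vec = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            Integer df = docFreq.get(terms[i]);
            if (df == null || df == 0) {
                continue; // term not present in the index: no weight
            }
            double idf = Math.log((double) numDocs / df);
            vec.put(terms[i], freqs[i] * idf);
        }
        return vec;
    }

    // Cosine similarity of two sparse vectors; returns 0 if either has zero norm.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for values read from an index.
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("lucene", 1);
        docFreq.put("java", 2);
        docFreq.put("the", 4);
        int numDocs = 4;

        Map<String, Double> doc = tfIdfVector(
                new String[] {"lucene", "java", "the"}, new int[] {2, 1, 5}, docFreq, numDocs);
        Map<String, Double> query = tfIdfVector(
                new String[] {"lucene"}, new int[] {1}, docFreq, numDocs);

        System.out.println("cosine(query, doc) = " + cosine(query, doc));
    }
}
```

To rank, build one vector per document plus one for the query (tokenized the same way the index was), compute cosine(query, doc) for each, and sort descending. Note that a term occurring in every document gets idf 0 under this formula, so it contributes nothing to the score.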
Source: https://stackoverflow.com/questions/10173202/how-to-calculate-cosine-similarity-with-tf-idf-using-lucene-and-java