Counting the word frequency in a Lucene index

Posted by 点点圈 on 2019-11-29 10:25:00

Question


Can someone help me find the frequency of a word across an entire Lucene index?
For example, if doc A contains word B 3 times and doc C contains it 2 times, I'd like a method that returns 5, the frequency of word B across the whole index.


Answer 1:


This has been asked multiple times:

  • Get term frequencies in Lucene
  • How to count term frequency for set of documents?
  • Get highest frequency terms from Lucene index
  • How do I get solr term frequency?



Answer 2:


Assuming you work with Lucene 3.x:

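import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

IndexReader ir = IndexReader.open(dir);
TermDocs termDocs = ir.termDocs(new Term("your_field", "your_word"));
int count = 0;
while (termDocs.next()) {
   count += termDocs.freq();  // occurrences of the term in the current document
}
termDocs.close();
ir.close();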

Some comments:

dir is an instance of Lucene's Directory class. Its creation differs for RAM and filesystem indexes; see the Lucene documentation for details.

"your_field" is the field in which to search for the term. If you have multiple fields, you can run the procedure for each of them and add up the results. Alternatively, when you index your files, you can create a special field (e.g. "_content") that holds the concatenated values of all the other fields, and search only that one.

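To make the arithmetic concrete, here is a minimal, stdlib-only Java sketch of what the loop in this answer computes and of the "run it for every field" option. It does not use the Lucene API: TermFrequencySketch, Posting, add, totalFrequency and totalAcrossFields are hypothetical names invented for this illustration, and the in-memory map merely stands in for a real index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for an index: field -> term -> postings.
// Every name here is hypothetical; none of this is Lucene API.
class TermFrequencySketch {

    // One posting: a document number and the term's frequency in that document.
    static final class Posting {
        final int doc;
        final int freq;

        Posting(int doc, int freq) {
            this.doc = doc;
            this.freq = freq;
        }
    }

    private final Map<String, Map<String, List<Posting>>> fields = new HashMap<>();

    // Records that `term` occurs `freq` times in `field` of document `doc`.
    void add(String field, String term, int doc, int freq) {
        fields.computeIfAbsent(field, f -> new HashMap<>())
              .computeIfAbsent(term, t -> new ArrayList<>())
              .add(new Posting(doc, freq));
    }

    // Same shape as the while (termDocs.next()) loop: sum freq over all postings.
    int totalFrequency(String field, String term) {
        Map<String, List<Posting>> terms = fields.get(field);
        if (terms == null) {
            return 0;
        }
        List<Posting> postings = terms.get(term);
        if (postings == null) {
            return 0;
        }
        int count = 0;
        for (Posting p : postings) {
            count += p.freq;
        }
        return count;
    }

    // "Run the procedure for all fields": add up the per-field totals.
    int totalAcrossFields(String term, String... fieldNames) {
        int count = 0;
        for (String field : fieldNames) {
            count += totalFrequency(field, term);
        }
        return count;
    }

    public static void main(String[] args) {
        TermFrequencySketch index = new TermFrequencySketch();
        index.add("body", "b", 0, 3);   // doc A: word B appears 3 times
        index.add("body", "b", 1, 2);   // doc C: word B appears 2 times
        index.add("title", "b", 0, 1);  // doc A also has it once in the title

        System.out.println(index.totalFrequency("body", "b"));               // prints 5
        System.out.println(index.totalAcrossFields("b", "body", "title"));   // prints 6
    }
}
```

The catch-all field approach trades index size for a single lookup: with everything concatenated into one field, the single-field sum already covers all fields.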



Answer 3:


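Using Lucene 3.4:

An easy way to get the counts in bulk, but you need two arrays :-/ Note that read() returns the number of entries (documents) read, at most the length of the arrays; the per-document frequencies end up in freqs.

int[] docs = new int[1000];   // filled with document numbers
int[] freqs = new int[1000];  // filled with the term frequency in each of those documents
int count = indexReader.termDocs(term).read(docs, freqs);  // number of entries read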

Beware: once you have called read(), you cannot go back over the same entries with next(), because read() advances the enumeration. If all entries fit into the arrays, you are already at the end:

int[] docs = new int[1000];
int[] freqs = new int[1000];
TermDocs td = indexReader.termDocs(term);
int count = td.read(docs, freqs);
while (td.next()){ // false once read() has consumed every entry: already at the end of the enumeration
}
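
The batching pattern can be sketched without Lucene as follows. PostingCursor is a hypothetical stand-in that imitates the contract assumed here for TermDocs.read(docs, freqs): fill both arrays with up to docs.length entries and return how many were written, with 0 meaning the enumeration is exhausted. Under that assumption, the total frequency is the sum of the filled part of freqs over repeated read() calls, not the return value itself.

```java
// Stdlib-only model of a bulk read; PostingCursor is NOT a Lucene class.
class BulkReadSketch {

    // Hypothetical cursor over parallel arrays of document numbers and frequencies.
    static final class PostingCursor {
        private final int[] docs;
        private final int[] freqs;
        private int pos = 0;

        PostingCursor(int[] docs, int[] freqs) {
            this.docs = docs;
            this.freqs = freqs;
        }

        // Copies up to the output array length and returns the number of entries
        // copied; returns 0 once every entry has been consumed.
        int read(int[] docsOut, int[] freqsOut) {
            int capacity = Math.min(docsOut.length, freqsOut.length);
            int n = Math.min(capacity, docs.length - pos);
            System.arraycopy(docs, pos, docsOut, 0, n);
            System.arraycopy(freqs, pos, freqsOut, 0, n);
            pos += n;
            return n;
        }
    }

    // Keeps reading batches until the cursor is exhausted, summing the frequencies.
    static int totalFrequency(PostingCursor cursor, int batchSize) {
        int[] docs = new int[batchSize];
        int[] freqs = new int[batchSize];
        int total = 0;
        int n;
        while ((n = cursor.read(docs, freqs)) > 0) {
            for (int i = 0; i < n; i++) {
                total += freqs[i];  // only the first n slots are valid
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Three documents containing the term 3, 2 and 4 times.
        PostingCursor cursor = new PostingCursor(new int[] {0, 1, 2}, new int[] {3, 2, 4});
        System.out.println(totalFrequency(cursor, 2));  // prints 9
    }
}
```

Looping until read() returns 0 also removes the dependence on a fixed array size of 1000: a term that occurs in more documents than the arrays can hold is simply consumed in several batches.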


Source: https://stackoverflow.com/questions/4167735/counting-the-word-frequency-in-lucene-index
