How to get a list of all tokens from Lucene 8.6.1 index?

Submitted by 拈花ヽ惹草 on 2021-01-04 06:37:50

Question


I have looked at "How to get a list of all tokens from Solr/Lucene index?", but Lucene 8.6.1 doesn't seem to offer IndexReader.terms(). Has it been moved or replaced? Is there an easier way than this answer?


Answer 1:


Some History

You asked: I'm just wondering if IndexReader.terms() has moved or been replaced by an alternative.

The Lucene v3 method IndexReader.terms() was moved to AtomicReader in Lucene v4. This was documented in the v4 alpha release notes.

(Bear in mind that Lucene v4 was released way back in 2012.)

The method in AtomicReader in v4 takes a field name.

As the v4 release notes state:

One big difference is that field and terms are now enumerated separately: a TermsEnum provides a BytesRef (wraps a byte[]) per term within a single field, not a Term.

The key part there is "per term within a single field". So from that point onward there was no longer a single API call to retrieve all terms from an index.

This approach has carried through to later releases, except that the AtomicReader and AtomicReaderContext classes were renamed to LeafReader and LeafReaderContext in Lucene v5.0.0. See LUCENE-5569.

Recent Releases

That leaves us with the ability to access lists of terms, but only on a per-field basis.

The following Java code is based on the latest release of Lucene (8.7.0), but should also hold true for the version you mention (8.6.1):

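import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

private void getTokensForField(IndexReader reader, String fieldName) throws IOException {
    // An index is made up of one or more segments ("leaves"); each is enumerated separately,
    // so a term that occurs in several segments is printed once per segment.
    List<LeafReaderContext> list = reader.leaves();

    for (LeafReaderContext lrc : list) {
        // terms() returns null if this segment has no terms for the field
        Terms terms = lrc.reader().terms(fieldName);
        if (terms != null) {
            TermsEnum termsEnum = terms.iterator();

            BytesRef term;
            // next() returns null once the enumeration is exhausted
            while ((term = termsEnum.next()) != null) {
                System.out.println(term.utf8ToString());
            }
        }
    }
}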

The example above assumes a reader opened on the index as follows:

private static final String INDEX_PATH = "/path/to/index/directory";
...
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));

If you need to enumerate field names, the code in this question may provide a starting point.
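One possible way to go from per-field enumeration to every term in the index is to collect the indexed field names from each segment's FieldInfos and then walk each field's terms. The following is only a sketch, assuming Lucene 8.x on the classpath; the class name AllTerms and its method names are made up for illustration and are not part of Lucene. A TreeSet is used so that a term occurring in several segments is reported once.

```java
import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;

import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper class for illustration; not part of Lucene.
public final class AllTerms {

    private AllTerms() {}

    /** Names of all fields that were indexed (have postings) in at least one segment. */
    public static Set<String> indexedFieldNames(IndexReader reader) {
        Set<String> names = new TreeSet<>();
        for (LeafReaderContext lrc : reader.leaves()) {
            for (FieldInfo fi : lrc.reader().getFieldInfos()) {
                // Skip fields that are only stored, or only have doc values or points
                if (fi.getIndexOptions() != IndexOptions.NONE) {
                    names.add(fi.name);
                }
            }
        }
        return names;
    }

    /** Distinct terms of one field across all segments, decoded as UTF-8 strings. */
    public static Set<String> termsForField(IndexReader reader, String fieldName)
            throws IOException {
        Set<String> result = new TreeSet<>();
        for (LeafReaderContext lrc : reader.leaves()) {
            Terms terms = lrc.reader().terms(fieldName);
            if (terms == null) {
                continue; // this segment has no terms for the field
            }
            TermsEnum termsEnum = terms.iterator();
            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                result.add(term.utf8ToString());
            }
        }
        return result;
    }
}
```

Calling AllTerms.termsForField(reader, name) for each name returned by AllTerms.indexedFieldNames(reader) then covers the whole index. Holding every term in memory may not be practical for a large index, in which case printing or streaming the terms as in the original method is the safer choice.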

Final Note

I guess you can also access terms on a per-document basis, instead of a per-field basis, as mentioned in the comments. I have not tried this.
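For what it is worth, per-document access in Lucene 8.x goes through term vectors, so it only works for fields that were indexed with term vectors enabled. The sketch below assumes that is the case; the class DocTerms and its method name are hypothetical, and the result is simply empty when no term vector was stored for the document and field.

```java
import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper class for illustration; not part of Lucene.
public final class DocTerms {

    private DocTerms() {}

    /**
     * Terms of a single field in a single document, read from its term vector.
     * Returns an empty set if no term vector was stored for that document and field.
     */
    public static Set<String> termsForDocument(IndexReader reader, int docId, String fieldName)
            throws IOException {
        Set<String> result = new TreeSet<>();
        // getTermVector returns null when term vectors were not indexed
        Terms vector = reader.getTermVector(docId, fieldName);
        if (vector == null) {
            return result;
        }
        TermsEnum termsEnum = vector.iterator();
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            result.add(term.utf8ToString());
        }
        return result;
    }
}
```

Here docId is Lucene's internal document number (0 up to reader.maxDoc() - 1), not an application-level ID. If term vectors were not enabled at index time, the per-field approach above remains the way to go.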



Source: https://stackoverflow.com/questions/64921086/how-to-get-a-list-of-all-tokens-from-lucene-8-6-1-index
