How to get a list of all tokens from Lucene 8.6.1 index using PyLucene?

こ雲淡風輕ζ 提交于 2021-01-05 08:53:15

问题


I have got some direction from this question. I first make the index like below.

import lucene
from  org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import IndexWriterConfig, IndexWriter, DirectoryReader
from org.apache.lucene.store import SimpleFSDirectory
from java.nio.file import Paths
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.util import BytesRefIterator

index_path = "./index"

lucene.initVM()

analyzer = StandardAnalyzer()
config = IndexWriterConfig(analyzer)
if len(os.listdir(index_path))>0:
    config.setOpenMode(IndexWriterConfig.OpenMode.APPEND)

store = SimpleFSDirectory(Paths.get(index_path))
writer = IndexWriter(store, config)

doc = Document()
doc.add(Field("docid", "1",  TextField.TYPE_STORED))
doc.add(Field("title", "qwe rty", TextField.TYPE_STORED))
doc.add(Field("description", "uio pas", TextField.TYPE_STORED))
writer.addDocument(doc)

writer.close()
store.close()

I then try to get all the terms in the index for one field like below.

store = SimpleFSDirectory(Paths.get(index_path))
reader = DirectoryReader.open(store)

Attempt 1: trying to use the next() as used in this question which seems to be a method of BytesRefIterator implemented by TermsEnum.

for lrc in reader.leaves():
    terms = lrc.reader().terms('title')
    terms_enum = terms.iterator()
    while terms_enum.next():
        term = terms_enum.term()
        print(term.utf8ToString())

However, I can't seem to be able to access that next() method.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-47-6515079843a0> in <module>
      2     terms = lrc.reader().terms('title')
      3     terms_enum = terms.iterator()
----> 4     while terms_enum.next():
      5         term = terms_enum.term()
      6         print(term.utf8ToString())

AttributeError: 'TermsEnum' object has no attribute 'next'

Attempt 2: trying to change the while loop as suggested in the comments of this question.

while next(terms_enum):
    term = terms_enum.term()
    print(term.utf8ToString())

However, it seems TermsEnum is not understood to be an iterator by Python.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-48-d490ad78fb1c> in <module>
      2     terms = lrc.reader().terms('title')
      3     terms_enum = terms.iterator()
----> 4     while next(terms_enum):
      5         term = terms_enum.term()
      6         print(term.utf8ToString())

TypeError: 'TermsEnum' object is not an iterator

I am aware that my question can be answered as suggested in this question. Then I guess my question really is, how do I get all the terms in TermsEnum?


回答1:


I found that the below works from here and from test_FieldEnumeration() in the test_Pylucene.py file which is in pylucene-8.6.1/test3/.

for term in BytesRefIterator.cast_(terms_enum):
    print(term.utf8ToString())

Happy to accept an answer that has more explanation than this.



来源:https://stackoverflow.com/questions/64960527/how-to-get-a-list-of-all-tokens-from-lucene-8-6-1-index-using-pylucene

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!