Finding a single fields terms with Lucene (PyLucene)

风流意气都作罢 提交于 2020-01-01 17:20:13

问题


I'm fairly new to Lucene's Term Vectors - and want to make sure my term gathering is as efficient as it possibly can be. I'm getting the unique terms and then retrieving the docFreq() of the term to perform faceting.

I'm gathering all documents terms from the index using:

lindex = SimpleFSDirectory(File(indexdir))
ireader = IndexReader.open(lindex, True)
terms = ireader.terms() #Returns TermEnum

This works fine, but is there a way to only return terms for specific fields (across all documents) - wouldn't that be more efficient?

Such as:

 ireader.terms(Field="country")

回答1:


IndexReader.terms() accepts an optional Field() object. Field objects are composed of two arguments, the Field Name, and Value which lucene calls the "Term Field" and the "Term Text".

By providing a Field argument with an empty value for 'term text' we can start our term iteration at the term we are concerned with.

lindex = SimpleFSDirectory(File(indexdir))
ireader = IndexReader.open(lindex, True)
# Query the lucene index for the terms starting at a term named "field_name"
terms = ireader.terms(Term("field_name", "")) #Start at the field "field_name"
facets = {'other': 0}
while terms.next():
    if terms.term().field() != "field_name":  #We've got every value
        break
    print "Field Name:", terms.term().field()
    print "Field Value:", terms.term().text()
    print "Matching Docs:", int(ireader.docFreq(term))

Hopefully others searching for how to perform faceting in PyLucene will see come across this post. The key is indexing terms as-is. Just for completeness this is how field values should be indexed.

dir = SimpleFSDirectory(File(indexdir))
analyzer = StandardAnalyzer(Version.LUCENE_30)
writer = IndexWriter(dir, analyzer, True, IndexWriter.MaxFieldLength(512))
print "Currently there are %d documents in the index..." % writer.numDocs()
print "Adding %s Documents to Index..." % docs.count()
for val in terms:
    doc = Document()
    #Store the field, as-is, with term-vectors.
    doc.add(Field("field_name", val, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.YES))
    writer.addDocument(doc)

writer.optimize()
writer.close()


来源:https://stackoverflow.com/questions/9091028/finding-a-single-fields-terms-with-lucene-pylucene

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!