How to get frequently occurring phrases with Lucene

难免孤独 2020-12-16 05:54

I would like to get some frequently occurring phrases with Lucene. I am getting some information from TXT files, and I am losing a lot of context for not having information

3 Answers
  •  太阳男子
    2020-12-16 06:37

    Well, the problem of losing context for phrases can be addressed by using PhraseQuery.

    An index by default contains positional information of terms, as long as you did not create pure Boolean fields by indexing with the omitTermFreqAndPositions option. PhraseQuery uses this information to locate documents where terms are within a certain distance of one another.

    For example, suppose a field contained the phrase “the quick brown fox jumped over the lazy dog”. Without knowing the exact phrase, you can still find this document by searching for documents with fields having quick and fox near each other. Sure, a plain TermQuery would do the trick to locate this document knowing either of those words, but in this case we only want documents that have phrases where the words are either exactly side by side (quick fox) or have one word in between (quick [irrelevant] fox). The maximum allowable positional distance between terms to be considered a match is called slop. Distance is the number of positional moves of terms to reconstruct the phrase in order.
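
To make the slop arithmetic concrete, here is a minimal plain-Java sketch for a two-term phrase. It is not Lucene's implementation: the class `SlopSketch` and its methods are hypothetical names made up for illustration, and it assumes simple whitespace tokenization with positions counted from 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Lucene): models slop for a two-term phrase.
class SlopSketch {

    // Collect every position at which `term` occurs in the token array.
    static List<Integer> positions(String[] tokens, String term) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                result.add(i);
            }
        }
        return result;
    }

    // Smallest number of positional moves needed so that `second` sits
    // directly after `first`; returns -1 if either term is missing.
    static int minSlop(String text, String first, String second) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        int best = -1;
        for (int p1 : positions(tokens, first)) {
            for (int p2 : positions(tokens, second)) {
                // In the exact phrase, `second` would be at position p1 + 1.
                int distance = Math.abs(p2 - (p1 + 1));
                if (best < 0 || distance < best) {
                    best = distance;
                }
            }
        }
        return best;
    }

    // A document matches when the required moves do not exceed the slop.
    static boolean matches(String text, String first, String second, int slop) {
        int distance = minSlop(text, first, second);
        return distance >= 0 && distance <= slop;
    }
}
```

For the sentence above, `minSlop(text, "quick", "fox")` is 1, so the document matches with slop 1 but not with slop 0, while the reversed phrase `"fox"`, `"quick"` needs 3 moves. In Lucene itself the slop is set on the query, for example with `PhraseQuery.Builder` and its `setSlop` method in recent versions.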

    Check out Lucene's JavaDoc for PhraseQuery.

    See this example code which demonstrates how to work with various Query Objects:

    You can also try to combine various query types with the help of the BooleanQuery class.
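
As a rough picture of what combining clauses means, the following plain-Java sketch mimics the MUST / SHOULD / MUST_NOT semantics of boolean clauses. It does not use Lucene; `BooleanSketch`, its `Occur` enum, and the `term` helper are hypothetical names chosen for illustration, and documents are assumed to be plain whitespace-separated strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in (not Lucene's BooleanQuery): combines simple clauses.
class BooleanSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    private final List<Predicate<String>> must = new ArrayList<>();
    private final List<Predicate<String>> should = new ArrayList<>();
    private final List<Predicate<String>> mustNot = new ArrayList<>();

    // A clause that is true when the document contains the given token.
    static Predicate<String> term(String token) {
        return doc -> Arrays.asList(doc.toLowerCase().trim().split("\\s+")).contains(token);
    }

    BooleanSketch add(Predicate<String> clause, Occur occur) {
        switch (occur) {
            case MUST: must.add(clause); break;
            case SHOULD: should.add(clause); break;
            default: mustNot.add(clause); break;
        }
        return this;
    }

    boolean matches(String doc) {
        // Every MUST clause has to match.
        for (Predicate<String> c : must) {
            if (!c.test(doc)) return false;
        }
        // No MUST_NOT clause may match.
        for (Predicate<String> c : mustNot) {
            if (c.test(doc)) return false;
        }
        // With at least one MUST clause, SHOULD clauses are optional.
        if (!must.isEmpty()) return true;
        // Otherwise at least one SHOULD clause has to match.
        for (Predicate<String> c : should) {
            if (c.test(doc)) return true;
        }
        return false;
    }
}
```

For example, a query built with `term("quick")` as MUST and `term("cat")` as MUST_NOT accepts "the quick brown fox" and rejects "the quick cat". In Lucene the clauses would be real queries (a PhraseQuery, a TermQuery, and so on) added with the matching `BooleanClause.Occur` value.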

    As for the frequency of phrases, I suppose Lucene's scoring takes into account how often the query terms occur in each document.
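
If the goal is simply to see which two-word phrases occur most often, the counting itself can be sketched in plain Java, independent of scoring. This is only an illustration under simple assumptions (whitespace tokenization, lowercasing, bigrams only); `PhraseCounts` and `bigramCounts` are hypothetical names, not Lucene API. Within Lucene, the `ShingleFilter` token filter from the analysis module emits word n-grams of this kind as terms.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Lucene): counts adjacent word pairs.
class PhraseCounts {

    // Map each two-word phrase in the text to its number of occurrences.
    static Map<String, Integer> bigramCounts(String text) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            String phrase = tokens[i] + " " + tokens[i + 1];
            counts.merge(phrase, 1, Integer::sum);
        }
        return counts;
    }
}
```

Sorting the resulting map by value then gives the most frequent phrases for the text read from each TXT file.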
