How to define a primary key field in a Lucene document to get the best lookup performance?

Submitted by 那年仲夏 on 2020-01-23 07:52:36

Question


When creating a document in my Lucene index (v7.2), I add a uid field to it which contains a unique id/key (string):

doc.add(new StringField("uid", uid, Field.Store.YES));
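
For context, here is a minimal end-to-end indexing sketch. It assumes Lucene 7.x; the index path, the uid value, and the use of updateDocument for primary-key semantics are illustrative assumptions, not part of the original question:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class IndexUid {
    public static void main(String[] args) throws Exception {
        // Hypothetical index location and uid value, for illustration only.
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            String uid = "user-42";
            Document doc = new Document();
            // StringField indexes the value as a single, untokenized term.
            doc.add(new StringField("uid", uid, Field.Store.YES));
            // updateDocument atomically deletes any existing document with the
            // same uid term before adding this one, i.e. primary-key semantics.
            writer.updateDocument(new Term("uid", uid), doc);
        }
    }
}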

To retrieve that document later on, I create a TermQuery for the given unique id and search for it with an IndexSearcher:

searcher.search(new TermQuery(new Term("uid", uid)), 1);
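
A matching lookup sketch, again assuming Lucene 7.x and the hypothetical index path from above:

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class LookupUid {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // With unique uids a TermQuery matches at most one document,
            // so asking for the single top hit is enough.
            TopDocs hits = searcher.search(new TermQuery(new Term("uid", "user-42")), 1);
            if (hits.scoreDocs.length > 0) {
                System.out.println(searcher.doc(hits.scoreDocs[0].doc).get("uid"));
            }
        }
    }
}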

Being a Lucene "novice", I would like to know the following:

  1. How should I improve this approach to get the best lookup performance? Would it, for example, make a difference if I store the unique id as a byte array instead of as a string? Or are there some special codecs or filters that can be used?

  2. What is the time complexity of looking up a document by its unique id? Since the index contains at least one unique term for each document, the lookup times will increase linearly with the number of documents (O(n)), right?


Answer 1:


Theory

There is a blog post about Lucene's term index and lookup performance. It lays out in detail the complexity of looking up a document by id. The post is quite old, but little about this machinery has changed since then.

Here are some highlights related to your question:

  • Lucene is a search engine whose minimum unit of retrieval is a text term. This means that binary, numeric, and string fields are all represented as strings in the BlockTree terms dictionary.
  • In general, lookup complexity depends on the term length: Lucene uses an in-memory prefix-trie index structure to perform term lookups. Because of real-world hardware and software constraints (to avoid superfluous disk reads and memory overflow for extremely large tries), Lucene uses a BlockTree structure: the prefix trie is stored in small chunks on disk, and only one chunk is loaded at a time. This is why it is so important to generate keys in an easy-to-read order. Ranking the factors by their degree of influence (a short key-scheme sketch follows this list):
    • term length - longer terms mean more chunks to load
    • term pattern - a predictable pattern avoids superfluous reads
    • term count - fewer terms mean fewer chunks
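
To make the key-order advice concrete, here is a hypothetical comparison of two uid schemes. The comments describe the expected effect on BlockTree chunk loading, inferred from the reasoning above rather than measured:

public class UidSchemes {
    private static long seq = 0;

    // Fixed-width, monotonically increasing keys are short, share long prefixes,
    // and keep neighbouring ids in the same BlockTree chunks.
    static String sequentialUid() {
        return String.format("%012d", seq++);   // e.g. "000000004711"
    }

    // Random UUIDs are longer and scatter uniformly over the key space,
    // so lookups tend to touch many different chunks.
    static String randomUid() {
        return java.util.UUID.randomUUID().toString();
    }
}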

Algorithms and Complexity

Let a term be a single string and let the term dictionary be a large set of terms. If we have a term dictionary and need to know whether a single term is in it, a trie (and its subclass, the minimal deterministic acyclic finite state automaton, or DAFSA) is the data structure that can help us. You may ask: why use tries if a hash lookup can do the same? There are a few reasons (a minimal trie sketch follows this list):

  • A trie can find a string in O(L) time, where L is the length of the term. This is faster than a hash table in the worst case (a hash table degrades to a linear scan on hash collisions, and needs a sophisticated hashing algorithm such as MurmurHash3), and comparable to a hash table in the ideal case.
  • A hash table can only find terms that exactly match the term we are looking for, whereas a trie also lets us find terms that differ by a single character, share a common prefix, have a missing character, and so on.
  • A trie keeps entries ordered by key, so all terms can be enumerated in alphabetical order.
  • A trie (and especially a DAFSA) provides a very compact representation of terms, with deduplication of shared parts.
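
To make the O(L) bound concrete, here is a minimal, simplified trie in Java. It illustrates the data structure only; it is not Lucene's actual BlockTree implementation:

import java.util.HashMap;
import java.util.Map;

class Trie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isTerm; // true if a dictionary term ends at this node
    }

    private final Node root = new Node();

    void insert(String term) {
        Node node = root;
        for (char c : term.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isTerm = true;
    }

    // Each character costs one map lookup, so the running time is O(L)
    // for a term of length L, independent of the dictionary size.
    boolean contains(String term) {
        Node node = root;
        for (char c : term.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.isTerm;
    }
}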

Here is an example of a DAFSA for the three terms bat, bath, and batch:
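
Sketched as text (double parentheses mark accepting states):

(0) --b--> (1) --a--> (2) --t--> ((3))
((3)) --h--> ((4))
((3)) --c--> (5)
(5) --h--> ((4))

Note that bath and batch share the single final state ((4)); merging such equivalent states is exactly the deduplication that distinguishes a DAFSA from a plain trie.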

For a key lookup, note that descending one level in the automaton (or trie) takes constant time, and each descent consumes one character of the term; therefore finding a term in the automaton (trie) takes O(L) time.



Source: https://stackoverflow.com/questions/48050830/how-to-define-a-primary-key-field-in-a-lucene-document-to-get-the-best-lookup-pe
