Question
I'm adding a DocValue to a document with
doc.add(new BinaryDocValuesField("foo",new BytesRef("bar")));
To retrieve that value for a specific document with ID docId, I call
DocValues.getBinary(reader,"foo").get(docId).utf8ToString();
The get function in BinaryDocValues is supported up to Lucene 6.6, but for Lucene 7.0 and up it does not seem to be available anymore.
So, how do I get the DocValue by document ID in Lucene 7+ (without having to iterate over BinaryDocValues / DocIdSetIterator, and without having to re-get BinaryDocValues and use advanceExact every time)?
Answer 1:
Theory
Doc values are Lucene's column-stride field value storage. They were intended to be fast for random access at query time, mainly for faceting and sorting. The issue LUCENE-7407 switched the access pattern from random access to an iterator. Because an iterator API is a much more restrictive access pattern than an arbitrary random-access API, this change gives Lucene more freedom to use aggressive compression and other optimizations:
- reduced disk space usage for sparse data
- better compression ratio and decoding speed for doc values, even in the non-sparse case
- removal of the special column for missing values (getDocsWithField) and of thread-local codec readers
You can read about this change in the following blogs:
- Doc values as iterators
- Sparse versus dense document values with Apache Lucene
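To make the point about sparse data concrete, here is a toy model in plain Java. It is only an illustration and not how Lucene's codecs are actually implemented; the class SparseColumn and its methods are made up for this sketch. Because the caller may only move forward, the column can store just the documents that have a value, and a single cursor is enough to answer "does this document have a value?" without a separate missing-values structure.

```java
/** Toy model only; this is not how Lucene's codecs are implemented. */
public final class SparseColumn {
    private final int[] docIds;    // sorted IDs of the documents that have a value
    private final String[] values; // values[i] belongs to docIds[i]
    private int cursor = 0;        // index of the first entry not yet passed
    private int current = -1;      // last target passed to advanceExact

    public SparseColumn(int[] sortedDocIds, String[] values) {
        this.docIds = sortedDocIds.clone();
        this.values = values.clone();
    }

    /** Moves forward to target and reports whether that document has a value. */
    public boolean advanceExact(int target) {
        if (target < current) {
            throw new IllegalArgumentException("iterator can only move forward");
        }
        current = target;
        // Skip stored entries that belong to earlier documents.
        while (cursor < docIds.length && docIds[cursor] < target) {
            cursor++;
        }
        return cursor < docIds.length && docIds[cursor] == target;
    }

    /** Value of the current document; only valid after advanceExact returned true. */
    public String value() {
        return values[cursor];
    }

    public static void main(String[] args) {
        // Three documents out of a large index have a value; only three entries are stored.
        SparseColumn col = new SparseColumn(
            new int[] {3, 500_000, 999_999},
            new String[] {"a", "b", "c"});
        System.out.println(col.advanceExact(3));       // true
        System.out.println(col.value());               // a
        System.out.println(col.advanceExact(4));       // false
        System.out.println(col.advanceExact(999_999)); // true
        System.out.println(col.value());               // c
    }
}
```

A random-access get(docId) on the same layout would need a search per call or a dense array with one slot per document, which is the kind of trade-off the iterator API avoids.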
Practice
In practice this change causes performance degradation in some cases, for example SOLR-9599. In the main use cases (faceting and sorting) an iterator API is fine when used properly and even allows some further optimizations. Still, there are a lot of cases where this API is not a good solution. All of these cases were dismissed as incorrect usage (the same problem we had in the Java world with sun.misc.Unsafe).
That said, org.apache.lucene.index.DocValuesIterator#advanceExact is quite fast, and for some implementations its performance and complexity are similar to random access.
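A minimal sketch of a lookup by doc ID built on advanceExact in Lucene 7+. BinaryDocValueLookup is a hypothetical helper, not part of Lucene. It assumes segment-local doc IDs (the kind a LeafReader uses) and single-threaded use. Since advanceExact may only be called with a target greater than or equal to the current doc ID, the helper reuses the iterator while doc IDs do not decrease and re-gets it only when the caller goes backwards.

```java
import java.io.IOException;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.BytesRef;

/** Hypothetical helper (not part of Lucene): binary doc value lookup within one segment. */
public final class BinaryDocValueLookup {
    private final LeafReader reader;
    private final String field;
    private BinaryDocValues values; // forward-only iterator, reused while doc IDs do not decrease

    public BinaryDocValueLookup(LeafReader reader, String field) {
        this.reader = reader;
        this.field = field;
    }

    /** Returns the value for the segment-local docId, or null if the document has none. */
    public String get(int docId) throws IOException {
        // Fetch an iterator on first use, and a fresh one whenever the
        // requested doc ID lies before the iterator's current position.
        if (values == null || values.docID() > docId) {
            values = DocValues.getBinary(reader, field);
        }
        if (!values.advanceExact(docId)) {
            return null; // no value stored for this document
        }
        BytesRef ref = values.binaryValue();
        return ref.utf8ToString();
    }
}
```

With this, new BinaryDocValueLookup(leafReader, "foo").get(docId) plays the role of the old DocValues.getBinary(reader, "foo").get(docId).utf8ToString(). Sorting the doc IDs before looking them up keeps the number of re-gets at zero. For a top-level doc ID, the segment can be resolved first with ReaderUtil.subIndex(docId, reader.leaves()), subtracting that leaf's docBase.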
Source: https://stackoverflow.com/questions/48474506/how-to-get-docvalue-by-document-id-in-lucene-7