element word positions - conceptual questions

Question

I'm trying to understand the impact of the element word positions index setting. The following XQuery returns the plan of a simple element-word-query search:

xdmp:plan(cts:search(doc(), 
  cts:and-query(
    cts:element-word-query(xs:QName("name"), "element word position")
  ),
  ("unfiltered")
))

And the final-plan if the index is not activated (reduced form to save space):

<qry:and-query>
    <qry:term-query>element(name),pair(word("element"),word("word"))</qry:term-query>
    <qry:term-query>element(name),pair(word("word"),word("position"))</qry:term-query>
    <qry:term-query>word("element")</qry:term-query>
    <qry:term-query>word("word")</qry:term-query>
    <qry:term-query>word("position")</qry:term-query>
</qry:and-query>

Query plan after the index is activated (word-positions and also element word positions):

<qry:and-query>
    <qry:term-query>element(name),pair(word("element"),word("word"))</qry:term-query>
    <qry:term-query>element(name),pair(word("word"),word("position"))</qry:term-query>
    <qry:element-query>
        element(name)
        <qry:word-query>
            <qry:KP pos="0">word("element")</qry:KP>
            <qry:KP pos="1">word("word")</qry:KP>
            <qry:KP pos="2">word("position")</qry:KP>
        </qry:word-query>
    </qry:element-query>
</qry:and-query>

So I assume that, because far fewer term-query entries are generated, the resulting candidate fragment ID count is going to be smaller and thus the intersection at index resolution is faster. Other than that, I'd really like to understand how an element-query works under the hood. So I've got a few questions:

What kind of additional information is saved in the index if element word positions is activated?
What would the index and posting list look like? Is the key only the element, or an element+word combination? Are there any graphical resources that visualize it? (Not expecting you to draw something.)
Also, how does an element-query execute? I see how a simple term-query returns the posting list of the term key, but I am not sure how an element-query with a word-query as a "sub-query" is evaluated.

Edit: Added a picture to visualize my understanding of how the index might look with element word positions enabled. (See the comments on mholstege's answer for details.)

Answer 1:

When you turn on positions, we store a positions vector for each document in the index for the relevant term, instead of just the document id.
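As a rough mental model only (this is a toy in Python, not MarkLogic's actual on-disk format, and the names `build_indexes`, `ids_only` and `with_positions` are made up for illustration), the difference between the two kinds of posting list can be sketched like this:

```python
from collections import defaultdict

def build_indexes(docs):
    """docs: {doc_id: text}. Returns (ids_only, with_positions).

    ids_only:       term -> set of document ids (no positions)
    with_positions: term -> {doc_id: [word positions in that document]}
    """
    ids_only = defaultdict(set)
    with_positions = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            ids_only[word].add(doc_id)
            with_positions[word].setdefault(doc_id, []).append(pos)
    return ids_only, with_positions

# Two hypothetical documents, tokenized naively on whitespace.
docs = {1: "element word position", 2: "position word element word"}
ids_only, with_positions = build_indexes(docs)
```

Here `ids_only["word"]` can only say that documents 1 and 2 contain the word, while `with_positions["word"]` also records where it occurs in each document.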

The way to think about this is in terms of the specificity of the leaf queries and the work involved in calculating them and intersecting intermediate results.

When you see a term-query in the query plan, that means it is just looking up document ids, so there is no knowledge of relative positioning -- a less accurate result for a long phrase like this, because the "element word" and "word position" could be occurring in two separate parent elements in the document. If your data only ever has one element with this name in each document, that could not happen, although you could still have false matches where the two-word subphrases occur in, say, the reverse order, or separated by other words.
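To make the false-match cases concrete, here is a small Python sketch. It assumes a simplified model in which a document is a list of (element name, text) instances and the index holds only the two kinds of term key that appear in the first plan; the key tuples and helper names are hypothetical.

```python
def term_keys(doc):
    """doc: list of (element_name, text) instances. Returns the toy term keys."""
    keys = set()
    for elem, text in doc:
        words = text.lower().split()
        keys.update(("word", w) for w in words)
        # Element-scoped two-word terms, like element(name),pair(...) in the plan.
        keys.update(("elem-pair", elem, a, b) for a, b in zip(words, words[1:]))
    return keys

# The five term queries from the plan without positions.
plan = {
    ("elem-pair", "name", "element", "word"),
    ("elem-pair", "name", "word", "position"),
    ("word", "element"),
    ("word", "word"),
    ("word", "position"),
}

def matches(doc):
    """An and-query over term queries: every term key must be present."""
    return plan <= term_keys(doc)

good = [("name", "element word position")]
split = [("name", "element word"), ("name", "word position")]
reordered = [("name", "word position element word")]
other_element = [("title", "element word position")]
```

In this model `matches(good)`, `matches(split)` and `matches(reordered)` are all true, although only `good` contains the full phrase inside one `name` element. `matches(other_element)` is false because the pair terms are scoped to the element name.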

When you see word-query in the query plan, that means we are going to be looking at positions, and here you see the relative positions for each of the words in the phrase. When this is resolved, we examine the positions vector and toss out the ones that don't meet this positional constraint. So all the matches will have this sequence of words in this order: a more precise match.
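The positional check can be pictured with a short Python sketch. It assumes the positions vector for one document is available as a plain dict of word to positions, and it treats the `pos` attribute of each `KP` entry as an offset from the start of the phrase; `phrase_starts` is a made-up helper, not a MarkLogic function.

```python
def phrase_starts(positions, phrase):
    """positions: {word: [pos, ...]} for one document; phrase: list of words.

    Returns the start positions at which every word of the phrase occurs
    at its expected offset (start + 0, start + 1, ...).
    """
    starts = set(positions.get(phrase[0], []))
    for offset, word in enumerate(phrase):
        # Shift each occurrence back by its offset to get a candidate start.
        starts &= {p - offset for p in positions.get(word, [])}
    return sorted(starts)

phrase = ["element", "word", "position"]
in_order = {"element": [0], "word": [1], "position": [2]}
reversed_doc = {"element": [2], "word": [1], "position": [0]}
```

`phrase_starts(in_order, phrase)` keeps start position 0, while `phrase_starts(reversed_doc, phrase)` is empty: the document contains all three words but not in the required order, so it is tossed out.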

The element-query in the plan is also applying positional constraints, of the element instances relative to the matches inside the element. There are optimizations where the element positional constraints are actually pushed down to the leaves of the query tree to avoid excess intermediate calculations.
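One way to picture the element constraint, purely as an assumption for illustration, is to treat each element instance as a range of word positions and require the whole phrase match to fall inside a single range. The Python sketch below does that check after the fact; it does not model the push-down optimization, and `phrase_in_element` is a hypothetical helper.

```python
def phrase_in_element(element_spans, starts, phrase_len):
    """element_spans: [(first_pos, last_pos), ...] for one element name.
    starts: start positions of phrase matches in the same document.

    Keeps only the matches that lie entirely inside one element instance.
    """
    return [
        s for s in starts
        if any(lo <= s and s + phrase_len - 1 <= hi for lo, hi in element_spans)
    ]

# <name>element word position</name>: one instance covering positions 0..2.
one_element = [(0, 2)]
# <name>element word</name><name>position</name>: two instances.
two_elements = [(0, 1), (2, 2)]
```

With a phrase match starting at position 0 and a phrase length of 3, `phrase_in_element(one_element, [0], 3)` keeps the match, and `phrase_in_element(two_elements, [0], 3)` drops it because the phrase straddles two `name` instances.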

You also see some technically redundant term queries: the point of these is to do simple term lookups that are probably more constrained than the leaf word queries. Since intersection of term lists from an and-query always proceeds from the shortest matching posting list, this can provide a fail-fast mechanism to avoid the more expensive positions calculations. There is a certain amount of heuristic judgement in that, and given a complex set of index options and query variations, sometimes those additional terms are, in fact, not helpful.
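The fail-fast idea can be sketched in Python as follows, assuming posting lists are simply sets of document ids. This toy shows only the ordering and early exit, not the heuristics mentioned above.

```python
def intersect_shortest_first(posting_lists):
    """Intersect posting lists, starting from the shortest one.

    Stops as soon as the running result is empty, so longer (and, in a real
    engine, more expensive) lists are never examined in that case.
    """
    if not posting_lists:
        return set()
    ordered = sorted(posting_lists, key=len)
    result = set(ordered[0])
    for plist in ordered[1:]:
        if not result:
            break  # fail fast: nothing can match any more
        result &= set(plist)
    return result

# Hypothetical posting lists: a rare pair term and two common word terms.
pair_term = {2}
common_word_a = {1, 2, 3, 4}
common_word_b = {2, 3}
```

Calling `intersect_shortest_first([common_word_a, pair_term, common_word_b])` starts from `pair_term`, so only document 2 would be carried forward to any positional check. If the rare term had an empty posting list, the loop would stop before touching the longer lists.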

Source: https://stackoverflow.com/questions/53948303/element-word-positions-conceptual-questions

Tags

marklogic