DocumentDB GUID Index Precision

Question

Let's say we have a non-unique GUID/UUID value in our documents:

[
  {
    "id": "123456",
    "Key": "117dfd49-a71d-413b-a9b1-841e88db06e8",
    "Name": "Kaapstad"
  },
  ...
]

We want to query on this value by equality only; no range or ORDER BY queries are required. For example:

SELECT * FROM c WHERE c.Key = "117dfd49-a71d-413b-a9b1-841e88db06e8"

Below is the index definition. It is a hash index (since no range queries will be performed) on the String data type (since JavaScript has no native GUID type).

collection.IndexingPolicy.IncludedPaths.Add(
    new IncludedPath { 
        Path = "/Key/?", 
        Indexes = new Collection<Index> { 
            new HashIndex(DataType.String) { Precision = -1 }
        }
    });

But what is the best indexing precision for this?

This MSDN page doesn't make it clear to me as to what precision value would be most suited to such a value:

Index precision configuration is more useful with string ranges. Since strings can be any arbitrary length, the choice of the index precision can impact the performance of string range queries, and impact the amount of index storage space required. String range indexes can be configured with 1-100 or -1 ("maximum"). If you would like to perform Order By queries against string properties, then you must specify a precision of -1 for the corresponding paths.

Answer 1:

You can fine-tune the indexing precision depending on the number of documents you expect to contain the indexed path (the Key property in your example).

The indexing precision for a hash index indicates the number of bytes to hash the property value to. Thus, lowering the precision value helps optimize the amount of storage required to store the index. Raising the precision value (in the context of a hash index) helps guard against hash collisions on the index.

For example, let's assume a hash index precision value of 3 on the path foo.

3 bytes = 3 * 8 = 24 bits.

24 bits can support: 2^24 = 16,777,216 values

By the pigeonhole principle, you are guaranteed to have a hash collision once more than 16,777,216 distinct values are stored at the foo path. On a hash collision, DocumentDB then needs to scan the subset of documents that share the hash. For example, if you had 30,000,000 documents with a foo property, you could expect to scan about 2 documents on average (30,000,000 / 16,777,216 ≈ 1.8).
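The arithmetic above can be sketched in a few lines of Python to compare candidate precision values. This is only a rough model: it assumes the hash spreads values uniformly, and the function names (hash_buckets, avg_docs_per_bucket, collision_probability) are made up for illustration and are not part of any DocumentDB SDK. The collision estimate uses the standard birthday-bound approximation, which the answer itself does not mention.

```python
import math


def hash_buckets(precision_bytes: int) -> int:
    """Number of distinct hash values available at a given precision in bytes."""
    return 2 ** (8 * precision_bytes)


def avg_docs_per_bucket(num_docs: int, precision_bytes: int) -> float:
    """Average number of documents per hash value, assuming a uniform hash."""
    return num_docs / hash_buckets(precision_bytes)


def collision_probability(num_values: int, precision_bytes: int) -> float:
    """Birthday-bound approximation of the chance that at least two of
    num_values distinct values share a hash (assumption: uniform hash)."""
    buckets = hash_buckets(precision_bytes)
    exponent = num_values * (num_values - 1) / (2 * buckets)
    return -math.expm1(-exponent)  # equals 1 - exp(-exponent)


if __name__ == "__main__":
    docs = 30_000_000  # document count taken from the example above
    for precision in (3, 4, 8):
        print(
            f"precision={precision}: "
            f"{hash_buckets(precision):,} hash values, "
            f"{avg_docs_per_bucket(docs, precision):.6f} documents per hash value"
        )
```

Under this model the pigeonhole bound is only the point where a collision becomes certain; collisions become likely much earlier, so the expected document count should sit well below the number of available hash values if scans are to stay rare.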

Source: https://stackoverflow.com/questions/32732858/documentdb-guid-index-precision

Tags

azure

azure-cosmosdb