Fastest data structure for filtering schema-less collections

Question

Let's say I have a collection:

var data = [
  { fieldA: 5 },
  { fieldA: 142, fieldB: 'string' },
  { fieldA: 1324, fieldC: 'string' },
  { fieldB: 'string', fieldD: 111, fieldZ: 'somestring' },
  ...
];

Let's assume the fields are not uniform across elements, but I know the number of unique fields in advance, and the collection is not dynamic.

I want to filter it with something like _.findWhere. This is simple enough, but what if I want to prioritize speed over ease of use? Is there a better data structure that always minimizes the number of elements that have to be checked? Perhaps some kind of tree?

Answer 1:

Yes, there is something faster if your queries are of the type "give me all records with fieldX=valueY". However, it has an overhead: the index takes extra memory and has to be built before the first query.

For each field, build an inverted index that maps each value of that field to the list of record ids (= row positions in the original data) that have that value:

var indexForEachField = {
    fieldA: { "5": [0], "142": [1], "1324": [2]},
    ...
}
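
As a minimal sketch of how such an index could be built in one pass over the data (the helper name buildIndex is made up for this illustration, and values are converted to strings because plain object keys are strings, so 5 and "5" end up in the same bucket):

```javascript
// Hypothetical helper: builds field name -> value -> array of row positions.
function buildIndex(data) {
    var index = Object.create(null);          // no inherited keys such as "toString"
    for (var row = 0; row < data.length; row++) {
        var record = data[row];
        for (var field in record) {
            if (!Object.prototype.hasOwnProperty.call(record, field)) continue;
            var byValue = index[field] || (index[field] = Object.create(null));
            var key = String(record[field]);  // assumption: values are primitives
            (byValue[key] || (byValue[key] = [])).push(row);
        }
    }
    return index;
}

var data = [
    { fieldA: 5 },
    { fieldA: 142, fieldB: 'string' },
    { fieldA: 1324, fieldC: 'string' },
    { fieldB: 'string', fieldD: 111, fieldZ: 'somestring' }
];
var indexForEachField = buildIndex(data);
```

Because rows are visited in increasing order, every array of row positions comes out sorted in ascending order.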

When someone asks for "records where fieldX=valueY", you return

indexForEachField["fieldX"]["valueY"]; // an array with all results

Lookup time is therefore constant on average, independent of the size of the collection (it takes only two hash-table lookups), but you do need to keep the index up to date if the data ever changes.
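
Note that the bare double lookup throws if the field does not appear in the index at all. A small wrapper can return an empty array instead; this is only a sketch, and lookup and sampleIndex are names invented for the example:

```javascript
// Hypothetical helper: row positions for field=value, or [] when nothing matches.
function lookup(index, field, value) {
    var byValue = index[field];
    if (!byValue) return [];                  // field never occurs in the data
    var rows = byValue[String(value)];
    return Array.isArray(rows) ? rows : [];   // value never occurs for this field
}

// Small hand-written index in the same shape as above, for illustration only.
var sampleIndex = {
    fieldA: { "5": [0], "142": [1], "1324": [2] },
    fieldB: { "string": [1, 3] }
};

var hits = lookup(sampleIndex, "fieldB", "string");   // [1, 3]
var none = lookup(sampleIndex, "fieldQ", "anything"); // []
```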

This is a generalization of the strategy used by search engines to look up webpages with certain terms; in that scenario, it is called an inverted index.

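Edit: what if you want to find all records with fieldX=valueX and fieldY=valueY?

You would intersect the two result arrays with the following code, which requires both input arrays to be sorted in ascending order (they are, if each index was built by scanning the records in order):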

var a = indexForEachField["fieldX"]["valueX"];
var b = indexForEachField["fieldY"]["valueY"];
var c = []; // result array: all elements in a AND in b
for (var i=0, j=0; i<a.length && j<b.length; /**/) {
    if (a[i] < b[j]) {
       i++;
    } else if (a[i] > b[j]) {
       j++;
    } else {
       c.push(a[i]);
       i++; j++;
    }
}

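You can see that every iteration advances at least one of the two pointers, so in the worst case the loop runs a.length + b.length - 1 times, and in the best case only min(a.length, b.length) times. Either way the cost is linear in the size of the two result lists, not in the size of the whole collection. You can use something very similar to implement OR.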
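
For OR, a sketch along the same lines merges the two sorted arrays and keeps each row position once. The function name unionSorted is made up for this example, and both inputs are assumed to be sorted in ascending order without duplicates:

```javascript
// Hypothetical helper: all elements that are in a OR in b, in ascending order.
function unionSorted(a, b) {
    var c = [];
    var i = 0, j = 0;
    while (i < a.length && j < b.length) {
        if (a[i] < b[j]) {
            c.push(a[i++]);       // only in a (so far)
        } else if (a[i] > b[j]) {
            c.push(b[j++]);       // only in b (so far)
        } else {
            c.push(a[i]);         // in both: emit once, advance both
            i++; j++;
        }
    }
    while (i < a.length) c.push(a[i++]);  // leftovers from a
    while (j < b.length) c.push(b[j++]);  // leftovers from b
    return c;
}

var either = unionSorted([0, 1, 4], [1, 3]); // [0, 1, 3, 4]
```

Unlike the intersection, the union has to copy every element, so it always takes time proportional to a.length + b.length.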

Source: https://stackoverflow.com/questions/29135542/fastest-datastructure-for-filtering-schema-less-collections

Tags

javascript

algorithm

data-structures

collections