Efficient data structure for storing a long sequence of (mostly consecutive) integers

前端 未结 5 452
礼貌的吻别
礼貌的吻别 2021-01-04 07:00

I\'d like a data structure to efficiently store a long sequence of numbers. The numbers should always be whole integers, let\'s say Longs.

The feature of the inputs

5条回答
  •  余生分开走
    2021-01-04 07:44

    The problem with storing each interval as a (start,end) couple is that if you add a new range that encompasses N previously stored intervals, you have to destroy each of these intervals, which takes N steps, whether the intervals are stored in a tree, a rope or a linked list.

    (You could leave them for automatic garbage collection, but that will take time too, and only works in some languages.)

    A possible solution for this could be to store the values (not the start and end point of intervals) in an N-ary tree, where each node represents a range, and stores two N-bit maps, representing N sub-ranges and whether the values in those sub-ranges are all present, all absent, or mixed. In the case of mixed, there would be a pointer to a child node which represents this rub-range.

    Example: (using a tree with branching factor 8 and height 2)

    full range: 0-511 ; store interval 100-300
    
    0-511:
      0- 63   64-127  128-191  192-255  256-319  320-383  384-447  448-511
       0      mixed      1        1      mixed      0        0        0
    
    64-127:
     64- 71   72- 79   80- 87   88- 95   96-103  104-111  112-119  120-127
       0        0        0        0      mixed      1        1        1
    
    96-103:
     96   97   98   99  100  101  102  103
      0    0    0    0    1    1    1    1
    
    256-319:
    256-263  264-271  272-279  280-287  288-295  296-303  304-311  312-319
       1        1        1        1        1      mixed      0        0
    
    296-303:
    296  297  298  299  300  301  302  303
      1    1    1    1    1    0    0    0
    

    So the tree would contain these five nodes:

    - values: 00110000, mixed: 01001000, 2 pointers to sub-nodes  
      - values: 00000111, mixed: 00001000, 1 pointer to sub-node
        - values: 00001111, mixed: 00000000
      - values: 11111000, mixed: 00000100, 1 pointer to sub-node
        - values: 11111000, mixed: 00000000
    

    The point of storing the interval this way is that you can discard an interval without having to actually delete it. Let's say you add a new range 200-400; in that case, you'd change the range 256-319 in the root node from "mixed" to "1", without deleting or updating the 256-319 and 296-303 nodes themselves; these nodes can be kept for later re-use (or disconnected and put in a queue of re-usable sub-trees, or deleted in a programmed garbage-collection when the programme is idling or running low on memory).

    When looking up an interval, you only have to go as deep down the tree as necessary; when looking up e.g. 225-275, you'd find that 192-255 is all-present, 256-319 is mixed, 256-263 and 264-271 and 272-279 are all-present, and you'd know the answer is true. Since these values would be stored as bitmaps (one for present/absent, one for mixed/solid), all the values in a node could be checked with only a few bitwise comparisons.

    Re-using nodes:

    If a node has a child node, and the corresponding value is later set from mixed to all-absent or all-present, the child node no longer holds relevant values (but it is being ignored). When the value is changed back to mixed, the child node can be updated by setting all its values to its value in the parent node (before it was changed to mixed) and then making the necessary changes.

    In the example above, if we add the range 0-200, this will change the tree to:

    - values: 11110000, mixed: 00001000, 2 pointers to sub-nodes  
      - (values: 00000111, mixed: 00001000, 1 pointer to sub-node)
        - (values: 00001111, mixed: 00000000)
      - values: 11111000, mixed: 00000100, 1 pointer to sub-node
        - values: 11111000, mixed: 00000000
    

    The second and third node now contain outdated values, and are being ignored. If we then delete the range 80-95, the value for range 64-127 in the root node is changed to mixed again, and the node for range 64-127 is re-used. First we set all values in it to all-present (because that was the previous value of the parent node), and then we set the values for 80-87 and 88-95 to all-absent. The third node, for range 96-103 remains out-of-use.

    - values: 00110000, mixed: 01001000, 2 pointers to sub-nodes  
      - values: 11001111, mixed: 00000000, 1 pointer to sub-node
        - (values: 00001111, mixed: 00000000)
      - values: 11111000, mixed: 00000100, 1 pointer to sub-node
        - values: 11111000, mixed: 00000000
    

    If we then added value 100, the value for range 96-103 in the second node would be changed to mixed again, and the third node would be updated to all-absent (its previous value in the second node) and then value 100 would be set to present:

    - values: 00110000, mixed: 01001000, 2 pointers to sub-nodes  
      - values: 11000111, mixed: 00001000, 1 pointer to sub-node
        - values: 00001000, mixed: 00000000
      - values: 11111000, mixed: 00000100, 1 pointer to sub-node
        - values: 11111000, mixed: 00000000
    

    At first it may seem that this data structure uses a lot of storage space compared to solutions which store the intervals as (start,end) pairs. However, let's look at the (theoretical) worst-case scenario, where every even number is present and every odd number is absent, across the whole 64-bit range:

    Total range: 0 ~ 18,446,744,073,709,551,615  
    Intervals:        9,223,372,036,854,775,808  
    

    A data structure which stores these as (start,end) pairs would use:

    Nodes:            9,223,372,036,854,775,808  
    Size of node:                            16 bytes  
    TOTAL:          147,573,952,589,676,412,928 bytes
    

    If the data structure uses nodes which are linked via (64-bit) pointers, that would add:

    Data:           147,573,952,589,676,412,928 bytes
    Pointers:        73,786,976,294,838,206,456 bytes
    TOTAL:          221,360,928,884,514,619,384 bytes
    

    An N-ary tree with branching factor 64 (and 16 for the last level, to get a total range of 10×6 + 1×4 = 64 bits) would use:

    Nodes (branch):         285,942,833,483,841  
    Size of branch:                         528 bytes  
    Nodes (leaf):        18,014,398,509,481,984  
    Size of leaf:                           144 bytes  
    TOTAL:            2,745,051,201,444,873,744 bytes  
    

    which is 53.76 times less than (start,end) pair structures (or 80.64 times less including pointers).

    The calculation was done with the following N-ary tree:

    Branch (9 levels):
    value: 64-bit map
    mixed: 64-bit map
    sub-nodes: 64 pointers
    TOTAL: 528 bytes
    
    Leaf:
    value: 64-bit map
    mixed: 64-bit map
    sub-nodes: 64 16-bit maps (more efficient than pointing to sub-node)
    TOTAL: 144 bytes
    

    This is of course a worst-case comparison; the average case would depend very much on the specific input.


    Here's a first code example I wrote to test the idea. The nodes have branching factor 16, so that every level stores 4 bits of the integers, and common bit depths can be obtained without different leaves and branches. As an example, a tree of depth 3 is created, representing a range of 4×4 = 16 bits.

    function setBits(pattern, value, mask) {                 // helper function (make inline)
        return (pattern & ~mask) | (value ? mask : 0);
    }
    
    function Node(value) { // CONSTRUCTOR
        this.value = value ? 65535 : 0;                      // set all values to 1 or 0
        this.mixed = 0;                                      // set all to non-mixed
        this.sub = null;                                     // no pointer array yet
    }
    
    Node.prototype.prepareSub = function(pos, mask, value) {
        if ((this.mixed & mask) == 0) {                      // not mixed, child possibly outdated
            var prev = (this.value & mask) >> pos;
            if (value == prev) return false;                 // child doesn't require setting
            if (!this.sub) this.sub = [];                    // create array of pointers
            if (this.sub[pos]) {
                this.sub[pos].value = prev ? 65535 : 0;      // update child node values
                this.sub[pos].mixed = 0;
            }
            else this.sub[pos] = new Node(prev);             // create new child node
        }
        return true;                                         // child requires setting
    }
    
    Node.prototype.set = function(value, a, b, step) {
        var posA = Math.floor(a / step), posB = Math.floor(b / step);
        var maskA = 1 << posA, maskB = 1 << posB;
        a %= step; b %= step;
    
        if (step == 1) {                                     // node is leaf
            var vMask = (maskB | (maskB - 1)) ^ (maskA - 1); // bits posA to posB inclusive
            this.value = setBits(this.value, value, vMask);
        }
        else if (posA == posB) {                             // only 1 sub-range to be set
            if (a == 0 && b == step - 1)                     // a-b is full sub-range
            {
                this.value = setBits(this.value, value, maskA);
                this.mixed = setBits(this.mixed, 0, maskA);
            }
            else if (this.prepareSub(posA, maskA, value)) {  // child node requires setting
                var solid = this.sub[posA].set(value, a, b, step >> 4);     // recurse
                this.value = setBits(this.value, solid ? value : 0, maskA); // set value
                this.mixed = setBits(this.mixed, solid ? 0 : 1, maskA);     // set mixed
            }
        }
        else {                                               // multiple sub-ranges to set
            var vMask = (maskB - 1) ^ (maskA | (maskA - 1)); // bits between posA and posB
            this.value = setBits(this.value, value, vMask);  // set inbetween values
            this.mixed &= ~vMask;                            // set inbetween to solid
            var solidA = true, solidB = true;
            if (a != 0 && this.prepareSub(posA, maskA, value)) {          // child needs setting
                solidA = this.sub[posA].set(value, a, step - 1, step >> 4);
            }
            if (b != step - 1 && this.prepareSub(posB, maskB, value)) {   // child needs setting
                solidB = this.sub[posB].set(value, 0, b, step >> 4);
            }
            this.value = setBits(this.value, solidA ? value : 0, maskA);  // set value
            this.value = setBits(this.value, solidB ? value : 0, maskB);
            if (solidA) this.mixed &= ~maskA; else this.mixed |= maskA;   // set mixed
            if (solidB) this.mixed &= ~maskB; else this.mixed |= maskB;
        }
        return this.mixed == 0 && this.value == 0 || this.value == 65535; // solid or mixed
    }
    
    Node.prototype.check = function(a, b, step) {
        var posA = Math.floor(a / step), posB = Math.floor(b / step);
        var maskA = 1 << posA, maskB = 1 << posB;
        a %= step; b %= step;
        var vMask = (maskB - 1) ^ (maskA | (maskA - 1));     // bits between posA and posB
    
        if (step == 1) {
            vMask = posA == posB ? maskA : vMask | maskA | maskB;
            return (this.value & vMask) == vMask;
        }
        if (posA == posB) {
            var present = (this.mixed & maskA) ? this.sub[posA].check(a, b, step >> 4) : this.value & maskA;
            return !!present;
        }
        var present = (this.mixed & maskA) ? this.sub[posA].check(a, step - 1, step >> 4) : this.value & maskA;
        if (!present) return false;
        present = (this.mixed & maskB) ? this.sub[posB].check(0, b, step >> 4) : this.value & maskB;
        if (!present) return false;
        return (this.value & vMask) == vMask;
    }
    
    function NaryTree(factor, depth) { // CONSTRUCTOR
        this.root = new Node();
        this.step = Math.pow(factor, depth);
    }
    
    NaryTree.prototype.addRange = function(a, b) {
        this.root.set(1, a, b, this.step);
    }
    
    NaryTree.prototype.delRange = function(a, b) {
        this.root.set(0, a, b, this.step);
    }
    
    NaryTree.prototype.hasRange = function(a, b) {
        return this.root.check(a, b, this.step);
    }
    
    var intervals = new NaryTree(16, 3);                  // create tree for 16-bit range
    
    // CREATE RANDOM DATA AND RUN TEST
    document.write("Created N-ary tree for 16-bit range.
    Randomly adding/deleting 100000 intervals..."); for (var test = 0; test < 100000; test++) { var a = Math.floor(Math.random() * 61440); var b = a + Math.floor(Math.random() * 4096); if (Math.random() > 0.5) intervals.addRange(a,b); else intervals.delRange(a,b); } document.write("
    Checking a few random intervals:
    "); for (var test = 0; test < 8; test++) { var a = Math.floor(Math.random() * 65280); var b = a + Math.floor(Math.random() * 256); document.write("Tree has interval " + a + "-" + b + " ? " + intervals.hasRange(a,b),".
    "); }


    I ran a test to check how many nodes are being created, and how many of these are active or dormant. I used a total range of 24-bit (so that I could test the worst-case without running out of memory), divided into 6 levels of 4 bits (so each node has 16 sub-ranges); the number of nodes that need to be checked or updated when adding, deleting or checking an interval is 11 or less. The maximum number of nodes in this scheme is 1,118,481.

    The graph below shows the number of active nodes when you keep adding/deleting random intervals with range 1 (single integers), 1~16, 1~256 ... 1~16M (the full range).

    Adding and deleting single integers (dark green line) creates active nodes up to close to the maximum 1,118,481 nodes, with almost no nodes being made dormant. The maximum is reached after adding and deleting around 16M integers (= the number of integers in the range).

    If you add and delete random intervals in a larger range, the number of nodes that are created is roughly the same, but more of them are being made dormant. If you add random intervals in the full 1~16M range (bright yellow line), less than 64 nodes are active at any time, no matter how many intervals you keep adding or deleting.

    This already gives an idea of where this data structure could be useful as opposed to others: the more nodes are being made dormant, the more intervals/nodes would need to be deleted in other schemes.

    On the other hand, it shows how this data structure may be too space-inefficient for certain ranges, and types and amounts of input. You could introduce a dormant node recycling system, but that takes away the advantage of the dormant nodes being immediately reusable.


    A lot of space in the N-ary tree is taken up by pointers to child nodes. If the complete range is small enough, you could store the tree in an array. For a 32-bit range that would take 580 MB (546 MB for the "value" bitmaps and 34 MB for the "mixed" bitmaps). This is more space-efficient because you only store the bitmaps, and you don't need pointers to child nodes, because everything has a fixed place in the array. You'd have the advantage of a tree with depth 7, so any operation could be done by checking 15 "nodes" or fewer, and no nodes need to be created or deleted during add/delete/check operations.

    Here's a code example I used to try out the N-ary-tree-in-an-array idea; it uses 580MB to store a N-ary tree with branching factor 16 and depth 7, for a 32-bit range (unfortunately, a range above 40 bits or so is probably beyond the memory capabilities of any normal computer). In addition to the requested functions, it can also check whether an interval is completely absent, using notValue() and notRange().

    #include 
    
    inline uint16_t setBits(uint16_t pattern, uint16_t mask, uint16_t value) {
        return (pattern & ~mask) | (value & mask);
    }
    
    class NaryTree32 {
        uint16_t value[0x11111111], mixed[0x01111111];
    
        bool set(uint32_t a, uint32_t b, uint16_t value = 0xFFFF, uint8_t bits = 28, uint32_t offset = 0) {
            uint8_t posA = a >> bits, posB = b >> bits;
            uint16_t maskA = 1 << posA, maskB = 1 << posB;
            uint16_t mask = maskB ^ (maskA - 1) ^ (maskB - 1);
    
            // IF NODE IS LEAF: SET VALUE BITS AND RETURN WHETHER VALUES ARE MIXED
            if (bits == 0) {
                this->value[offset] = setBits(this->value[offset], mask, value);
                return this->value[offset] != 0 && this->value[offset] != 0xFFFF;
            }
    
            uint32_t offsetA = offset * 16 + posA + 1, offsetB = offset * 16 + posB + 1;
            uint32_t subMask = ~(0xFFFFFFFF << bits);
            a &= subMask; b &= subMask;
    
            // IF SUB-RANGE A IS MIXED OR HAS WRONG VALUE
            if (((this->mixed[offset] & maskA) != 0 || (this->value[offset] & maskA) != (value & maskA))
            && (a != 0 || posA == posB && b != subMask)) {
    
                // IF SUB-RANGE WAS PREVIOUSLY SOLID: UPDATE TO PREVIOUS VALUE
                if ((this->mixed[offset] & maskA) == 0) {
                    this->value[offsetA] = (this->value[offset] & maskA) ? 0xFFFF : 0x0000;
                    if (bits != 4) this->mixed[offsetA] = 0x0000;
                }
                // RECURSE AND IF SUB-NODE IS MIXED: SET MIXED BIT AND REMOVE A FROM MASK
                if (this->set(a, posA == posB ? b : subMask, value, bits - 4, offsetA)) {
                    this->mixed[offset] |= maskA;
                    mask ^= maskA;
                }
            }
            // IF SUB-RANGE B IS MIXED OR HAS WRONG VALUE
            if (((this->mixed[offset] & maskB) != 0 || (this->value[offset] & maskB) != (value & maskB))
            && b != subMask && posA != posB) {
    
                // IF SUB-RANGE WAS PREVIOUSLY SOLID: UPDATE SUB-NODE TO PREVIOUS VALUE
                if ((this->mixed[offset] & maskB) == 0) {
                    this->value[offsetB] = (this->value[offset] & maskB) ? 0xFFFF : 0x0000;
                    if (bits > 4) this->mixed[offsetB] = 0x0000;
                }
                // RECURSE AND IF SUB-NODE IS MIXED: SET MIXED BIT AND REMOVE A FROM MASK
                if (this->set(0, b, value, bits - 4, offsetB)) {
                    this->mixed[offset] |= maskB;
                    mask ^= maskB;
                }
            }
            // SET VALUE AND MIXED BITS THAT HAVEN'T BEEN SET YET AND RETURN WHETHER NODE IS MIXED
            if (mask) {
                this->value[offset] = setBits(this->value[offset], mask, value);
                this->mixed[offset] &= ~mask;
            }
            return this->mixed[offset] != 0 || this->value[offset] != 0 && this->value[offset] != 0xFFFF;
        }
    
        bool check(uint32_t a, uint32_t b, uint16_t value = 0xFFFF, uint8_t bits = 28, uint32_t offset = 0) {
            uint8_t posA = a >> bits, posB = b >> bits;
            uint16_t maskA = 1 << posA, maskB = 1 << posB;
            uint16_t mask = maskB ^ (maskA - 1) ^ (maskB - 1);
    
            // IF NODE IS LEAF: CHECK BITS A TO B INCLUSIVE AND RETURN
            if (bits == 0) {
                return (this->value[offset] & mask) == (value & mask);
            }
    
            uint32_t subMask = ~(0xFFFFFFFF << bits);
            a &= subMask; b &= subMask;
    
            // IF SUB-RANGE A IS MIXED AND PART OF IT NEEDS CHECKING: RECURSE AND RETURN IF FALSE
            if ((this->mixed[offset] & maskA) && (a != 0 || posA == posB && b != subMask)) {
                if (this->check(a, posA == posB ? b : subMask, value, bits - 4, offset * 16 + posA + 1)) {
                    mask ^= maskA;
                }
                else return false;
            }
            // IF SUB-RANGE B IS MIXED AND PART OF IT NEEDS CHECKING: RECURSE AND RETURN IF FALSE
            if (posA != posB && (this->mixed[offset] & maskB) && b != subMask) {
                if (this->check(0, b, value, bits - 4, offset * 16 + posB + 1)) {
                    mask ^= maskB;
                }
                else return false;
            }
            // CHECK INBETWEEN BITS (AND A AND/OR B IF NOT YET CHECKED) WHETHER SOLID AND CORRECT
            return (this->mixed[offset] & mask) == 0 && (this->value[offset] & mask) == (value & mask);
        }
    
        public:
    
        NaryTree32() { // CONSTRUCTOR: INITIALISES ROOT NODE
            this->value[0] = 0x0000;
            this->mixed[0] = 0x0000;
        }
    
        void addValue(uint32_t a)             {this->set(a, a);}
        void addRange(uint32_t a, uint32_t b) {this->set(a, b);}
        void delValue(uint32_t a)             {this->set(a, a, 0);}
        void delRange(uint32_t a, uint32_t b) {this->set(a, b, 0);}
        bool hasValue(uint32_t a)             {return this->check(a, a);}
        bool hasRange(uint32_t a, uint32_t b) {return this->check(a, b);}
        bool notValue(uint32_t a)             {return this->check(a, a, 0);}
        bool notRange(uint32_t a, uint32_t b) {return this->check(a, b, 0);}
    };
    
    int main() {
        NaryTree32 *tree = new NaryTree32();
    
        tree->addRange(4294967280, 4294967295);
        std::cout << tree->hasRange(4294967280, 4294967295) << "\n";
        tree->delValue(4294967290);
        std::cout << tree->hasRange(4294967280, 4294967295) << "\n";
        tree->addRange(1000000000, 4294967295);
        std::cout << tree->hasRange(4294967280, 4294967295) << "\n";
        tree->delRange(2000000000, 4294967280);
        std::cout << tree->hasRange(4294967280, 4294967295) << "\n";
    
        return 0;
    }
    

提交回复
热议问题