Data structure to build and look up a set of integer ranges

你的背包 2020-12-19 09:58

I have a set of uint32 integers; there may be millions of items in the set. 50-70% of them are consecutive, but in the input stream they appear in unpredictable ord

7 answers
  •  予麋鹿 (original poster)
    2020-12-19 10:25

    Regarding the second issue:

    You could look into Bloom filters. A Bloom filter is specifically designed to answer the membership question in O(1), though the answer is either "no" or "maybe" (which is not as clear-cut as a yes/no :p).

    In the "maybe" case, of course, you need further processing to actually answer the question (unless a probabilistic answer is sufficient in your case), but even so the Bloom filter can act as a gatekeeper and reject most of the queries outright.
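
To make the "no or maybe" behaviour concrete, here is a minimal Bloom filter sketch. The class name, the method names, and the choice of BLAKE2b with double hashing are my own assumptions for illustration; the sizes are not tuned for millions of items.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over uint32 values (illustrative, not tuned)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: int):
        # Derive the bit positions by double hashing two halves of one digest.
        digest = hashlib.blake2b(value.to_bytes(4, "little"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # forced odd, so never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value: int) -> None:
        for pos in self._positions(value):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, value: int) -> bool:
        # False means "definitely absent"; True only means "maybe present".
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(value))
```

A value that was added always yields `True`; a value that was never added usually yields `False`, but can occasionally yield `True`, which is why the later stages are still needed.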

    Also, you might want to keep actual ranges and degenerate ranges (single elements) in different structures.

    • single elements may be best stored in a hash-table
    • actual ranges can be stored in a sorted array

    This diminishes the number of elements stored in the sorted array, and thus the cost of the binary search performed there. Since you state that many ranges are degenerate, I take it that you only have some 500-1000 ranges (i.e., an order of magnitude less), and log2(1000) ~ 10.
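
A rough sketch of that split, assuming the whole input can be collected and sorted before the structure is built (the question's stream arrives unordered, so this is a batch build, not an incremental one). The function names are hypothetical.

```python
from bisect import bisect_right


def split_ranges_and_singles(values):
    """Coalesce integers into inclusive (start, end) ranges plus a set of singles."""
    ranges, singles = [], set()
    ordered = sorted(set(values))
    i = 0
    while i < len(ordered):
        j = i
        # Extend the run while the next value is consecutive.
        while j + 1 < len(ordered) and ordered[j + 1] == ordered[j] + 1:
            j += 1
        if j > i:
            ranges.append((ordered[i], ordered[j]))  # real range
        else:
            singles.add(ordered[i])                  # degenerate range
        i = j + 1
    return ranges, singles


def in_ranges(starts, ends, value):
    """Binary search on the sorted range starts, then check the matching end."""
    idx = bisect_right(starts, value) - 1
    return idx >= 0 and value <= ends[idx]


ranges, singles = split_ranges_and_singles([10, 3, 4, 5, 42, 11, 100])
starts = [s for s, _ in ranges]
ends = [e for _, e in ranges]
```

With this input, `ranges` is `[(3, 5), (10, 11)]` and `singles` is `{42, 100}`, so the binary search only ever sees two entries instead of seven.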

    I would therefore suggest the following steps:

    • Bloom Filter: if no, stop
    • Sorted Array of real ranges: if yes, stop
    • Hash Table of single elements
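
The three steps above could be wired together roughly like this. It is a sketch under assumptions: `RangeSetLookup` and `make_one_hash_filter` are hypothetical names, and the gatekeeper here is a deliberately crude one-hash bitmap standing in for a properly sized Bloom filter.

```python
from bisect import bisect_right


def make_one_hash_filter(ranges, singles, num_bits=1 << 16):
    """Stand-in gatekeeper: a Bloom filter with a single multiplicative hash."""
    bits = bytearray(num_bits // 8)

    def position(value):
        return (value * 2654435761) % num_bits

    def mark(value):
        pos = position(value)
        bits[pos >> 3] |= 1 << (pos & 7)

    for start, end in ranges:
        for value in range(start, end + 1):
            mark(value)
    for value in singles:
        mark(value)

    def maybe_present(value):
        pos = position(value)
        return bool(bits[pos >> 3] & (1 << (pos & 7)))

    return maybe_present


class RangeSetLookup:
    """Membership test: filter first, then sorted ranges, then hashed singles."""

    def __init__(self, ranges, singles, maybe_present):
        ordered = sorted(ranges)
        self.starts = [s for s, _ in ordered]
        self.ends = [e for _, e in ordered]
        self.singles = set(singles)
        self.maybe_present = maybe_present

    def __contains__(self, value):
        # Step 1: the filter says "no" -> definitely absent, stop.
        if not self.maybe_present(value):
            return False
        # Step 2: binary search over the real (inclusive) ranges.
        idx = bisect_right(self.starts, value) - 1
        if idx >= 0 and value <= self.ends[idx]:
            return True
        # Step 3: fall back to the hash table of single elements.
        return value in self.singles


ranges = [(3, 5), (10, 11)]
singles = {42, 100}
lookup = RangeSetLookup(ranges, singles, make_one_hash_filter(ranges, singles))
```

Here `4 in lookup` and `42 in lookup` are true, while `7 in lookup` is false whether or not the filter happens to report a false positive for 7, because the later stages settle the answer.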

    The Sorted Array test is performed before the Hash Table test because, from the numbers you give (millions of numbers coalesced into a few thousand ranges), if a number is contained, chances are it'll be in a range rather than being single :)

    One last note: beware of O(1). While it may seem appealing, you are not in an asymptotic case here. 5000-10000 ranges is not many, as log2(10000) is only about 13. So don't pessimize your implementation by choosing an O(1) solution with such a high constant factor that it actually runs slower than an O(log N) solution :)
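
To put numbers on that, a quick check of the worst-case probe count for a binary search over n sorted entries, using the standard bound ceil(log2(n + 1)). The helper name is made up for this example.

```python
import math


def max_binary_search_steps(n: int) -> int:
    """Worst-case number of probes for a binary search over n sorted entries."""
    return math.ceil(math.log2(n + 1))


steps = {n: max_binary_search_steps(n) for n in (1000, 5000, 10000)}
```

That gives 10 probes for 1000 ranges, 13 for 5000, and 14 for 10000, so an O(1) alternative has to beat roughly a dozen comparisons to be worth it.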
