Opposite of Bloom filter?

Posted by 有些话、适合烂在心里 on 2019-11-29 19:57:07

Yes, a lossy hash table or an LRUCache is a data structure with fast O(1) lookup that only ever gives false negatives: if you ask "Have I run test X?", it will tell you either "Yes, you definitely have" or "I can't remember".

Forgive the extremely crude pseudocode:

setup_test_table():
    create test_table( some large number of entries )
    clear each entry( test_table, NEVER )
    return test_table

has_test_been_run_before( new_test_details, test_table ):
    index = hash( new_test_details , test_table.length )
    old_details = test_table[index].details
    // unconditionally overwrite old details with new details, LRU fashion.
    // perhaps some other collision resolution technique might be better.
    test_table[index].details = new_test_details
    if ( old_details === new_test_details ) return YES
    else if ( old_details === NEVER ) return NEVER
    else return PERHAPS

main()
    test_table = setup_test_table();
    loop
        test_details = generate_random_test()
        status = has_test_been_run_before( test_details, test_table )
        case status of
           YES: do nothing;
           NEVER: run test (test_details);
           PERHAPS: if( rand()&1 ) run test (test_details);
    next loop
end.
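
For readers who want something runnable, here is a minimal Python sketch of the same lossy table. The class name `LossyTestTable` and the method `check_and_record` are made up for illustration, and it assumes test details are hashable values (small integers are used here so the slot collisions are predictable).

```python
NEVER = object()  # sentinel: this slot has never been written


class LossyTestTable:
    """Fixed-size table that may forget entries but never invents them."""

    def __init__(self, size):
        self.slots = [NEVER] * size

    def check_and_record(self, test_details):
        """Return 'YES', 'NEVER' or 'PERHAPS', then remember test_details."""
        index = hash(test_details) % len(self.slots)
        old_details = self.slots[index]
        # Unconditionally overwrite the slot, as in the pseudocode above.
        self.slots[index] = test_details
        if old_details is NEVER:
            return "NEVER"    # empty slot: nothing hashing here was ever seen
        if old_details == test_details:
            return "YES"      # exact match: definitely seen before
        return "PERHAPS"      # slot held a different test: cannot tell


table = LossyTestTable(8)
first = table.check_and_record(3)    # slot 3 is empty
second = table.check_and_record(3)   # same test again
third = table.check_and_record(11)   # 11 % 8 == 3, so it evicts test 3
fourth = table.check_and_record(3)   # test 3 was forgotten: a false negative
```

The last call shows the trade-off: the table never claims a test ran when it did not, but a collision can make it forget one that did.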

The exact data structure that accomplishes this task is a direct-mapped cache, which is commonly used in CPUs.

function set_member(set, item)
    set[hash(item) % set.length] = item

function is_member(set, item)
    return set[hash(item) % set.length] == item

Is it possible to store the tests that you did not run? This should invert the filter's behavior.

  1. Use a bit set, as mentioned above. If you know the number of tests you are going to run beforehand, you will always get correct results (present, not-present) from the data structure.
  2. Do you know what keys you will be hashing? If so, you should run an experiment to see the distribution of the keys in the BloomFilter so you can fine-tune it to reduce false positives, or what have you.
  3. You might want to check out HyperLogLog as well.

No, and if you think about it, it wouldn't be very useful. In your case you couldn't be sure that your test run would ever stop: if there are always false negatives, there will always be tests that appear to need running...

I would say you just have to use a hash.

How about an LRUCache?
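
One way to read this suggestion is a bounded "seen" set that evicts the least recently used entry when full. The sketch below uses `collections.OrderedDict` from the standard library; the class name `LRUSeenSet` is hypothetical. Like the lossy hash table, it can forget a test (false negative) but never reports one it was not given.

```python
from collections import OrderedDict


class LRUSeenSet:
    """Remembers at most `capacity` items, dropping the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def add(self, item):
        self.items[item] = True
        self.items.move_to_end(item)          # mark as most recently used
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)    # evict least recently used

    def __contains__(self, item):
        if item in self.items:
            self.items.move_to_end(item)      # a lookup also counts as a use
            return True
        return False


seen = LRUSeenSet(capacity=2)
seen.add("test-a")
seen.add("test-b")
a_present = "test-a" in seen   # refreshes test-a
seen.add("test-c")             # evicts test-b, the least recently used
```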

I'm sorry I'm not much help - I don't think it's possible. If test execution can't be ordered, maybe use a packed format (8 tests per byte!) or a good sparse array library for storing the outcomes in memory.
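
A rough sketch of the packed format, assuming each test can be given an integer id from 0 to n-1 (that numbering is an assumption, not something the question guarantees). The class name `PackedTestFlags` is hypothetical. One bit per test means one million tests fit in 125,000 bytes, with exact answers.

```python
class PackedTestFlags:
    """One bit per test id, eight tests per byte."""

    def __init__(self, num_tests):
        self.num_tests = num_tests
        self.bits = bytearray((num_tests + 7) // 8)  # round up to whole bytes

    def mark_run(self, test_id):
        # Byte index is test_id // 8, bit position is test_id % 8.
        self.bits[test_id >> 3] |= 1 << (test_id & 7)

    def has_run(self, test_id):
        return bool(self.bits[test_id >> 3] & (1 << (test_id & 7)))


flags = PackedTestFlags(1_000_000)
flags.mark_run(42)
```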

I think you're leaving out part of the solution; to avoid false positives entirely you will still have to track which tests have run, and essentially use the Bloom filter as a shortcut to determine that a test definitely has not been run.

That said, since you know the number of tests in advance, you can size the filter in such a way as to provide an acceptable error rate using some well-known formulae; for a 1% probability of returning a false positive you need ~9.6 bits/entry, so for one million entries 1.2 megabytes is sufficient. If you reduce the acceptable error rate to 0.1%, this only increases to 1.8 MB.
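
These figures can be checked with the standard sizing formula for an optimally tuned Bloom filter, m/n = -ln(p) / (ln 2)^2 bits per entry, where p is the target false-positive rate. The helper names below are made up for illustration, and megabytes are taken as 10^6 bytes.

```python
import math


def bloom_bits_per_entry(false_positive_rate):
    """Bits per entry for an optimally sized Bloom filter."""
    return -math.log(false_positive_rate) / (math.log(2) ** 2)


def bloom_size_megabytes(num_entries, false_positive_rate):
    """Total filter size in megabytes (10**6 bytes)."""
    total_bits = num_entries * bloom_bits_per_entry(false_positive_rate)
    return total_bits / 8 / 1_000_000


size_1_percent = bloom_size_megabytes(1_000_000, 0.01)     # about 1.2 MB
size_01_percent = bloom_size_megabytes(1_000_000, 0.001)   # about 1.8 MB
```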

The Wikipedia article on Bloom filters gives a great analysis, and an interesting overview of the maths involved.

The data structure you are looking for does not exist, because any such data structure must be a many-to-one mapping, that is, it has a limited set of internal states. There must be at least two different inputs that map to the same internal state, so you can't tell whether both (or more) of them are in the set; you only know that at least one such input is.

EDIT: This statement is true only when you are looking for a memory-efficient data structure. If memory is unlimited, you can always get a data structure that gives 100% accurate results by storing every member item.
