Opposite of Bloom filter?

问题

I'm trying to optimize a piece of software which is basically running millions of tests. These tests are generated in such a way that there can be some repetitions. Of course, I don't want to spend time running tests which I already ran if I can avoid it efficiently.

So, I'm thinking about using a Bloom filter to store the tests which have been already ran. However, the Bloom filter errs on the unsafe side for me. It gives false positives. That is, it may report that I've ran a test which I haven't. Although this could be acceptable in the scenario I'm working on, I was wondering if there's an equivalent to a Bloom filter, but erring on the opposite side, that is, only giving false negatives.

I've skimmed through the literature without any luck.

回答1:

Yes, a lossy hash table or a LRUCache is a data structure with fast O(1) lookup that will only give false negatives -- if you ask if "Have I run test X", it will tell you either "Yes, you definitely have", or "I can't remember".

Forgive the extremely crude pseudocode:

setup_test_table():
    create test_table( some large number of entries )
    clear each entry( test_table, NEVER )
    return test_table

has_test_been_run_before( new_test_details, test_table ):
    index = hash( test_details , test_table.length )
    old_details = test_table[index].detail
    // unconditionally overwrite old details with new details, LRU fashion.
    // perhaps some other collision resolution technique might be better.
    test_table[index].details = new_test_details
    if ( old_details === test_details ) return YES
    else if ( old_details === NEVER ) return NEVER
    else return PERHAPS    

main()
    test_table = setup_test_table();
    loop
        test_details = generate_random_test()
        status = has_test_been_run_before( test_details, test_table )
        case status of
           YES: do nothing;
           NEVER: run test (test_details);
           PERHAPS: if( rand()&1 ) run test (test_details);
    next loop
end.

回答2:

The exact data structure that accomplishes this task is a Direct-mapped cache, and is commonly used in CPUs.

function set_member(set, item)
    set[hash(item) % set.length] = item

function is_member(set, item)
    return set[hash(item) % set.length] == item

回答3:

Is it possible to store the tests that you did not run? This should inverse the filter's behavior.

回答4:

Use a bit set, as mentioned above. If you know the no. of tests you are going to run beforehand, you will always get correct results (present, not-present) from the data structure.
Do you know what keys you will be hashing? If so, you should run an experiment to see the distribution of the keys in the BloomFilter so you can fine tune it to reproduce false positives, or what have you.
You might want to checkout HyperLogLog as well.

回答5:

No and if you think about it, it wouldn't be very useful. In your case you couldn't be sure that your test run would ever stop, because if there are always 'false negatives' there will always be tests that need to be run...

I would say you just have to use a hash.

回答6:

How about an LRUCache?

回答7:

I'm sorry I'm not much help - I don't think its possible. If test execution can't be ordered maybe use a packed format (8 tests per byte!) or a good sparse array library for storing the outcomes in memory.

回答8:

I think you're leaving out part of the solution; to avoid false positives entirely you will still have to track which have run, and essentially use the bloom filter as a shortcut to determine the a test definitely has not been run.

That said, since you know the number of tests in advance, you can size the filter in such a way as to provide an acceptable error rate using some well-known formulae; for a 1% probability of returning a false positive you need ~9.5 bits/entry, so for one million entries 1.2 megabytes is sufficient. If you reduce the acceptable error rate to 0.1%, this only increases to 1.8 MB.

The Wikipedia article Bloom Filters gives a great analysis, and an interesting overview of the maths involved.

回答9:

The data structure you expect does not exist. Because such data structure must be a many-to-one mapping, or say, a limited state set. There must be at least two different inputs mapping to the same internal state. So you can't tell whether both (or more) of them are in the set, only knowing at least one of such input exists.

EDIT This statement is true only when you are looking for a memory efficient data structure. If memory is unlimited, you can always get a data structure to give 100% accurate results, by storing every member item.

来源：https://stackoverflow.com/questions/635728/opposite-of-bloom-filter

标签

data-structures

bloom-filter