bloom-filter

An algorithm to find the difference of two sets A and B of size n

流过昼夜 submitted on 2020-08-08 18:20:26
Question: There are two sets, A and B, each of size n. How can I find every element of A that is not in B (A - B) in O(n) time? What data structure should I use (a Bloom filter?)
Answer 1: Given that both are sets, you should use a set / hash set. This lets you compute the contains / in operation in O(1). Bloom filters aren't good for this type of problem: they tell you if an element definitely isn't in a set of objects, but there is still a chance of false positives. You're better off using a
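The hash-set approach from the answer can be sketched in a few lines of Python. This is a minimal illustration, not the answerer's code: the function name `difference` is made up for the example, and the O(n) bound is an expected-time bound that assumes well-behaved hashing.

```python
def difference(a, b):
    """Return the elements of a that are not in b, keeping a's order."""
    b_lookup = set(b)  # building the hash set takes O(n) expected time
    # Each membership test is O(1) expected, so the whole pass is O(n).
    return [x for x in a if x not in b_lookup]


a = [1, 2, 3, 4, 5]
b = [4, 5, 6, 7, 8]
print(difference(a, b))  # [1, 2, 3]
```

If both inputs are already Python `set` objects, the built-in expression `a - b` does the same job.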

Bloom Filter Implementation

房东的猫 submitted on 2020-01-01 05:30:54
Question: Using a Bloom filter, we get space optimization. The Cassandra framework also has an implementation of a Bloom filter. But in detail, how is this space optimization achieved?
Answer 1: A Bloom filter isn't a "framework". It's really more like simply an algorithm. The implementation isn't very long. Here's one in Java I've tried (.jar, source code, and JavaDoc all being available): "Stand alone Java implementations of Cuckoo Hashing and Bloom Filters" (you may want to Google for this in
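To make the space argument concrete, here is a minimal Bloom filter sketch in Python. It is not the Java implementation the answer refers to, nor Cassandra's; the class name, the parameters, and the choice of salted SHA-256 as the hash family are all assumptions made for illustration. The point is that the filter stores only a fixed bit array, never the keys themselves.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a fixed bit array plus k salted hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # one bit per slot

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with the hash index.
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(num_bits=10_000, num_hashes=7)
for i in range(1000):
    bf.add(f"key-{i}")

print("key-42" in bf)  # True: added keys are never reported absent
print(len(bf.bits))    # 1250 bytes, regardless of how long the keys are
```

With these numbers the filter spends about 10 bits per stored key, whereas a hash set would have to keep every key in full. The price is a small false positive rate.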

Opposite of Bloom filter?

旧街凉风 submitted on 2019-12-29 10:17:12
Question: I'm trying to optimize a piece of software which is basically running millions of tests. These tests are generated in such a way that there can be some repetitions. Of course, I don't want to spend time running tests which I have already run if I can avoid it efficiently. So, I'm thinking about using a Bloom filter to store the tests which have already been run. However, the Bloom filter errs on the unsafe side for me. It gives false positives. That is, it may report that I've run a test which I

Bloom Filter: evaluating false positive rate

不想你离开。 submitted on 2019-12-23 05:23:35
Question: Given a fixed number of bits (i.e., slots) (m) and a fixed number of hash functions (k), how does one compute the theoretical false positive rate (p)? According to Wikipedia (http://en.wikipedia.org/wiki/Bloom_filter), for a false positive rate (p) and a number of items (n), the number of bits (m) needed is given by m = -n * ln(p) / (ln(2)^2) and the optimal number of hash functions (k) is given by k = m / n * ln(2). From the formula given on the Wikipedia page, I guess I could evaluate the theoretical false
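For reference, the two sizing formulas quoted in the question, together with the standard approximation p ≈ (1 - e^(-k*n/m))^k for the false positive rate, can be evaluated directly. The sketch below is only an illustration: the function names are made up, and the approximation assumes independent, uniformly distributed hash functions. Note that p depends on the number of inserted items n as well as on m and k.

```python
import math


def false_positive_rate(m, k, n):
    """Approximate false positive rate for m bits, k hashes, n items."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_num_bits(n, p):
    """Bits needed for n items at target rate p: m = -n ln(p) / ln(2)^2."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def optimal_num_hashes(m, n):
    """Optimal number of hash functions: k = (m / n) ln(2), rounded."""
    return max(1, round(m / n * math.log(2)))


n = 1000
m = optimal_num_bits(n, 0.01)
k = optimal_num_hashes(m, n)
print(m)  # 9586
print(k)  # 7
print(false_positive_rate(m, k, n))
```

Plugging the derived m and k back into the approximation gives a rate very close to the 1% target, which is a quick sanity check on the formulas.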

Bloomfilter and Cassandra = Why used and why hashed several times?

末鹿安然 submitted on 2019-12-21 11:36:53
Question: I read this: http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html My questions: 1.) Is it correct that Cassandra only uses the Bloom filter to find out which SST (Sorted String Table) most likely contains the key? As there might be several SSTs, and Cassandra does not know which SST a key might be in? So Bloom filters are used to speed up looking through all the SSTs. Is this correct? (I am trying to understand how Cassandra works...) 2.) Why are (as explained in the link
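The read path the question describes, one Bloom filter per SSTable consulted before the table itself is touched, can be modelled with a toy sketch. Nothing here is Cassandra code: `TinyBloom`, `ToySSTable`, and `lookup` are hypothetical names, the "SSTable" is just an in-memory dict, and the `reads` counter stands in for an expensive disk access.

```python
import hashlib


class TinyBloom:
    """Very small Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class ToySSTable:
    """Stand-in for an on-disk table; `reads` counts simulated disk hits."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.filter = TinyBloom()
        for key in self.rows:
            self.filter.add(key)
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self.rows.get(key)


def lookup(sstables, key):
    """Check each table's filter first and skip tables that say 'absent'."""
    for table in sstables:
        if not table.filter.might_contain(key):
            continue  # definitely not in this table: no read needed
        value = table.read(key)
        if value is not None:
            return value
    return None


tables = [ToySSTable({"a": 1}), ToySSTable({"b": 2}), ToySSTable({"c": 3})]
print(lookup(tables, "c"))
print([t.reads for t in tables])
```

Using several hash functions per key means a lookup only yields a false positive when every one of the probed bits happens to be set, which is the usual motivation for hashing more than once.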

Is there any probabilistic data structure that gives false negatives but not false positives?

这一生的挚爱 submitted on 2019-12-19 08:54:50
Question: I need a space-efficient probabilistic data structure to store values that I have already computed. For me computation is cheap but space is not, so if this data structure returns a false negative, I am okay with redoing some work every once in a while, but false positives are unacceptable. So what I am looking for is sort of the opposite of a Bloom filter.
Answer 1: For false negatives you can use a lossy hash table or an LRUCache. It is a data structure with fast O(1) look-up that will only give
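A lossy hash table of the kind the answer mentions can be sketched as a fixed-size array in which each key owns exactly one slot and simply overwrites whatever was there before. This is an illustrative sketch, not the answerer's code; the class name `LossySet` and its parameters are assumptions. Because the full key is stored and compared, a positive answer is always correct, while an evicted key produces a false negative.

```python
class LossySet:
    """Fixed-size, direct-mapped set: false negatives, no false positives."""

    _EMPTY = object()  # sentinel so that None can be stored as a real key

    def __init__(self, num_slots):
        self.slots = [self._EMPTY] * num_slots

    def _index(self, key):
        return hash(key) % len(self.slots)

    def add(self, key):
        # Overwrite whatever was in the slot; the old key is forgotten.
        self.slots[self._index(key)] = key

    def __contains__(self, key):
        # The stored key is compared in full, so True is always correct.
        return self.slots[self._index(key)] == key


seen = LossySet(num_slots=4)
seen.add(1)
seen.add(5)       # 5 maps to the same slot as 1 and evicts it
print(5 in seen)  # True
print(1 in seen)  # False: a false negative, so the work would be redone
print(2 in seen)  # False: never added
```

Memory use is bounded by the number of slots, but unlike a Bloom filter each slot holds a whole key. Storing a fixed-size digest of the key instead would save space at the cost of reintroducing a small false positive risk.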

Can Bloom Filters in BigTable be used to filter based only on row ID?

落花浮王杯 submitted on 2019-12-11 03:32:37
Question: BigTable uses Bloom filters to allow point reads to avoid accessing SSTables that do not contain any data for a given key-column pair. Can these Bloom filters also be used to avoid accessing SSTables if the query only specifies the row ID and no column ID? BigTable uses row-column pairs as keys to insert into its Bloom filters. This means that a query can use these filters for a point read that specifies a row-column pair. Now, suppose we have a query to get all columns of a row based only

C++ Storing a dynamic_bitset into a file

这一生的挚爱 submitted on 2019-12-10 17:49:22
Question: Sort of a follow-up to "How does one store a vector<bool> or a bitset into a file, but bit-wise?" Basically, I am writing a bitset as a binary file with the following code:

    boost::dynamic_bitset<boost::dynamic_bitset<>::block_type> filter;
    vector<boost::dynamic_bitset<>::block_type> filterBlocks(filter.num_blocks());
    // populate vector blocks
    boost::to_block_range(filter, filterBlocks.begin());
    ofstream myFile(filterFilePath.c_str(), ios::out | ios::binary);
    // write out each block
    for (vector<boost:

non-repeating random numbers

旧城冷巷雨未停 submitted on 2019-12-10 10:38:37
Question: I need to generate around 9-100 million non-repeating random numbers, ranging from zero to the amount of numbers generated, and I need them to be generated very quickly. Several answers to similar questions proposed simply shuffling an array in order to get the random numbers, and others proposed using a Bloom filter. The question is, which one is more efficient, and if it is the Bloom filter, how do I use it?
Answer 1: You don't want random numbers at all. You want exactly the numbers
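The shuffling approach mentioned in the question can be sketched as follows. It is a minimal illustration with a made-up function name: every number in the range appears exactly once, so no duplicate check (and no Bloom filter) is needed. `random.Random.shuffle` performs an in-place Fisher-Yates style shuffle in O(n) time.

```python
import random


def random_permutation(n, seed=None):
    """Return the numbers 0..n-1 in random order, each exactly once."""
    rng = random.Random(seed)  # seed is optional, for reproducible runs
    numbers = list(range(n))
    rng.shuffle(numbers)       # in-place shuffle, O(n)
    return numbers


perm = random_permutation(10)
print(perm)
print(sorted(perm) == list(range(10)))  # True: nothing repeated or missing
```

For tens of millions of values a plain Python list is memory-hungry. A compact array type, for example NumPy's `numpy.random.permutation`, may be a better fit, though that is a suggestion rather than something the thread confirms.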