bloom-filter

Using hash functions with Bloom filters

Submitted by 岁酱吖の on 2019-11-30 16:13:55
Question: A Bloom filter uses a hash function (or several) to generate a value between 0 and m given an input string X. My question is how you use a hash function to generate a value in this way. For example, an MD5 hash is typically represented as a 32-character hex string; how would I use the MD5 hashing algorithm to generate a value between 0 and m, where I can specify m? I'm using Java at the moment, so an example of how to do this with the MessageDigest functionality it offers would be great, though just a…
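The asker wants Java, but the technique is language-agnostic, so here is a minimal sketch in Python's hashlib (Java's MessageDigest plus BigInteger.mod works the same way): interpret the digest bytes as one large unsigned integer and reduce it modulo m. The salt parameter is an illustrative addition, not from the original question; prefixing different salts turns one digest algorithm into many hash functions.

```python
import hashlib

def hash_to_range(x: str, m: int, salt: str = "") -> int:
    """Map a string to an integer in [0, m) by treating the MD5
    digest as a big-endian unsigned integer and reducing mod m."""
    digest = hashlib.md5((salt + x).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % m

# Different salts act as different hash functions for the filter.
i1 = hash_to_range("hello", 1000, salt="1")
i2 = hash_to_range("hello", 1000, salt="2")
```

In Java the equivalent is `new BigInteger(1, md.digest()).mod(BigInteger.valueOf(m))`.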

python bit array (performant)

Submitted by 耗尽温柔 on 2019-11-30 04:51:16
Question: I'm designing a Bloom filter and I'm wondering what the most performant bit-array implementation is in Python. The nice thing about Python is that it can handle arbitrary-length integers out of the box, and that's what I use now, but I don't know enough about Python internals to know whether that's the most performant way to do it. I found bitarray, but it handles a lot of other things like slicing, which I don't need. I only need the &, | and << operations. Answer 1: The built-in int is…

Need memory efficient way to store tons of strings (was: HAT-Trie implementation in java)

Submitted by ≯℡__Kan透↙ on 2019-11-29 22:18:37
I am working with a large set (5-20 million) of String keys (average length 10 chars) which I need to store in an in-memory data structure that supports the following operation in constant or near-constant time: // Returns true if the input is present in the container, false otherwise public boolean contains(String input) Java's HashMap is proving more than satisfactory as far as throughput is concerned but is taking up a lot of memory. I am looking for a solution that is memory-efficient and still supports decent throughput (comparable with, or nearly as good as, hashing).
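Not a HAT-trie, but a sketch of the space/throughput trade-off the question is weighing: a single sorted array of keys answers contains() in O(log n) with far less per-entry overhead than a hash map (no buckets, no entry objects), at the cost of giving up constant-time lookup. The class name is illustrative, not from the original post.

```python
import bisect

class SortedStringSet:
    """Memory-lean membership set: one sorted list of strings,
    with contains() answered by binary search in O(log n)."""

    def __init__(self, keys):
        # One flat sorted list; no per-key node or bucket overhead.
        self._keys = sorted(set(keys))

    def contains(self, key: str) -> bool:
        i = bisect.bisect_left(self._keys, key)
        return i < len(self._keys) and self._keys[i] == key
```

In Java the same idea is a sorted `String[]` queried with `Arrays.binarySearch`; for 5-20 million ten-character keys that is roughly the keys themselves plus one reference each.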

Opposite of Bloom filter?

Submitted by 有些话、适合烂在心里 on 2019-11-29 19:57:07
I'm trying to optimize a piece of software which is basically running millions of tests. These tests are generated in such a way that there can be some repetitions. Of course, I don't want to spend time running tests which I have already run if I can avoid it efficiently. So I'm thinking about using a Bloom filter to store the tests which have already been run. However, the Bloom filter errs on the unsafe side for me: it gives false positives. That is, it may report that I've run a test which I haven't. Although this could be acceptable in the scenario I'm working on, I was wondering if there's an…
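One way to flip the error direction, sketched under the assumption (mine, not the poster's) that forgetting a test and re-running it is safe, while falsely skipping one is not: a fixed-size table whose slots store the actual keys. Collisions overwrite, producing false negatives, but a membership answer of "seen" is always exact.

```python
import hashlib

class LossySet:
    """Fixed-size set that may forget entries (false negatives)
    but never lies about membership (no false positives): each
    slot stores the actual key, and collisions simply overwrite."""

    def __init__(self, slots: int):
        self._table = [None] * slots

    def _slot(self, key: str) -> int:
        h = hashlib.blake2b(key.encode("utf-8")).digest()
        return int.from_bytes(h[:8], "big") % len(self._table)

    def add(self, key: str) -> None:
        self._table[self._slot(key)] = key

    def contains(self, key: str) -> bool:
        # Exact string comparison: a hash collision can evict an
        # old key, but can never make an absent key look present.
        return self._table[self._slot(key)] == key
```

This is essentially a one-way cache; the more slots, the fewer tests are forgotten and re-run.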

Which hash functions to use in a Bloom filter

Submitted by 爱⌒轻易说出口 on 2019-11-28 21:26:30
I've got the following question about choosing hash functions for Bloom filters: which functions should I use? In nearly every document/paper you can read that the hash functions used in a Bloom filter should be independent and uniformly distributed. I know what is meant by this (independent and uniformly distributed), but I'm having trouble finding an argument or discussion about which hash functions fulfill those requirements and are therefore suitable. In a lot of posts I've read suggestions to use the FNV or Murmur hash functions, but not why (or at least without a proof) they…
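One common answer (mine, not from the post itself) is that you don't need k independent functions at all: the Kirsch-Mitzenmacher double-hashing construction derives all k index functions from two base hashes, g_i(x) = (h1(x) + i*h2(x)) mod m, with no measurable loss in false-positive rate. A sketch, taking the two base hashes from the two halves of a single MD5 digest:

```python
import hashlib

def indexes(key: str, k: int, m: int) -> list[int]:
    """Kirsch-Mitzenmacher double hashing: k filter indexes
    g_i = (h1 + i*h2) mod m from two 64-bit base hashes taken
    from the halves of one MD5 digest."""
    d = hashlib.md5(key.encode("utf-8")).digest()
    h1 = int.from_bytes(d[:8], "big")
    # Force h2 odd so the stride is non-zero and, for power-of-two
    # m, coprime with m (every slot remains reachable).
    h2 = int.from_bytes(d[8:], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]
```

With this scheme the choice of FNV vs. Murmur vs. MD5 matters mostly for speed and distribution quality, not for obtaining k functions.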

How many hash functions does my bloom filter need?

Submitted by 我的未来我决定 on 2019-11-28 17:37:36
Wikipedia says: An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution. I read the article, but what I don't understand is how k is determined. Is it a function of the table size? Also, in hash tables I've written I used a simple but effective algorithm for automatically growing the hash's size: basically, if ever more than 50% of the buckets in the table were filled, I would double the size of the table. I suspect you…
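For reference, k is not guessed: it follows from the bits-per-element ratio m/n. A sketch of the standard formulas, with symbols as in the Wikipedia quote above (m bits, n inserted elements, k hash functions):

```python
import math

def optimal_k(m: int, n: int) -> int:
    """k that minimizes false positives: k = (m/n) * ln 2."""
    return max(1, round((m / n) * math.log(2)))

def false_positive_rate(m: int, n: int, k: int) -> float:
    """Standard approximation: p = (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

# Example: 10 bits per element gives k = 7 and roughly a 0.8% rate.
k = optimal_k(10_000, 1_000)        # -> 7
p = false_positive_rate(10_000, 1_000, k)
```

Unlike a hash table, a Bloom filter cannot be grown in place at 50% load: the bits don't carry enough information to rehash, so you either size it for the expected n up front or layer a scalable-Bloom-filter scheme on top.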

Modern, high performance bloom filter in Python?

Submitted by 狂风中的少年 on 2019-11-28 13:30:14
Question: I'm looking for a production-quality Bloom filter implementation in Python to handle fairly large numbers of items (say 100M to 1B items with a 0.01% false positive rate). Pybloom is one option, but it seems to be showing its age, as it regularly throws DeprecationWarning errors on Python 2.5. Joe Gregorio also has an implementation. Requirements are fast lookup performance and stability. I'm also open to creating Python interfaces to particularly good C/C++ implementations, or even to…

What is the advantage to using bloom filters?

Submitted by 我与影子孤独终老i on 2019-11-28 02:37:45
I am reading up on Bloom filters and they just seem silly. Anything you can accomplish with a Bloom filter, you could seemingly accomplish in less space, more efficiently, using a single hash function rather than several. Why would you use a Bloom filter, and how is it useful? From Wikipedia: Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require…
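The space claim can be made concrete: a Bloom filter's size depends only on the number of elements n and the target false-positive rate p, never on the size of the items themselves. A quick sketch of the arithmetic:

```python
import math

def bits_per_element(p: float) -> float:
    """Bloom filter space per stored element, independent of key
    size: m/n = -ln p / (ln 2)^2, about 1.44 * log2(1/p)."""
    return -math.log(p) / math.log(2) ** 2

# A 1% false-positive rate costs about 9.6 bits (1.2 bytes) per
# key, whether each key is 10 bytes or 10 kilobytes; a hash set
# must store the keys themselves, so its cost grows with key size.
cost = bits_per_element(0.01)
```

That constant, key-size-independent cost is what a single hash function cannot give you while still bounding the error rate.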
