Java: optimize hashset for large-scale duplicate detection

老子叫甜甜 提交于 2019-12-03 05:59:52

You may want to look beyond the Java collections framework. I've done some memory intensive processing and you will face several problems

  1. The number of buckets for large hashmaps and hash sets is going to cause a lot of overhead (memory). You can influence this by using some kind of custom hash function and a modulo of e.g. 50000
  2. Strings are represented using 16 bit characters in Java. You can halve that by using utf-8 encoded byte arrays for most scripts.
  3. HashMaps are in general quite wasteful data structures and HashSets are basically just a thin wrapper around those.

Given that, take a look at trove or guava for alternatives. Also, your ids look like longs. Those are 64 bit, quite a bit smaller than the string representation.

An alternative you might want to consider is using bloom filters (guava has a decent implementation). A bloom filter would tell you if something is definitely not in a set and with reasonable certainty (less than 100%) if something is contained. That combined with some disk based solution (e.g. database, mapdb, mecached, ...) should work reasonably well. You could buffer up incoming new ids, write them in batches, and use the bloom filter to check if you need to look in the database and thus avoid expensive lookups most of the time.

If you are just looking for the existence of Strings, then I would suggest you try using a Trie(also called a Prefix Tree). The total space used by a Trie should be less than a HashSet, and it's quicker for string lookups.

The main disadvantage is that it can be slower when used from a harddisk as it's loading a tree, not a stored linearly structure like a Hash. So make sure that it can be held inside of RAM.

The link I gave is a good list of pros/cons of this approach.

*as an aside, the bloom filters suggested by Jilles Van Gurp are great fast prefilters.

Simple, untried and possibly stupid suggestion: Create a Map of Sets, indexed by the first/last N characters of the tweet ID:

Map<String, Set<String>> sets = new HashMap<String, Set<String>>();
String tweetId = "166471306949304320";
sets.put(tweetId.substr(0, 5), new HashSet<String>());
sets.get(tweetId.substr(0, 5)).add(tweetId);
assert(sets.containsKey(tweetId.substr(0, 5)) && sets.get(tweetId.substr(0, 5)).contains(tweetId));

That easily lets you keep the maximum size of the hashing space(s) below a reasonable value.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!