hashset

Java: optimize hashset for large-scale duplicate detection

99封情书 submitted on 2019-12-03 16:58:37
Question: I am working on a project where I am processing a lot of tweets; the goal is to remove duplicates as I process them. I have the tweet IDs, which come in as strings of the format "166471306949304320". I have been using a HashSet&lt;String&gt; for this, which works fine for a while, but by the time I get to around 10 million items I am drastically bogged down and eventually get a GC error, presumably from the rehashing. I tried defining a better size/load with tweetids = new HashSet&lt;String&gt;(220000,0…
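One common way to rein in memory use in this situation (an illustrative sketch, not taken from the original question or its answers) is to parse the numeric IDs into long values and pre-size the set, so each entry is a boxed Long rather than a full String; the class name and capacity below are assumptions:

    import java.util.HashSet;
    import java.util.Set;

    public class TweetDeduplicator {
        // Pre-sizing avoids repeated rehashing; 20_000_000 and 0.75f are
        // illustrative values, not figures from the original post.
        private final Set<Long> seenIds = new HashSet<>(20_000_000, 0.75f);

        /** Returns true the first time a tweet ID is seen, false for duplicates. */
        public boolean isNew(String tweetId) {
            // IDs like "166471306949304320" fit comfortably in a signed 64-bit long.
            return seenIds.add(Long.parseLong(tweetId));
        }
    }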

Why initialize HashSet<>(0) to zero?

≯℡__Kan透↙ submitted on 2019-12-03 16:23:26
Question: I love HashSet&lt;&gt;() and use it eagerly, initializing it with the default constructor: Set&lt;Users&gt; users = new HashSet&lt;&gt;(); Now, my automatic bean creator (JBoss tools) initializes it as: Set&lt;Users&gt; users = new HashSet&lt;&gt;(0); Why the zero? The API tells me that this is the initial capacity, but what is the advantage of setting it to zero? Is this advised? Answer 1: The default initial capacity is 16, so by passing in 0 you may save a few bytes of memory if you end up not putting…
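For illustration only (this snippet is not part of the original answer), the two constructor calls differ just in how large the backing table is when it is first needed:

    import java.util.HashSet;
    import java.util.Set;

    public class CapacityDemo {
        public static void main(String[] args) {
            // Default constructor: the backing table is sized for the
            // default capacity of 16 once an element is added.
            Set<String> users = new HashSet<>();

            // Initial capacity 0: the backing table starts as small as the
            // implementation allows, which can save a little memory for sets
            // that stay empty or tiny, at the cost of extra resizes if they grow.
            Set<String> generated = new HashSet<>(0);

            users.add("alice");
            generated.add("bob");
            System.out.println(users.size() + " " + generated.size()); // 1 1
        }
    }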

How HashSet works with regards to hashCode()?

本秂侑毒 submitted on 2019-12-03 13:32:28
I'm trying to understand java.util.Collection and java.util.Map a little deeper, but I have some doubts about HashSet functionality. The documentation says: "This class implements the Set interface, backed by a hash table (actually a HashMap instance)." OK, so I can see that a HashSet always has a hash table working in the background. A hash table is a structure that asks for a key and a value every time you want to add a new element to it. The value and the key are then stored in a bucket based on the key's hashCode. If the hash codes of two keys are the same, they add both key values to the same…
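To make the bucket behaviour concrete (a sketch added for illustration, not part of the original question): two elements whose hashCode() values collide land in the same bucket, but both are kept as long as equals() says they are different.

    import java.util.HashSet;
    import java.util.Objects;
    import java.util.Set;

    final class Key {
        private final String name;
        Key(String name) { this.name = name; }

        // Deliberately constant: every Key collides into the same bucket.
        @Override public int hashCode() { return 42; }

        // equals() still distinguishes keys, so colliding keys are not "duplicates".
        @Override public boolean equals(Object o) {
            return o instanceof Key && Objects.equals(name, ((Key) o).name);
        }
    }

    public class BucketDemo {
        public static void main(String[] args) {
            Set<Key> set = new HashSet<>();
            System.out.println(set.add(new Key("a"))); // true  - new element
            System.out.println(set.add(new Key("b"))); // true  - same bucket, but not equal
            System.out.println(set.add(new Key("a"))); // false - equal element already present
            System.out.println(set.size());            // 2
        }
    }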

Efficient way to clone a HashSet<T>?

守給你的承諾、 submitted on 2019-12-03 11:33:26
Question: A few days ago, I answered an interesting question on SO about HashSet&lt;T&gt;. A possible solution involved cloning the hashset, and in my answer I suggested doing something like this: HashSet&lt;int&gt; original = ... HashSet&lt;int&gt; clone = new HashSet&lt;int&gt;(original); Although this approach is quite straightforward, I suspect it's very inefficient: the constructor of the new HashSet&lt;T&gt; needs to separately add each item from the original hashset, and check whether it isn't already present. This is clearly a…

HashSet load factor

半世苍凉 submitted on 2019-12-03 11:17:23
Question: If I use a HashSet with an initial capacity of 10 and a load factor of 0.5, will the HashSet grow after every 5 elements added, or will it first grow at 10 elements, then at 15, then at 20, and so on? Answer 1: The load factor is a measure of how full the HashSet is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is…
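A small sketch of the threshold arithmetic (the 10 and 0.5 come from the question; the note about power-of-two rounding is an assumption about how the JDK's HashMap sizes its table):

    import java.util.HashSet;
    import java.util.Set;

    public class LoadFactorDemo {
        public static void main(String[] args) {
            int initialCapacity = 10;
            float loadFactor = 0.5f;

            Set<String> set = new HashSet<>(initialCapacity, loadFactor);

            // Growth is triggered when size exceeds capacity * loadFactor,
            // and the capacity then roughly doubles; it is not "every 5 elements".
            System.out.println("requested threshold = " + (initialCapacity * loadFactor)); // 5.0

            // Note: the backing table is actually rounded up to a power of two,
            // so the effective threshold here would be 16 * 0.5 = 8.
        }
    }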

Moving data from a HashSet to ArrayList in Java

房东的猫 submitted on 2019-12-03 11:14:51
I have the following Set in Java: Set&lt;Set&lt;String&gt;&gt; SetTemp = new HashSet&lt;Set&lt;String&gt;&gt;(); and I want to move its data to an ArrayList: ArrayList&lt;ArrayList&lt;String&gt;&gt; List = new ArrayList&lt;ArrayList&lt;String&gt;&gt;(); Is it possible to do that? You simply need to loop:

    Set<Set<String>> setTemp = new HashSet<Set<String>>();
    List<List<String>> list = new ArrayList<List<String>>();
    for (Set<String> subset : setTemp) {
        list.add(new ArrayList<String>(subset));
    }

Note: you should start variable names in lower case to follow Java conventions. abhishek ringsia: Moving Data HashSet to ArrayList Set…

How to avoid unchecked cast warning when cloning a HashSet?

拈花ヽ惹草 submitted on 2019-12-03 10:57:36
Question: I'm trying to make a shallow copy of a HashSet of Points called myHash. As of now, I have the following: HashSet&lt;Point&gt; myNewHash = (HashSet&lt;Point&gt;) myHash.clone(); However, this code gives me an unchecked cast warning. Is there a better way to do this? Answer 1: You can try this: HashSet&lt;Point&gt; myNewHash = new HashSet&lt;Point&gt;(myHash); Answer 2: A different answer suggests using new HashSet&lt;Point&gt;(myHash). However, the intent of clone() is to obtain a new object of the same type. If myHash is an…
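One further option, sketched here for illustration rather than quoted from the truncated answers above, is to isolate the cast in a small helper and suppress the warning there, keeping clone() semantics:

    import java.awt.Point;
    import java.util.HashSet;

    public class CloneHelper {
        // The cast is confined to this method; HashSet.clone() returns a shallow
        // copy of the set itself, so suppressing the warning here is localized.
        @SuppressWarnings("unchecked")
        static HashSet<Point> shallowCopy(HashSet<Point> original) {
            return (HashSet<Point>) original.clone();
        }

        public static void main(String[] args) {
            HashSet<Point> myHash = new HashSet<>();
            myHash.add(new Point(1, 2));
            HashSet<Point> myNewHash = shallowCopy(myHash);
            System.out.println(myNewHash); // [java.awt.Point[x=1,y=2]]
        }
    }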

How to convert List<T> to HashSet<T> in C#? [duplicate]

夙愿已清 submitted on 2019-12-03 10:26:05
This question already has an answer here: Convert an array to a HashSet&lt;T&gt; in .NET (7 answers). I have a List that has duplicates of objects. To solve that, I need to convert the List into a HashSet (in C#). Does anyone know how? Habib: Make sure your object's class overrides Equals and GetHashCode, and then you can pass the List&lt;T&gt; to the HashSet&lt;T&gt; constructor: var hashSet = new HashSet&lt;YourType&gt;(yourList); You may see: What is the best algorithm for an overridden System.Object.GetHashCode? An alternative way would be var yourlist = new List&lt;SomeClass&gt;(); // [...] var uniqueObjs = yourlist.Distinct(…

How does HashSet not allow duplicates?

浪子不回头ぞ submitted on 2019-12-03 10:25:58
I was going through the add method of HashSet. It is mentioned that "If this set already contains the element, the call leaves the set unchanged and returns false." But the add method internally saves the values in a HashMap:

    public boolean add(E e) {
        return map.put(e, PRESENT) == null;
    }

The put method of HashMap states: "Associates the specified value with the specified key in this map. If the map previously contained a mapping for the key, the old value is replaced." So if the put method of HashMap replaces the old value, how does the HashSet add method leave the set unchanged in case of…
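The resolution, sketched below with a trimmed-down model rather than the real JDK source: put only replaces the value, which in HashSet is always the same shared PRESENT dummy object, while the key (the set element) is left untouched; because put returns the previous non-null value for an existing key, add correctly reports false.

    import java.util.HashMap;
    import java.util.Map;

    // A minimal model of how HashSet delegates to HashMap (simplified, for illustration).
    public class MiniHashSet<E> {
        private static final Object PRESENT = new Object(); // shared dummy value
        private final Map<E, Object> map = new HashMap<>();

        public boolean add(E e) {
            // put() replaces only the *value* (always PRESENT) and returns the old one;
            // a non-null return means the key was already present, so the set is unchanged.
            return map.put(e, PRESENT) == null;
        }

        public static void main(String[] args) {
            MiniHashSet<String> set = new MiniHashSet<>();
            System.out.println(set.add("x")); // true  - no previous mapping
            System.out.println(set.add("x")); // false - only PRESENT was replaced by PRESENT
        }
    }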

Why can't I preallocate a hashset<T>

≯℡__Kan透↙ submitted on 2019-12-03 09:18:36
Why can't I preallocate a hashset&lt;T&gt;? There are times when I might be adding a lot of elements to it and I want to eliminate resizing. (The answer below was written in 2011. A capacity constructor is now available in .NET 4.7.2 and .NET Core 2.0, and it will be in .NET Standard 2.1.) There's no technical reason why this shouldn't be possible - Microsoft just hasn't chosen to expose a constructor with an initial capacity. If you can call a constructor which takes an IEnumerable&lt;T&gt; and pass in an implementation of ICollection&lt;T&gt;, I believe that will use the size of the collection as the initial minimum capacity. This is an implementation…