What's better for creating distinct data structures: HashSet or Linq's Distinct()?

喜你入骨 提交于 2019-11-27 02:40:26

问题


I'm wondering whether I can get a consensus on which method is the better approach to creating a distinct set of elements: a C# HashSet or using IEnumerable's .Distinct(), which is a Linq function?

Let's say I'm looping through query results from the DB with DataReader, and my options are to add the objects I construct to a List<SomeObject> or to a HashSet<SomeObject> With the List option, I would wind up having to do something like:

myList = myList.Distinct().ToList<SomeObject>();

With the HashSet, my understanding is that adding elements to it takes care of the non-duplication by itself, assuming you've overrided the GetHashCode() and Equals() methods in SomeObject. I'm concerned mainly with the risks and performance aspects of the options.

Thanks.


回答1:


"Better" is a tricky word to use - it can mean so many different things to different people.

For readability, I would go for Distinct() as I personally find this more comprehensible.

For performance, I suspect a hand-crafted HashSet implementation might perform mildly quicker - but I doubt it would be very different as the internal implementation of Distinct will no doubt itself use some form of hashing.

For what I think of as "best" implementation... I think you should use Distinct but somehow push this down to the database layer - i.e. change the underlying database SELECT before you fill the DataReader.




回答2:


Anthony Pegram has said it the best. Use the right tool for the job. I say this because a Distinct or HashSet isn't that big different when it comes to performance. Use a HashSet when the collection should always hold only distinct stuffs. It also tells the programmer that you cant add duplicates to it. Use a normal List<T> and .Distinct() ont it when you will have to add duplicates and remove duplicates later. The intention matters.

In general,

a) a HashSet may not do any good if you're adding new objects from db and you haven't specified a custom Equals of your own. Every object from db can be a new instance for your hashset (if you are just new-ing) and that will lead to duplicates in the collection. In that case use normal List<T>.

b) If you do have an equality comparer defined for hashset, and your collection should always hold only distinct objects, use hashset.

c) If you do have an equality comparer defined for hashset, and you want only distinct objects from db but collection need not always hold only distinct objects (ie duplicates needed to be added later), a faster approach is to get the items from db to a hashset and then return a regular list from that hashset.

d) The best thing you should do is to give the task of removing duplicates to database, thats the right tool And that's first class!

As for performance differences, in my testing I always found HashSet to be faster, but then that's only marginal. That's obvious considering with List approach you have to first add and then do a distinct on it.

Test method: Starting with two general functions,

public static void Benchmark(Action method, int iterations = 10000)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < iterations; i++)
        method();

    sw.Stop();
    MsgBox.ShowDialog(sw.Elapsed.TotalMilliseconds.ToString());
}

public static List<T> Repeat<T>(this ICollection<T> lst, int count)
{
    if (count < 0)
        throw new ArgumentOutOfRangeException("count");

    var ret = Enumerable.Empty<T>();

    for (var i = 0; i < count; i++)
        ret = ret.Concat(lst);

    return ret.ToList();
}

Implementation:

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();

Benchmark(() =>
{
    hash.Clear();
    foreach (var item in d)
    {
        hash.Add(item);
    }
});

~3300 ms

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();

Benchmark(() =>
{
    list.Clear();
    foreach (var item in d)
    {
        list.Add(item);
    }

    list = list.Distinct().ToList();
});

~5800 ms

A difference of 2.5 seconds is not bad for a list of 10000 objects when iterated another 10000 times. For normal cases the difference will be hardly noticeable.

The best approach possibly for you with your current design:

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();

Benchmark(() =>
{
    hash.Clear();
    foreach (var item in d)
    {
        hash.Add(item);
    }

    list = hash.ToList();
});

~3300 ms

There isn't any significant difference, see..


Partly unrelated - after posting this answer, I was curious to know what's the best approach in removing duplicates, from a normal list.

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();

Benchmark(() =>
{
    hash = new HashSet<int>(d);
});

~3900 ms

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();

Benchmark(() =>
{
    list = d.Distinct().ToList();
});

~3200 ms

Here the right tool Distinct is faster than hackish HashSet! Perhaps its the overhead of creating a hash set.


I have tested with various other combinations like reference types, without duplicates in original list etc. The results are consistent.




回答3:


What's better is what's the most expressive of describing your intention. The internal implementation details are more or less going to be the same, the difference being "who's writing the code?"

If your intention is to create from the ground up a distinct collection of items from a source that is not a collection of said items, I would argue for the HashSet<T>. You have to create the item, you have to build the collection, you might as well build the right one from the beginning.

Otherwise, if you already have a collection of items and you want to eliminate duplicates, I would argue for invoking Distinct(). You already have a collection, you just want an expressive way to get the distinct items out of it.




回答4:


For large collections HashSet is likely to be faster. It relies on the hashcode of the objects to quickly determine whether or not an element already exists in the set.

In practice, it (most likely) won't matter (but you should measure if you care).

I instinctively guessed at first that HashSet would be faster, because of the fast hash checking it uses. However, I looked up the current (4.0) implementation of Distinct in the reference sources, and it uses a similar Set class (which also relies on hashing) under the covers. Conclusion; there are no practical performance difference.

For your case, I would go with .Distinct for readability - it clearly conveys the intent of the code. However, I agree with one of the other answers, that you probably should perform this operationn in the DB if possible.




回答5:


If yor looping through the results of a DbReader adding your resutls to a Hashset would be better than adding it to a List and than doing a Distinct on that. You would save one itteration. (Distinct internally uses a HashSet)




回答6:


The implementation of Distinct may use HashSet. Take a look at Jon Skeet's Edulinq implementation.



来源:https://stackoverflow.com/questions/6298679/whats-better-for-creating-distinct-data-structures-hashset-or-linqs-distinct

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!