Remove duplicates from array of objects

Submitted by *爱你&永不变心* on 2019-12-10 18:08:03

Question


I have a class called Customer that has several string properties like

firstName, lastName, email, etc.  

I read the customer information from a CSV file, which produces an array of that class:

Customer[] customers  

I need to remove the duplicate customers having the same email address, leaving only 1 customer record for each particular email address.

I have done this using two loops, but it takes nearly 5 minutes because there are usually 50,000+ customer records. Once I am done removing the duplicates, I need to write the customer information to another CSV file (no help needed here).

If I used Distinct in a loop, how would I also remove the other string properties that belong to that particular customer?

Thanks, Andrew


Answer 1:


With LINQ, you can do this in O(n) time (a single pass over the array) with GroupBy. Here persons stands for your customers array and Email for its email property:

var uniquePersons = persons.GroupBy(p => p.Email)
                           .Select(grp => grp.First())
                           .ToArray();
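
To try the GroupBy approach above end to end, here is a minimal, self-contained sketch. The shape of the Customer class is an assumption based on the property names in the question, and CustomerDeduplication and DistinctByEmail are hypothetical names used only for illustration. Passing StringComparer.OrdinalIgnoreCase is also an assumption (it treats "JON@example.com" and "jon@example.com" as the same address); drop that argument to get the exact-match behavior of the original snippet.

```csharp
using System;
using System.Linq;

// Assumed shape of the class described in the question.
public class Customer
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
}

// Hypothetical helper, not part of the original answer.
public static class CustomerDeduplication
{
    // Keeps the first record encountered for each email address.
    public static Customer[] DistinctByEmail(Customer[] customers)
    {
        return customers
            // Assumption: email addresses are compared case-insensitively.
            .GroupBy(c => c.Email, StringComparer.OrdinalIgnoreCase)
            // Elements inside a group keep their source order, so First()
            // is the earliest record with that email.
            .Select(group => group.First())
            .ToArray();
    }
}

public static class Program
{
    public static void Main()
    {
        var customers = new[]
        {
            new Customer { FirstName = "Jon", LastName = "Mullen", Email = "jon@example.com" },
            new Customer { FirstName = "Jane", LastName = "Cullen", Email = "jane@example.com" },
            new Customer { FirstName = "Other", LastName = "Entry", Email = "JON@example.com" },
        };

        foreach (var c in CustomerDeduplication.DistinctByEmail(customers))
            Console.WriteLine($"{c.Email}: {c.FirstName} {c.LastName}");
    }
}
```

Because the whole Customer object travels through the grouping, the other properties (first name, last name, and so on) stay attached to the record that is kept, so nothing else has to be removed by hand.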

Update

A note on the O(n) behavior of GroupBy.

In LINQ to Objects (Enumerable.cs), GroupBy iterates the IEnumerable only once to build the groupings. A hash of the provided key (Email here) is used to find unique keys, and each element is added to the grouping that corresponds to its key.

See the GetGrouping code in that file, and these older posts for reference:

  • What's the asymptotic complexity of GroupBy operation?
  • What guarantees are there on the run-time complexity (Big-O) of LINQ methods?

The Select that follows is also O(n), so the code above is O(n) overall (assuming the usual constant-time average cost of a hash lookup).
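
To make the single-pass argument concrete, here is a simplified sketch of the same idea reduced to the deduplication case. It is not the actual Enumerable.cs source; SinglePassDedup and KeepFirstPerKey are hypothetical names, and the tuple records in Main are made-up sample data.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of hash-based, single-pass deduplication.
public static class SinglePassDedup
{
    public static List<T> KeepFirstPerKey<T>(IEnumerable<T> source, Func<T, string> keySelector)
    {
        var seenKeys = new HashSet<string>();
        var result = new List<T>();

        // One pass over the input: each item costs one hash lookup/insert.
        foreach (var item in source)
        {
            // HashSet<T>.Add returns false when the key is already present,
            // so only the first item for each key is kept.
            if (seenKeys.Add(keySelector(item)))
                result.Add(item);
        }
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        var records = new[]
        {
            (Name: "Jon Mullen", Email: "jon@example.com"),
            (Name: "Jane Cullen", Email: "jane@example.com"),
            (Name: "Test Other", Email: "jon@example.com"),
        };

        foreach (var r in SinglePassDedup.KeepFirstPerKey(records, r => r.Email))
            Console.WriteLine($"{r.Email}: {r.Name}");
    }
}
```

Compared with two nested loops, which compare every record against every other record (roughly n² comparisons), this does one hash operation per record, which is why 50,000+ rows should no longer take minutes.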

Update 2

Handling empty/null values.

If some objects have a null or empty Email, the simple GroupBy keeps just one object for the null key and one for the empty-string key.

One quick way to keep all of those objects is to give each of them a unique key at run time, like this:

var tempEmailIndex = 0;
var uniqueNullAndEmpty = persons
                         .GroupBy(p => string.IsNullOrEmpty(p.Email) 
                                       ? (++tempEmailIndex).ToString() : p.Email)
                         .Select(grp => grp.First())
                         .ToArray();
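
The counter-based key above relies on a side effect inside the key selector, and a generated key such as "1" could in principle collide with a real Email value. As a possible alternative, the sketch below builds the key from the element's position instead. It assumes C# 7 or later for tuple syntax; the Person shape, NullSafeDedup, and DistinctByEmailKeepingBlanks are hypothetical names for illustration only.

```csharp
using System;
using System.Linq;

// Assumed minimal shape; the answer's Person class is not shown in full.
public class Person
{
    public string Name { get; set; }
    public string Email { get; set; }
}

// Hypothetical helper, not part of the original answer.
public static class NullSafeDedup
{
    public static Person[] DistinctByEmailKeepingBlanks(Person[] persons)
    {
        return persons
            // Pair each record with its position in the array.
            .Select((person, index) => (Record: person, Index: index))
            // Blank emails get a key that includes their own index, so each
            // forms its own group; real emails share the index -1, so
            // duplicates fall into the same group.
            .GroupBy(x => string.IsNullOrEmpty(x.Record.Email)
                ? (Email: string.Empty, Index: x.Index)
                : (Email: x.Record.Email, Index: -1))
            .Select(group => group.First().Record)
            .ToArray();
    }
}

public static class Program
{
    public static void Main()
    {
        var persons = new[]
        {
            new Person { Name = "Jon", Email = "jon@example.com" },
            new Person { Name = "No email 1", Email = null },
            new Person { Name = "No email 2", Email = "" },
            new Person { Name = "Jon again", Email = "jon@example.com" },
        };

        foreach (var p in NullSafeDedup.DistinctByEmailKeepingBlanks(persons))
            Console.WriteLine(p.Name);
    }
}
```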



Answer 2:


I'd do it like this:

public class Person {
    public Person(string eMail, string Name) {
        this.eMail = eMail;
        this.Name = Name;
    }
    public string eMail { get; set; }
    public string Name { get; set; }
}
public class eMailKeyedCollection : System.Collections.ObjectModel.KeyedCollection<string, Person> {
    protected override string GetKeyForItem(Person item) {
        return item.eMail;
    }
}

public void testIt() {
    var testArr = new Person[5];
    testArr[0] = new Person("Jon@Mullen.com", "Jon Mullen");
    testArr[1] = new Person("Jane@Cullen.com", "Jane Cullen");
    testArr[2] = new Person("Jon@Cullen.com", "Jon Cullen");
    testArr[3] = new Person("John@Mullen.com", "John Mullen");
    testArr[4] = new Person("Jon@Mullen.com", "Test Other"); //same eMail as index 0...

    var targetList = new eMailKeyedCollection();
    foreach (var p in testArr) {
        if (!targetList.Contains(p.eMail))
            targetList.Add(p);
    }
}

If the item is already in the collection, you can easily retrieve it (and modify it if needed) with:

        if (!targetList.Contains(p.eMail))
            targetList.Add(p);
        else {
            var currentPerson = targetList[p.eMail];
            // modify Name, Address, whatever...
        }


Source: https://stackoverflow.com/questions/34142463/remove-duplicates-from-array-of-objects
