Check array for duplicates, return only items which appear more than once

Question

I have a text document of emails such as

Google12@gmail.com,
MyUSERNAME@me.com,
ME@you.com,
ratonabat@co.co,
iamcool@asd.com,
ratonabat@co.co,

I need to check said document for duplicates and create a unique array from that (so if "ratonabat@co.co" appears 500 times, it will appear only once in the new array).

Edit: For an example:

username1@hotmail.com
username2@hotmail.com
username1@hotmail.com
username1@hotmail.com
username1@hotmail.com
username1@hotmail.com

This is my "data" (either in an array or text document, I can handle that)

I want to be able to see if there's a duplicate in that, and move the duplicate ONCE to another array. So the output would be

username1@hotmail.com

Answer 1:

You can simply use LINQ's Distinct extension method:

var input = new string[] { ... };
var output = input.Distinct().ToArray();

You may also want to consider refactoring your code to use a HashSet<string> instead of a simple array, as it will gracefully handle duplicates.
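As a minimal sketch of the HashSet<string> idea: HashSet<T>.Add returns false when the item is already present, so repeats are ignored as the data is loaded. The names EmailSetExample and LoadUnique are illustrative assumptions, not part of the original code.

```csharp
using System;
using System.Collections.Generic;

public static class EmailSetExample
{
    // Hypothetical helper: collects addresses into a set, silently dropping repeats.
    public static HashSet<string> LoadUnique(IEnumerable<string> addresses)
    {
        var unique = new HashSet<string>();
        foreach (var address in addresses)
        {
            // Add returns false (and leaves the set unchanged) for an address already present.
            unique.Add(address);
        }
        return unique;
    }
}
```

Calling EmailSetExample.LoadUnique(new[] { "ratonabat@co.co", "iamcool@asd.com", "ratonabat@co.co" }) gives a set with two entries. Note that the default comparer is case sensitive.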

To get an array containing only those records which are duplicates, it's a little more complex, but you can still do it with a little LINQ:

var output = input.GroupBy(x => x)
                  .Where(g => g.Skip(1).Any())
                  .Select(g => g.Key)
                  .ToArray();

Explanation:

.GroupBy groups identical strings together.
.Where filters the groups by the following criterion:
- .Skip(1).Any() returns true if the group contains two or more items. This is equivalent to .Count() > 1, but it's slightly more efficient because it stops enumerating once it finds a second item.
.Select returns just the key string for each group (rather than the group itself).
.ToArray converts the result sequence to an array.
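To make the steps concrete, here is the same GroupBy query wrapped in a method so it can be run against the sample data from the question. The names DuplicateExample and FindDuplicates are assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DuplicateExample
{
    // Hypothetical wrapper around the GroupBy query shown above.
    public static string[] FindDuplicates(IEnumerable<string> input)
    {
        return input.GroupBy(x => x)              // one group per distinct address
                    .Where(g => g.Skip(1).Any())  // keep groups with at least two members
                    .Select(g => g.Key)           // reduce each group to its address
                    .ToArray();
    }
}
```

With the six sample lines from the question (username1@hotmail.com five times, username2@hotmail.com once), FindDuplicates returns an array containing only "username1@hotmail.com".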

Here's another solution using a custom extension method:

public static class MyExtensions
{
    public static IEnumerable<T> Duplicates<T>(this IEnumerable<T> input)
    {
        var a = new HashSet<T>();
        var b = new HashSet<T>();
        foreach(var x in input)
        {
            if (!a.Add(x) && b.Add(x))
                yield return x;
        }
    }
}

And then you can call this method like this:

var output = input.Duplicates().ToArray();

I haven't benchmarked this, but it should be more efficient than the previous method, since it makes a single pass over the input and doesn't build a group for every distinct value.

Answer 2:

You can use the built-in .Distinct() method. By default the comparisons are case sensitive; if you want them to be case insensitive, use the overload that takes a comparer and pass it a case-insensitive string comparer.

List<string> emailAddresses = GetListOfEmailAddresses();
string[] uniqueEmailAddresses = emailAddresses.Distinct(StringComparer.OrdinalIgnoreCase).ToArray();

EDIT: Now I see after you made your clarification you only want to list the duplicates.

string[] duplicateAddresses = emailAddresses.GroupBy(address => address,
                                                    (key, rows) => new {Key = key, Count = rows.Count()}, 
                                                    StringComparer.OrdinalIgnoreCase)
                                            .Where(row => row.Count > 1)
                                            .Select(row => row.Key)
                                            .ToArray();
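If a single-pass approach is preferred, the two-HashSet technique can be combined with a comparer in the same way. This is only a sketch: the class name ComparerDuplicateExtensions and this overload are hypothetical, and it relies on the HashSet<T> constructor that accepts an IEqualityComparer<T>.

```csharp
using System;
using System.Collections.Generic;

public static class ComparerDuplicateExtensions
{
    // Hypothetical variant: yields each repeated item once, using the supplied comparer.
    public static IEnumerable<T> Duplicates<T>(this IEnumerable<T> input, IEqualityComparer<T> comparer)
    {
        var seen = new HashSet<T>(comparer);      // every item encountered so far
        var reported = new HashSet<T>(comparer);  // items already yielded as duplicates
        foreach (var item in input)
        {
            // Yield only on the first repeat: already seen, not yet reported.
            if (!seen.Add(item) && reported.Add(item))
                yield return item;
        }
    }
}
```

It could be called as emailAddresses.Duplicates(StringComparer.OrdinalIgnoreCase).ToArray() (with System.Linq imported for ToArray). Unlike the GroupBy version, which returns the key of the first occurrence, this yields the casing of the second occurrence, because that is the item being examined when the repeat is detected.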

Answer 3:

To select emails which occur more than once:

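// Split on commas and line breaks and trim each entry, so that "\nratonabat@co.co" and "ratonabat@co.co" compare equal.
var dupEmails = from emails in File.ReadAllText(path)
                    .Split(new[] { ',', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
                    .Select(x => x.Trim()).Where(x => x.Length > 0).GroupBy(x => x)
                where emails.Count() > 1
                select emails.Key;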

Source: https://stackoverflow.com/questions/19852273/check-array-for-duplicates-return-only-items-which-appear-more-than-once

Tags

arrays

string

text

duplicates