Delete duplicates in a single dataset relative to another one in C#

好久不见. Submitted on 2019-12-24 11:59:22

Question


I'm new to C#. I'm trying to remove duplicates from CollectionIn, but it doesn't work: no duplicates are removed from CollectionIn.

Just to clarify, CollectionIn has [A,B,C,D] and CollectionIn2 has [A,B,C].

So I want to remove the values (A,B,C) from CollectionIn.

for (int i = 0; i < CollectionIn.Rows.Count; i++) {
    string value1 = CollectionIn.Rows[i].ItemArray[0].ToString().ToLower(); 

    for (int i2 = 0; i2 < CollectionIn2.Rows.Count; i2++) {
        string value2 = CollectionIn2.Rows[i2].ItemArray[0].ToString().ToLower(); 

        if (value1 == value2) {
            //Remove value1 when value1 == value2
            CollectionIn.Rows[i].Delete(); // <--- Trying to delete when there is a duplicate in both collections

            CollectionIn.AcceptChanges();
        }
    }
    //CollectionOut.Rows.Add(value1);
}

I adapted this code from this link: http://www.rpaforum.net/threads/how-to-compare-two-excel-sheet-using-c-code-in-blueprism.897/


Answer 1:


Comparing two sets with nested loops has a complexity of O(n²). This is bad. You can improve on that if you build a hash lookup first.

var Set1 = new Dictionary<string, int>();

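//Prehash all values in the set that rows will be deleted from, keeping each row index
for (int i = 0; i < CollectionIn.Rows.Count; i++)
{
    string value1 = CollectionIn.Rows[i].ItemArray[0].ToString().ToLower();
    Set1.Add(value1, i); //Note: Add throws an ArgumentException if CollectionIn itself contains the same value twice
}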

//Loop over the other set
for (int i2 = 0; i2 < CollectionIn2.Rows.Count; i2++)
{
    string value2 = CollectionIn2.Rows[i2].ItemArray[0].ToString().ToLower();

    int foundIndex;
    if (Set1.TryGetValue(value2, out foundIndex) == false)
        continue;

    //Remove value1 when value1 == value2
    CollectionIn.Rows[foundIndex].Delete();
}
CollectionIn.AcceptChanges(); //It's probably best to save changes last as a single call

I hashed CollectionIn and then iterated over CollectionIn2. This means I needed a dictionary, so that I would have the CollectionIn row index for deleting. If this were reversed and CollectionIn2 were hashed, a HashSet would be enough, and it would be better because it can handle duplicates within CollectionIn itself. Therefore:

var Set2 = new HashSet<string>();

//Prehash all values in one set (ideally the larger set)
for (int i2 = 0; i2 < CollectionIn2.Rows.Count; i2++)
{
    string value2 = CollectionIn2.Rows[i2].ItemArray[0].ToString().ToLower();

    Set2.Add(value2); //Add returns false and changes nothing if the value is already present
}

//Loop over the other set
for (int i1 = 0; i1 < CollectionIn.Rows.Count; i1++)
{
    string value1 = CollectionIn.Rows[i1].ItemArray[0].ToString().ToLower();

    if (Set2.Contains(value1) == false)
        continue;

    //Remove value1 when value1 == value2
    CollectionIn.Rows[i1].Delete();
}

CollectionIn.AcceptChanges(); //It's probably best to save changes last as a single call

This pattern would apply to many data set types (including List, array, and more). Of course, if you can write SQL for remote datasets on the same database, that's better.
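
As a rough illustration of the same pattern on a plain List<string>, here is a minimal sketch. The names DuplicateRemover and RemoveMatches are made up for this example, and it uses StringComparer.OrdinalIgnoreCase in place of the ToLower() calls above, which is a substitution rather than what the code above does:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the original answer.
public static class DuplicateRemover
{
    // Removes from 'target' every value that also appears in 'other',
    // ignoring case. Returns the number of items removed.
    public static int RemoveMatches(List<string> target, IEnumerable<string> other)
    {
        // Hash the reference values once so each lookup is O(1) on average.
        var lookup = new HashSet<string>(other, StringComparer.OrdinalIgnoreCase);

        // List<T>.RemoveAll takes a predicate and removes all matches in one pass.
        return target.RemoveAll(lookup.Contains);
    }
}
```

Called with a target of [A,B,C,D] and the other values [a,b,c], this should leave only D in the target list.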

If you love lambda functions, it should look something like this:

var alreadyInSet2 = new HashSet<string>(CollectionIn2.Rows.Cast<DataRow>()
                    .Select(x => x[0].ToString().ToLower()));

CollectionIn.Rows.Cast<DataRow>()
                    .Where(y => alreadyInSet2.Contains(y[0].ToString().ToLower())) //Select only the rows that also exist in CollectionIn2
                    .ToList() //ForEach is a List<T> method, and the copy keeps the deletes from disturbing the enumeration
                    .ForEach(y => y.Delete());

CollectionIn.AcceptChanges();

Also see: "With two very large lists/collections - how to detect and/or remove duplicates efficiently", where more time and work can go into a broader range of answers and performance enhancements.




Answer 2:


You can remove duplicates within a single list using the Distinct operator.

To remove duplicates from something like an IList<> you could do:

yourList = yourList.Distinct().ToList();



Answer 3:


foreach(var row in CollectionIn.Rows.Cast<DataRow>()
    .Where(x => CollectionIn2.Rows.Cast<DataRow>()
    .Any(y => y[0].ToString().ToLower() == x[0].ToString().ToLower()))
    .ToList()) //Copy the matches first so deleting never disturbs the enumeration
{
    row.Delete();
}
CollectionIn.AcceptChanges();

Not the best performance, but it works and is easy to read.

Also, your code has a bug caused by modifying the collection while iterating over it: once AcceptChanges removes row i, the rows after it shift down by one, so the next i++ skips a row.
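
One common way to avoid that kind of bug is to walk the rows from the last index to the first, so a removal never shifts a row that has not been visited yet. A minimal sketch, assuming both inputs are DataTables with the value to compare in column 0; the names RowCleaner and RemoveRowsFoundIn are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical helper for illustration; not part of the original answer.
public static class RowCleaner
{
    // Removes every row of 'target' whose first column also appears
    // in the first column of 'other' (case-insensitive comparison).
    public static void RemoveRowsFoundIn(DataTable target, DataTable other)
    {
        var lookup = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (DataRow row in other.Rows)
            lookup.Add(row[0].ToString());

        // Iterate backwards: removing row i only shifts the rows after i,
        // and those have already been checked.
        for (int i = target.Rows.Count - 1; i >= 0; i--)
        {
            if (lookup.Contains(target.Rows[i][0].ToString()))
                target.Rows.RemoveAt(i); // Removes immediately, no AcceptChanges needed
        }
    }
}
```

RemoveAt takes the row out of the collection straight away, which is why this sketch does not call Delete() or AcceptChanges().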




Answer 4:


It works and is easy to understand.

List<string> List1 = new List<string> { "A", "B", "C", "D" };
List<string> List2 = new List<string> { "A", "B", "C" };
List<string> ListTemp = new List<string>();

foreach (string str1 in List1)
{
     foreach (string str2 in List2)
     {
          if (str1 == str2)
          {
               ListTemp.Add(str1);
          }
     }
}

foreach (string temp in ListTemp)
{
     List1.Remove(temp);
}



Answer 5:


mylist2 = mylist2.Distinct().ToList();
mylist1.RemoveAll(item => mylist2.Contains(item));


Source: https://stackoverflow.com/questions/51015932/delete-duplicates-in-a-single-dataset-relative-to-another-one-in-c-sharp
