Efficiently identify changed fields in CSV files using C#

Backend · Open · 5 answers · 1700 views

梦毁少年i · 2020-12-11 07:09

This turned out to be more difficult than I thought. Basically, each day a snapshot of a customer master list is dumped by a system into a CSV file. It contains about 120,000 records, and I need to compare each day's snapshot against the previous day's to identify which fields changed for each customer.

5 Answers
  •  渐次进展 · 2020-12-11 07:50

    For the purposes of the discussion below, I'll assume that you have some way of reading the CSV files into a class. I'll call that class MyRecord.
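
    As a concrete starting point, here is a minimal sketch of what MyRecord might look like. The Id, Name, and Email properties are assumptions for illustration, not columns from the question:

    public class MyRecord : IComparable<MyRecord>
    {
        // Hypothetical columns; substitute the real fields from the CSV.
        public int Id { get; set; }       // unique customer id
        public string Name { get; set; }
        public string Email { get; set; }

        // Sort order for the merge below: by customer id.
        public int CompareTo(MyRecord other) => Id.CompareTo(other.Id);
    }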

    Load the files into separate lists, call them NewList and OldList:

    List<MyRecord> NewList = LoadFile("newFilename");
    List<MyRecord> OldList = LoadFile("oldFilename");
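
    LoadFile itself is left to the reader. A naive sketch, assuming a header row, no quoted commas, and the three hypothetical columns above (a real implementation would use a CSV library such as CsvHelper to handle quoting correctly):

    static List<MyRecord> LoadFile(string path) =>
        File.ReadLines(path)
            .Skip(1)                          // skip the header row
            .Select(line => line.Split(','))
            .Select(f => new MyRecord { Id = int.Parse(f[0]), Name = f[1], Email = f[2] })
            .ToList();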
    

    There's perhaps a more elegant way to do this with LINQ, but the idea is to do a straight merge. First you have to sort the two lists: either your MyRecord class implements IComparable<MyRecord>, or you supply your own comparison delegate:

    NewList.Sort(/* delegate here */);
    OldList.Sort(/* delegate here */);
    

    You can skip the delegate if MyRecord implements IComparable<MyRecord>, as the sketch above does.
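
    If it doesn't, a comparison delegate keyed on the (assumed) Id property is a one-liner:

    NewList.Sort((a, b) => a.Id.CompareTo(b.Id));
    OldList.Sort((a, b) => a.Id.CompareTo(b.Id));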

    Now it's a straight merge.

    int ixNew = 0;
    int ixOld = 0;
    while (ixNew < NewList.Count && ixOld < OldList.Count)
    {
        // Again with the comparison delegate.
        // I'll assume that MyRecord implements IComparable
        int cmpRslt = OldList[ixOld].CompareTo(NewList[ixNew]);
        if (cmpRslt == 0)
        {
            // records have the same customer id.
            // compare for changes.
            ++ixNew;
            ++ixOld;
        }
        else if (cmpRslt < 0)
        {
            // this old record is not in the new file.  It's been deleted.
            ++ixOld;
        }
        else
        {
            // this new record is not in the old file.  It was added.
            ++ixNew;
        }
    }
    
    // At this point, one of the lists might still have items.
    while (ixNew < NewList.Count)
    {
        // NewList[ixNew] is an added record
        ++ixNew;
    }
    
    while (ixOld < OldList.Count)
    {
        // OldList[ixOld] is a deleted record
        ++ixOld;
    }
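
    The "compare for changes" step in the matching branch is where the changed fields actually come out. A sketch of a field-by-field comparison, using the hypothetical Name and Email columns from above:

    // Returns the names of the fields that differ between two records
    // that share the same Id. Extend with one check per real column.
    static List<string> ChangedFields(MyRecord oldRec, MyRecord newRec)
    {
        var changed = new List<string>();
        if (oldRec.Name != newRec.Name) changed.Add("Name");
        if (oldRec.Email != newRec.Email) changed.Add("Email");
        return changed;
    }

    Inside the cmpRslt == 0 branch, call ChangedFields(OldList[ixOld], NewList[ixNew]) and report the record whenever the result is non-empty.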
    

    With just 120,000 records, that should execute very quickly. I would be very surprised if doing the merge took as long as loading the data from disk.

    EDIT: A LINQ solution

    I pondered how one would do this with LINQ. I can't do exactly the same thing as the merge above, but I can get the added, removed, and changed items in separate collections.
    For this to work, MyRecord will have to implement IEquatable<MyRecord> and also override GetHashCode.
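
    A sketch of that, keying equality on the assumed Id property so that Except classifies records purely by customer id (field-level changes are handled separately below):

    public class MyRecord : IComparable<MyRecord>, IEquatable<MyRecord>
    {
        public int Id { get; set; }
        // ... other fields as before ...

        public int CompareTo(MyRecord other) => Id.CompareTo(other.Id);
        public bool Equals(MyRecord other) => other != null && Id == other.Id;
        public override bool Equals(object obj) => Equals(obj as MyRecord);
        public override int GetHashCode() => Id.GetHashCode();
    }

    With that in place: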

    var AddedItems = NewList.Except(OldList);
    var RemovedItems = OldList.Except(NewList);
    
    var OldListLookup = OldList.ToLookup(t => t.Id);
    var ItemsInBothLists =
        from newThing in NewList
        let oldThing = OldListLookup[newThing.Id].FirstOrDefault()
        where oldThing != null
        select new { oldThing = oldThing, newThing = newThing };
    

    In the above, I assume that MyRecord has an Id property that is unique.

    If you want just the changed items instead of all the items that are in both lists:

    var ChangedItems =
        from newThing in NewList
        let oldThing = OldListLookup[newThing.Id].FirstOrDefault()
        where oldThing != null && CompareItems(oldThing, newThing) != 0
        select new { oldThing = oldThing, newThing = newThing };
    

    The assumption is that the CompareItems method will do a deep comparison of the two items and return 0 if they compare equal or non-zero if something has changed.
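
    A minimal CompareItems along those lines, again using the hypothetical Name and Email columns:

    // Deep comparison: 0 if every field matches, non-zero otherwise.
    static int CompareItems(MyRecord oldThing, MyRecord newThing)
    {
        if (oldThing.Name != newThing.Name) return 1;
        if (oldThing.Email != newThing.Email) return 1;
        return 0;
    }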
