Efficiently identify changed fields in CSV files using C#

Backend · Open · 5 answers · 1700 views

梦毁少年i · 2020-12-11 07:09

This turned out to be more difficult than I thought. Basically, each day a snapshot of a customer master list is dumped by a system into a CSV file. It contains about 120,000 records, and I need to compare each day's snapshot against the previous day's to identify which fields changed for each customer.

5 Answers
  •  渐次进展 · 2020-12-11 07:50

    For the purposes of the discussion below, I'll assume that you have some way of reading the CSV files into a class. I'll call that class MyRecord.
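
    As a concrete starting point, here is a minimal sketch of what MyRecord might look like. The Id, Name, and Email properties are assumptions for illustration, not columns from the question:

    public class MyRecord : IComparable<MyRecord>
    {
        // Hypothetical columns; substitute the real fields from the CSV.
        public int Id { get; set; }       // unique customer id
        public string Name { get; set; }
        public string Email { get; set; }

        // Sort order for the merge below: by customer id.
        public int CompareTo(MyRecord other) => Id.CompareTo(other.Id);
    }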

    Load the files into separate lists, call them NewList and OldList:

    List<MyRecord> NewList = LoadFile("newFilename");
    List<MyRecord> OldList = LoadFile("oldFilename");
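
    LoadFile itself is left to the reader. A naive sketch, assuming a header row, no quoted commas, and the three hypothetical columns above (a real implementation would use a CSV library such as CsvHelper to handle quoting correctly):

    static List<MyRecord> LoadFile(string path) =>
        File.ReadLines(path)
            .Skip(1)                          // skip the header row
            .Select(line => line.Split(','))
            .Select(f => new MyRecord { Id = int.Parse(f[0]), Name = f[1], Email = f[2] })
            .ToList();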
    

    There's perhaps a more elegant way to do this with LINQ, but the idea is to do a straight merge. First you have to sort the two lists: either your MyRecord class implements IComparable<MyRecord>, or you supply your own comparison delegate:

    NewList.Sort(/* delegate here */);
    OldList.Sort(/* delegate here */);
    

    You can skip the delegate if MyRecord implements IComparable<MyRecord>, as the sketch above does.
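
    If it doesn't, a comparison delegate keyed on the (assumed) Id property is a one-liner:

    NewList.Sort((a, b) => a.Id.CompareTo(b.Id));
    OldList.Sort((a, b) => a.Id.CompareTo(b.Id));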

    Now it's a straight merge.

    int ixNew = 0;
    int ixOld = 0;
    while (ixNew < NewList.Count && ixOld < OldList.Count)
    {
        // Again with the comparison delegate.
        // I'll assume that MyRecord implements IComparable
        int cmpRslt = OldList[ixOld].CompareTo(NewList[ixNew]);
        if (cmpRslt == 0)
        {
            // records have the same customer id.
            // compare for changes.
            ++ixNew;
            ++ixOld;
        }
        else if (cmpRslt < 0)
        {
            // this old record is not in the new file.  It's been deleted.
            ++ixOld;
        }
        else
        {
            // this new record is not in the old file.  It was added.
            ++ixNew;
        }
    }
    
    // At this point, one of the lists might still have items.
    while (ixNew < NewList.Count)
    {
        // NewList[ixNew] is an added record
        ++ixNew;
    }
    
    while (ixOld < OldList.Count)
    {
        // OldList[ixOld] is a deleted record
        ++ixOld;
    }
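
    The "compare for changes" step in the matching branch is where the changed fields actually come out. A sketch of a field-by-field comparison, using the hypothetical Name and Email columns from above:

    // Returns the names of the fields that differ between two records
    // that share the same Id. Extend with one check per real column.
    static List<string> ChangedFields(MyRecord oldRec, MyRecord newRec)
    {
        var changed = new List<string>();
        if (oldRec.Name != newRec.Name) changed.Add("Name");
        if (oldRec.Email != newRec.Email) changed.Add("Email");
        return changed;
    }

    Inside the cmpRslt == 0 branch, call ChangedFields(OldList[ixOld], NewList[ixNew]) and report the record whenever the result is non-empty.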
    

    With just 120,000 records, that should execute very quickly. I would be very surprised if doing the merge took as long as loading the data from disk.

    EDIT: A LINQ solution

    I pondered how one would do this with LINQ. I can't do exactly the same thing as the merge above, but I can get the added, removed, and changed items in separate collections.
    For this to work, MyRecord will have to implement IEquatable<MyRecord> and also override GetHashCode.
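
    A sketch of that, keying equality on the assumed Id property so that Except classifies records purely by customer id (field-level changes are handled separately below):

    public class MyRecord : IComparable<MyRecord>, IEquatable<MyRecord>
    {
        public int Id { get; set; }
        // ... other fields as before ...

        public int CompareTo(MyRecord other) => Id.CompareTo(other.Id);
        public bool Equals(MyRecord other) => other != null && Id == other.Id;
        public override bool Equals(object obj) => Equals(obj as MyRecord);
        public override int GetHashCode() => Id.GetHashCode();
    }

    With that in place: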

    var AddedItems = NewList.Except(OldList);
    var RemovedItems = OldList.Except(NewList);
    
    var OldListLookup = OldList.ToLookup(t => t.Id);
    var ItemsInBothLists =
        from newThing in NewList
        let oldThing = OldListLookup[newThing.Id].FirstOrDefault()
        where oldThing != null
        select new { oldThing = oldThing, newThing = newThing };
    

    In the above, I assume that MyRecord has an Id property that is unique.

    If you want just the changed items instead of all the items that are in both lists:

    var ChangedItems =
        from newThing in NewList
        let oldThing = OldListLookup[newThing.Id].FirstOrDefault()
        where oldThing != null && CompareItems(oldThing, newThing) != 0
        select new { oldThing = oldThing, newThing = newThing };
    

    The assumption is that the CompareItems method will do a deep comparison of the two items and return 0 if they compare equal or non-zero if something has changed.
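
    A minimal CompareItems along those lines, again using the hypothetical Name and Email columns:

    // Deep comparison: 0 if every field matches, non-zero otherwise.
    static int CompareItems(MyRecord oldThing, MyRecord newThing)
    {
        if (oldThing.Name != newThing.Name) return 1;
        if (oldThing.Email != newThing.Email) return 1;
        return 0;
    }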
