This turned out to be more difficult than I thought. Basically, each day a system dumps a snapshot of a customer master list to a CSV file. It contains about 12000
Where are you exporting that CSV from?
Is your original source a database? If so, why can't you run your query against the database? It will be much more performant than any LINQ implementation.
This may be best accomplished in a database rather than in code: create two tables, current and old, import the data from the CSV files into the proper tables, and use a combination of SQL queries to generate the output.
The others have already provided good answers; I'm just going to offer something different for your consideration.
The pseudocode:
Read 1,000 records from each source.
Compare the records.
If a record has changed, store it in the list of changed records.
If it has not changed, discard it from the list.
If its counterpart does not exist yet, keep it in the list.
Repeat until all records are exhausted.
This approach assumes that the records are not sorted.
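A rough C# sketch of that idea follows. The file names, the assumption that the customer key is the first comma-separated column, and the ReadBatch helper are all mine, so treat it as a sketch to adapt rather than a drop-in solution (it needs System.Collections.Generic, System.IO and System.Linq):
// Reads up to 'count' records from a CSV file, keyed on the first column.
static IEnumerable<KeyValuePair<string, string>> ReadBatch(StreamReader reader, int count)
{
    for (int i = 0; i < count && !reader.EndOfStream; i++)
    {
        string line = reader.ReadLine();
        yield return new KeyValuePair<string, string>(line.Split(',')[0], line);
    }
}

static void CompareDumps()
{
    var pendingOld = new Dictionary<string, string>();
    var pendingNew = new Dictionary<string, string>();
    var changedKeys = new List<string>();

    using (var oldReader = new StreamReader("old.csv"))
    using (var newReader = new StreamReader("new.csv"))
    {
        while (!oldReader.EndOfStream || !newReader.EndOfStream)
        {
            // Read 1,000 records from each source.
            foreach (var pair in ReadBatch(oldReader, 1000)) pendingOld[pair.Key] = pair.Value;
            foreach (var pair in ReadBatch(newReader, 1000)) pendingNew[pair.Key] = pair.Value;

            // Compare every key now present on both sides, then discard it.
            foreach (var key in pendingOld.Keys.Intersect(pendingNew.Keys).ToList())
            {
                if (pendingOld[key] != pendingNew[key])
                {
                    changedKeys.Add(key); // changed record
                }
                pendingOld.Remove(key);
                pendingNew.Remove(key);
            }
        }
    }

    // Whatever is left in pendingNew was added; whatever is left in pendingOld was deleted.
}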
An alternative would be to:
Read all the records and determine what the first characters are.
Then, for each character:
Read and find the records starting with that character.
Perform the comparisons as necessary.
An improvement over the above would be to write a new, smaller file once the records already handled exceed a certain threshold, e.g.:
Read all the records and determine what the first characters are and how many times each occurs.
Sort the characters by the highest number of occurrences.
Then, for each character:
Read and find the records starting with that character.
If the number of occurrences exceeds a certain limit, write the records that don't start with that character into a new file. // this reduces the amount of data that must be read from file
Perform the comparisons as necessary.
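And a sketch of the group-by-first-character idea, again assuming the customer key is the first CSV column and that old.csv/new.csv are the two dumps (the temp-file optimisation from the last variant is left out for brevity):
// Collect every distinct first character across both files.
var firstChars = File.ReadLines("old.csv")
    .Concat(File.ReadLines("new.csv"))
    .Where(l => l.Length > 0)
    .Select(l => l[0])
    .Distinct()
    .ToList();

foreach (char c in firstChars)
{
    // Only the records whose key starts with this character are in memory at once.
    var oldSubset = File.ReadLines("old.csv")
        .Where(l => l.Length > 0 && l[0] == c)
        .ToDictionary(l => l.Split(',')[0]);
    var newSubset = File.ReadLines("new.csv")
        .Where(l => l.Length > 0 && l[0] == c)
        .ToDictionary(l => l.Split(',')[0]);

    foreach (var pair in newSubset)
    {
        string oldLine;
        if (!oldSubset.TryGetValue(pair.Key, out oldLine))
        {
            // added record
        }
        else if (oldLine != pair.Value)
        {
            // changed record
        }
    }

    foreach (var key in oldSubset.Keys.Except(newSubset.Keys))
    {
        // deleted record
    }
}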
Extending Jim's answer, a basic example:
public class MyRecord
{
    public MyRecord(int id)
    {
        Id = id;
        Fields = new int[60];
    }

    public int Id;
    public int[] Fields;
}
Then test code:
var recordsOld = new List<MyRecord>();
var recordsNew = new List<MyRecord>();
for (int i = 0; i < 120000; i++)
{
    recordsOld.Add(new MyRecord(i));
    recordsNew.Add(new MyRecord(i));
}

var watch = new System.Diagnostics.Stopwatch();
int j = 0;
watch.Start();
for (int i = 0; i < recordsOld.Count; i++)
{
    // Walk forward through the new list until the ids line up
    // (assumes both lists are sorted by Id).
    while (j < recordsNew.Count && recordsOld[i].Id != recordsNew[j].Id)
    {
        j++;
    }
    if (j == recordsNew.Count)
    {
        break; // ran out of new records
    }
    // Field-by-field comparison of the matched pair.
    for (int k = 0; k < recordsOld[i].Fields.Length; k++)
    {
        if (recordsOld[i].Fields[k] != recordsNew[j].Fields[k])
        {
            // do your stuff here
        }
    }
}
watch.Stop();
string time = watch.Elapsed.ToString();
That takes about 200 ms to run, assuming the lists are in order. Now, I'm sure that code has heaps of bugs, but in the most basic sense it doesn't take the processor long to do millions of iterations. Either you have some complex comparison checks, or some of your code is terribly inefficient.
For the purposes of the discussion below, I'll assume that you have some way of reading the CSV files into a class. I'll call that class MyRecord.
Load the files into separate lists, call them NewList and OldList:
List<MyRecord> NewList = LoadFile("newFilename");
List<MyRecord> OldList = LoadFile("oldFilename");
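For completeness, here's a minimal sketch of what LoadFile might look like, assuming a plain comma-separated layout with the customer id in the first column and borrowing the MyRecord shape (an Id plus an int[] Fields array) from the earlier answer. A real loader would also have to cope with headers, quoting, and bad rows:
static List<MyRecord> LoadFile(string fileName)
{
    var records = new List<MyRecord>();
    foreach (string line in File.ReadLines(fileName))
    {
        string[] parts = line.Split(',');
        var record = new MyRecord(int.Parse(parts[0]));
        // Copy the remaining columns into the record's fields.
        for (int i = 1; i < parts.Length && i - 1 < record.Fields.Length; i++)
        {
            record.Fields[i - 1] = int.Parse(parts[i]);
        }
        records.Add(record);
    }
    return records;
}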
There's perhaps a more elegant way to do this with LINQ, but the idea is to do a straight merge. First you have to sort the two lists. Either your MyRecord class implements IComparable, or you supply your own comparison delegate:
NewList.Sort(/* delegate here */);
OldList.Sort(/* delegate here */);
You can skip the delegate if MyRecord implements IComparable.
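For example, if the unique customer key is an integer Id member (an assumption about your class), the delegate can be a one-liner:
NewList.Sort((a, b) => a.Id.CompareTo(b.Id));
OldList.Sort((a, b) => a.Id.CompareTo(b.Id));
Since the merge below calls CompareTo directly, implementing IComparable<MyRecord> once (comparing by Id) and calling Sort() with no arguments is the tidier option.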
Now it's a straight merge.
int ixNew = 0;
int ixOld = 0;
while (ixNew < NewList.Count && ixOld < OldList.Count)
{
    // Again with the comparison delegate.
    // I'll assume that MyRecord implements IComparable.
    int cmpRslt = OldList[ixOld].CompareTo(NewList[ixNew]);
    if (cmpRslt == 0)
    {
        // records have the same customer id.
        // compare for changes.
        ++ixNew;
        ++ixOld;
    }
    else if (cmpRslt < 0)
    {
        // this old record is not in the new file. It's been deleted.
        ++ixOld;
    }
    else
    {
        // this new record is not in the old file. It was added.
        ++ixNew;
    }
}
// At this point, one of the lists might still have items.
while (ixNew < NewList.Count)
{
    // NewList[ixNew] is an added record
    ++ixNew;
}
while (ixOld < OldList.Count)
{
    // OldList[ixOld] is a deleted record
    ++ixOld;
}
With just 120,000 records, that should execute very quickly. I would be very surprised if doing the merge took as long as loading the data from disk.
EDIT: A LINQ solution
I pondered how one would do this with LINQ. I can't do exactly the same thing as the merge above, but I can get the added, removed, and changed items in separate collections.
For this to work, MyRecord will have to implement IEquatable<MyRecord> and also override GetHashCode.
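One reasonable sketch of that, keying equality on the unique Id the rest of this answer relies on (the Id property is an assumption about your class):
public class MyRecord : IEquatable<MyRecord>
{
    public int Id { get; set; }
    // ... other fields ...

    public bool Equals(MyRecord other)
    {
        return other != null && Id == other.Id;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as MyRecord);
    }

    public override int GetHashCode()
    {
        return Id.GetHashCode();
    }
}
With Id-only equality, Except reports the ids that were added or removed; field-level changes are picked up separately by the ChangedItems query below.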
var AddedItems = NewList.Except(OldList);
var RemovedItems = OldList.Except(NewList);
var OldListLookup = OldList.ToLookup(t => t.Id);
var ItemsInBothLists =
    from newThing in NewList
    let oldThing = OldListLookup[newThing.Id].FirstOrDefault()
    where oldThing != null
    select new { oldThing = oldThing, newThing = newThing };
In the above, I assume that MyRecord has an Id property that is unique.
If you want just the changed items instead of all the items that are in both lists:
var ChangedItems =
    from newThing in NewList
    let oldThing = OldListLookup[newThing.Id].FirstOrDefault()
    where oldThing != null && CompareItems(oldThing, newThing) != 0
    select new { oldThing = oldThing, newThing = newThing };
The assumption is that the CompareItems method will do a deep comparison of the two items and return 0 if they compare equal or non-zero if something has changed.
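A minimal CompareItems sketch, assuming the MyRecord shape from the earlier example (an Id plus an int[] Fields array); adapt it to whatever columns your records actually have:
static int CompareItems(MyRecord oldThing, MyRecord newThing)
{
    // Deep comparison: any differing field means the record has changed.
    for (int i = 0; i < oldThing.Fields.Length; i++)
    {
        if (oldThing.Fields[i] != newThing.Fields[i])
        {
            return 1;
        }
    }
    return 0;
}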