Efficiently identify changed fields in CSV files using C#

梦毁少年i 2020-12-11 07:09

This turned out to be more difficult than I thought. Basically, each day a snapshot of a customer master list is being dumped by a system into CSV. It contains about 12000

5 answers
  •  攒了一身酷
    2020-12-11 07:43

    The others have already provided good answers; I'm just going to offer something different for your consideration.

    The pseudocode:

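    Read 1000 records from each source.
    Compare the records that appear in both sources.
    If a record has changed, store it in the list of changed records.
    If a record has not changed, discard it from the list.
    If a record has no match yet, keep it in the list for later batches.
    Repeat until all records are exhausted.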
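
The batching steps above could be sketched in C# roughly as follows. This is only a minimal sketch under assumptions that are not in the question: the first column is taken to be a unique record key, fields are split on plain commas (no quoted commas), and the names `BatchedCsvDiff`, `KeyOf`, `ReadBatch`, and `Match` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class BatchedCsvDiff
{
    // Assumption: the first comma-separated column is a unique key.
    static string KeyOf(string line) => line.Split(',')[0];

    // Reads up to 'size' non-empty lines; an empty result means end of input.
    static List<string> ReadBatch(TextReader reader, int size)
    {
        var batch = new List<string>(size);
        string line;
        while (batch.Count < size && (line = reader.ReadLine()) != null)
            if (line.Length > 0) batch.Add(line);
        return batch;
    }

    // If the other side already holds this key, compare and discard both;
    // otherwise keep the record pending until its counterpart shows up.
    static void Match(string line, Dictionary<string, string> own,
                      Dictionary<string, string> other, List<string> changed)
    {
        string key = KeyOf(line);
        if (other.TryGetValue(key, out string counterpart))
        {
            other.Remove(key);
            if (counterpart != line) changed.Add(key);
        }
        else
        {
            own[key] = line;
        }
    }

    public static (List<string> Changed, List<string> Removed, List<string> Added)
        Compare(TextReader oldSource, TextReader newSource, int batchSize = 1000)
    {
        var pendingOld = new Dictionary<string, string>();
        var pendingNew = new Dictionary<string, string>();
        var changed = new List<string>();

        while (true)
        {
            var oldBatch = ReadBatch(oldSource, batchSize);
            var newBatch = ReadBatch(newSource, batchSize);
            if (oldBatch.Count == 0 && newBatch.Count == 0) break;

            foreach (string line in oldBatch) Match(line, pendingOld, pendingNew, changed);
            foreach (string line in newBatch) Match(line, pendingNew, pendingOld, changed);
        }

        // Whatever is still pending never found a counterpart.
        return (changed, pendingOld.Keys.ToList(), pendingNew.Keys.ToList());
    }
}
```

Because unmatched records stay in memory until their counterpart is read, the memory saving depends on how close together matching records sit in the two files; in the worst case nearly everything ends up pending.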
    

    This code assumes that the records are not sorted.

    An alternative would be to:

    Read all the records and determine the set of first characters.
    Then for each character,
        Read and find the records starting with that character.
        Perform the comparison as necessary.
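
One possible reading of this partition-by-first-character idea in C# is sketched below. It assumes (this is not stated in the question) that the first column is a unique key, so a record and its counterpart always share the same first character, and that fields contain no quoted commas. `PartitionedCsvDiff`, `LoadPartition`, and `DiffFields` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class PartitionedCsvDiff
{
    // Assumption: the first comma-separated column is a unique key.
    static string KeyOf(string line) => line.Split(',')[0];

    // Loads only the records whose line starts with 'first'.
    // ToDictionary throws if the key assumption is violated.
    static Dictionary<string, string> LoadPartition(string path, char first) =>
        File.ReadLines(path)
            .Where(l => l.Length > 0 && l[0] == first)
            .ToDictionary(l => KeyOf(l));

    // Returns the zero-based indexes of the fields that differ.
    static List<int> DiffFields(string oldLine, string newLine)
    {
        string[] a = oldLine.Split(','), b = newLine.Split(',');
        var diff = new List<int>();
        for (int i = 0; i < Math.Max(a.Length, b.Length); i++)
            if (i >= a.Length || i >= b.Length || a[i] != b[i]) diff.Add(i);
        return diff;
    }

    // Maps each changed key to the indexes of its changed fields.
    public static Dictionary<string, List<int>> Compare(string oldPath, string newPath)
    {
        // Pass 1: find every first character that occurs in either file.
        var firstChars = new SortedSet<char>(
            File.ReadLines(oldPath).Concat(File.ReadLines(newPath))
                .Where(l => l.Length > 0)
                .Select(l => l[0]));

        var changed = new Dictionary<string, List<int>>();
        foreach (char c in firstChars)
        {
            // Only one partition per file is held in memory at a time.
            var oldPart = LoadPartition(oldPath, c);
            var newPart = LoadPartition(newPath, c);
            foreach (var pair in newPart)
            {
                if (oldPart.TryGetValue(pair.Key, out string oldLine) && oldLine != pair.Value)
                    changed[pair.Key] = DiffFields(oldLine, pair.Value);
            }
        }
        return changed;
    }
}
```

The trade-off is that each file is re-read once per distinct first character, so memory use drops at the cost of extra I/O.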
    

    An improvement over the above would be to write the remaining records to a new file once the number of records already used exceeds a certain threshold, e.g.:

    Read all the records and determine the set of first characters and the number of occurrences of each.
    Sort the characters by number of occurrences, highest first.
    Then for each character,
        Read and find the records starting with that character.
        If the number of occurrences exceeds a certain limit, write the records that don't start with the character into a new file. // this reduces the amount of data that must be read from the file
        Perform the comparison as necessary.
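
A rough C# sketch of the shrinking-file variant for a single input file is shown here; the comparison step is left to the caller. The names `ShrinkingPartitionReader`, `CountFirstChars`, `Partitions`, and the `limit` parameter are hypothetical, and the temporary files are only cleaned up if the caller enumerates the sequence to the end.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ShrinkingPartitionReader
{
    // Counts how often each first character occurs, most frequent first.
    public static List<KeyValuePair<char, int>> CountFirstChars(string path) =>
        File.ReadLines(path)
            .Where(l => l.Length > 0)
            .GroupBy(l => l[0])
            .Select(g => new KeyValuePair<char, int>(g.Key, g.Count()))
            .OrderByDescending(p => p.Value)
            .ToList();

    // Yields one partition per first character. When a partition is larger
    // than 'limit', the remaining records are copied to a temporary file so
    // that later passes read less data. The original file is never modified.
    public static IEnumerable<(char First, List<string> Records)> Partitions(string path, int limit)
    {
        string current = path;
        foreach (var entry in CountFirstChars(path))
        {
            var matches = new List<string>();
            string shrunk = entry.Value > limit ? Path.GetTempFileName() : null;

            using (var writer = shrunk == null ? null : new StreamWriter(shrunk))
            {
                foreach (string line in File.ReadLines(current))
                {
                    if (line.Length == 0) continue;
                    if (line[0] == entry.Key) matches.Add(line);
                    else writer?.WriteLine(line); // no-op when not shrinking
                }
            }

            if (shrunk != null)
            {
                if (current != path) File.Delete(current); // drop the previous temp file
                current = shrunk;
            }
            yield return (entry.Key, matches);
        }
        if (current != path) File.Delete(current);
    }
}
```

To compare two snapshots this way, both files would need to be walked in the same character order; the sketch derives the order from one file only, so that coordination is left open.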
    
