C# Processing Fixed Width Files - Solution Not Working

Submitted by 时光怂恿深爱的人放手 on 2019-12-11 05:15:45

Question


I have implemented Cuong's solution here: C# Processing Fixed Width Files

Here is my code:

        var lines = File.ReadAllLines(@fileFull);
        var widthList = lines.First().GroupBy(c => c)
        .Select(g => g.Count())
        .ToList();

        var list = new List<KeyValuePair<int, int>>();

        int startIndex = 0;

        for (int i = 0; i < widthList.Count(); i++)
        {
            var pair = new KeyValuePair<int, int>(startIndex, widthList[i]);
            list.Add(pair);

            startIndex += widthList[i];
        }

        var csvLines = lines.Select(line => string.Join(",",
        list.Select(pair => line.Substring(pair.Key, pair.Value))));

        File.WriteAllLines(filePath + "\\" + fileName + ".csv", csvLines);

fileFull is the full path and name of the input file.

The issue I have is the first line of the input file also contains digits. So it could be AAAAAABBC111111111DD2EEEEEE etc. For some reason the output from Cuong's code gives me CSV headings like 1111RRRR and 222223333.

Does anyone know why this is and how I would fix it?


Header row example:

AAAAAAAAAAAAAAAABBBBBBBBBBCCCCCCCCDEFCCCCCCCCCGGGGGGGGHHHHHHHHIJJJJJJJJKKKKLLLLMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOPPPPQQQQ1111RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR222222222333333333444444444555555555666666666777777777888888888999999999S00001111TTTTTTTTTTTTUVWXYZ!"£$$$$$$%&  

Converted header row:

AAAAAAAAAAAAAAAA    BBBBBBBBBB  CCCCCCCCDEFCCCCCC   C   C   C   GGGGGGGG    HHHHHHHH    I   JJJJJJJJ    KKKK    LLLL    MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN  OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO  PPPP    QQQQ    1111RRRR    RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR2222    222223333   333334444   444445555   555556666   666667777   777778888   888889999   99999S000   0   1111    TTTTTTTTTTTT    U   V   W   X   Y   Z   !   ",�,$$$$$$,%,&,"  

Jodrell - I implemented your suggestion but the header output is like:

BBBBBBBBBBCCCCCC    CCCCCCCCD   DEFCCCC             GGGGGGGG    HHHHHHH IJJJJJJ     KKKKLLL LLL MMM NNNNNNNNNNNNNNNNNNNNNNNNNNNNN   OOOOOOOOOOOOOOOOOOOOOOOOOOOOO   PPPPQQQQ1111RRRRRRRRRRRRRRRRR   QQQ 111 RRR 33333333    44444444    55555555    66666666    77777777    88888888    99999999    S0000111        111 TTT UVWXYZ!"�$$                                       %&

Answer 1:


As Jodrell already mentioned, your code doesn't work because it assumes that the character representing each column in the header is distinct. Changing the code that parses the header widths so that it counts runs of adjacent identical characters fixes it.

Replace:

var widthList = lines.First().GroupBy(c => c)
.Select(g => g.Count())
.ToList();

With:

var widthList = new List<int>(); 
var header = lines.First().ToArray(); 
for (int i = 0; i < header.Length; i++) 
{ 
    if (i == 0 || header[i] != header[i-1]) 
        widthList.Add(0); 
    widthList[widthList.Count-1]++; 
}

Parsed header columns:

AAAAAAAAAAAAAAAA    BBBBBBBBBB  CCCCCCCC    D   E   F   CCCCCCCCC   GGGGGGGG    HHHHHHHH    I   JJJJJJJJ    KKKK    LLLL    MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN  OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO  PPPP    QQQQ    1111    RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR    222222222   333333333   444444444   555555555   666666666   777777777   888888888   999999999   S   0000    1111    TTTTTTTTTTTT    U   V   W   X   Y   Z   !   "   £   $$$$$$  %   &
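
To see the effect on a shorter header, here is a minimal, self-contained sketch of the same run-length idea wrapped in a method. The names `HeaderWidthsSketch` and `RunLengths` and the sample header are made up for illustration and are not part of the original code.

```csharp
using System.Collections.Generic;

public static class HeaderWidthsSketch
{
    // Returns the length of each run of identical adjacent characters.
    // Assumption: every column in the header is one such run.
    public static List<int> RunLengths(string header)
    {
        var widths = new List<int>();
        for (int i = 0; i < header.Length; i++)
        {
            // Start a new column whenever the character changes.
            if (i == 0 || header[i] != header[i - 1])
                widths.Add(0);
            widths[widths.Count - 1]++;
        }
        return widths;
    }
}
```

For the hypothetical header "AAB11CC11", `HeaderWidthsSketch.RunLengths` returns the widths 2, 1, 2, 2, 2, so the two separate blocks of 1s stay separate columns.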



Answer 2:


EDIT

Because the problem annoyed me I wrote some code that handles " and ,. This code replaces the header row with comma delimited alternating zeros and ones. Any commas or double quotes in the body are appropriately escaped.

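// Requires: using System; using System.Collections.Generic; using System.IO;
// using System.Linq; using System.Text;
static void FixedToCsv(string sourceFile)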
{
    if (sourceFile == null)
    {
        throw new ArgumentNullException(nameof(sourceFile));
    }

    var dir = Path.GetDirectoryName(sourceFile);
    var destFile = string.Format(
        "{0}{1}",
        Path.GetFileNameWithoutExtension(sourceFile),
        ".csv");

    if (dir != null)
    {
        destFile = Path.Combine(dir, destFile);
    }

    if (File.Exists(destFile))
    {
        throw new IOException("Destination file already exists: " + destFile);
    }

    var blocks = new List<KeyValuePair<int, int>>();
    using (var output = File.OpenWrite(destFile))
    {
        using (var input = File.OpenText(sourceFile))
        {
            var outputLine = new StringBuilder();

            // Make header
            var header = input.ReadLine();

            // An empty header would make header.First() throw below.
            if (string.IsNullOrEmpty(header))
            {
                return;
            }

            var even = false;
            var lastc = header.First();
            var counter = 0;
            var blockCounter = 0;
            foreach(var c in header)
            {
                counter++;
                if (c == lastc)
                {
                    blockCounter++;
                }
                else
                {
                    blocks.Add(new KeyValuePair<int, int>(
                        counter - blockCounter - 1,
                        blockCounter));
                    blockCounter = 1;
                    outputLine.Append(',');
                    even = !even;
                }

                outputLine.Append(even ? '1' : '0');

                lastc = c;
            }

            blocks.Add(new KeyValuePair<int, int>(
                counter - blockCounter,
                blockCounter));

            outputLine.AppendLine();
            var lineBytes = Encoding.UTF8.GetBytes(outputLine.ToString());
            outputLine.Clear();
            output.Write(lineBytes, 0, lineBytes.Length);

            // Process Body
            var inputLine = input.ReadLine();
            while (inputLine != null)
            {
                foreach(var block in blocks.Select(b =>
                    inputLine.Substring(b.Key, b.Value)))
                {
                    var sanitisedBlock = block;
                    if (block.Contains(',') || block.Contains('"'))
                    {
                        sanitisedBlock = string.Format(
                            "\"{0}\"",
                            block.Replace("\"", "\"\""));
                    }

                    outputLine.Append(sanitisedBlock);
                    outputLine.Append(',');
                }

                outputLine.Remove(outputLine.Length - 1, 1);
                outputLine.AppendLine();
                lineBytes = Encoding.UTF8.GetBytes(outputLine.ToString());
                outputLine.Clear();
                output.Write(lineBytes, 0, lineBytes.Length);

                inputLine = input.ReadLine();
            }
        }
    }
}
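
The escaping rule used in the body loop can also be isolated into a small helper. The following is only a sketch: `CsvEscapeSketch` and `EscapeField` are hypothetical names, and it handles only commas and double quotes, which is all the code above needs because it reads the input one line at a time.

```csharp
public static class CsvEscapeSketch
{
    // Wraps a field in double quotes if it contains a comma or a double
    // quote, doubling any embedded double quotes. Other fields are
    // returned unchanged.
    public static string EscapeField(string field)
    {
        if (field.IndexOf(',') < 0 && field.IndexOf('"') < 0)
            return field;

        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }
}
```

With this helper, a field such as `a,b` is written as `"a,b"`, and a field such as `5" nail` is written as `"5"" nail"`, while a plain field like `abc` passes through untouched.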

The character 1 is repeated in your header row, so the two blocks of four 1s get counted as a single block of eight, and everything goes wrong from there.

(There is a block of four 1s after the Qs and another block of four 1s after the 0s)

Essentially, your header row is invalid or, at least, doesn't work with the proposed solution.
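
A small sketch makes the miscount visible. It reproduces the width calculation from the question on a made-up, shortened header; `GroupByCountsSketch` and `CountsByCharacter` are illustrative names only. `GroupBy(c => c)` puts every occurrence of a character into the same group, adjacent or not, so both blocks of 1s end up in one group.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class GroupByCountsSketch
{
    // Same calculation as in the question: one count per distinct
    // character, in order of first appearance in the header.
    public static List<int> CountsByCharacter(string header)
    {
        return header.GroupBy(c => c)
                     .Select(g => g.Count())
                     .ToList();
    }
}
```

For the hypothetical header "QQQQ1111RR00001111", `CountsByCharacter` returns 4, 8, 2, 4: four columns instead of five, with the second one wrongly eight characters wide. Every column after it is then read from the wrong offset.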


Okay, you could do something like this.

public void FixedToCsv(string fullFile)
{
    var lines = File.ReadAllLines(fullFile);
    var firstLine = lines.First();

    var widths = new List<KeyValuePair<int, int>>();

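    var innerCounter = 0; // length of the current run
    var outerCounter = 0; // index of the current character
    var firstLineChars = firstLine.ToCharArray();
    var lastChar = firstLineChars[0];
    foreach(var c in firstLineChars)
    {
        if (c == lastChar)
        {
            innerCounter++;
        }
        else
        {
            // The run that just ended started innerCounter characters back.
            widths.Add(new KeyValuePair<int, int>(
                outerCounter - innerCounter,
                innerCounter));
            innerCounter = 1; // the current character opens the next run
            lastChar = c;
        }
        outerCounter++;
    }

    // Add the final run, which the loop never closes.
    widths.Add(new KeyValuePair<int, int>(
        outerCounter - innerCounter,
        innerCounter));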

    var csvLines = lines.Select(line => string.Join(",",
        widths.Select(pair => line.Substring(pair.Key, pair.Value))));

    // Write the CSV next to the source file, with the extension changed to .csv.
    File.WriteAllLines(Path.ChangeExtension(fullFile, ".csv"), csvLines);
}


Source: https://stackoverflow.com/questions/12778173/c-sharp-processing-fixed-width-files-solution-not-working
