Win32/C: Convert line endings to DOS/Windows format

Posted by 心不动则不痛 on 2019-12-25 04:52:07

Question


I have the following C function in a Windows API project. It takes a block of data loaded from a file and, depending on the line endings it finds (UNIX, Mac, DOS), converts them to the line endings Windows expects (\r\n):

// Standard C header needed for string functions
#include <string.h>

// Defines for line-ending conversion function
#define LESTATUS INT 
#define LE_NO_CHANGES_NEEDED (0)
#define LE_CHANGES_SUCCEEDED (1)
#define LE_CHANGES_FAILED   (-1)

/// <summary>
/// If the line endings in a block of data loaded from a file contain UNIX (\n) or MAC (\r) line endings, this function replaces it with DOS (\r\n) endings.
/// </summary>
/// <param name="inData">An array of bytes of input data.</param>
/// <param name="inLen">The size, in bytes, of inData.</param>
/// <param name="outData">An array of bytes to be populated with output data.  This array must already be allocated</param>
/// <param name="outLen">The maximum number of bytes that can be stored in outData.</param>
/// <param name="bytesWritten">A pointer to an integer that receives the number of bytes written into outData.</param>
/// <returns>
/// If no changes were necessary (the file already contains \r\n line endings), then the return value is LE_NO_CHANGES_NEEDED.<br/>
/// If changes were necessary, and it was possible to store the entire output buffer, the return value is LE_CHANGES_SUCCEEDED.<br/>
/// If changes were necessary but the output buffer was too small, the return value is LE_CHANGES_FAILED.<br/>
/// </returns>
LESTATUS ConvertLineEndings(BYTE* inData, INT inLen, BYTE* outData, INT outLen, INT* bytesWritten)
{
    char *posR = strstr(inData, "\r");
    char *posN = strstr(inData, "\n");
    // Case 1: the file already contains DOS/Windows line endings.
    // So, copy the input array into the output array as-is (if we can)
    // Report an error if the output array is too small to hold the input array; report success otherwise.
    if (posN != NULL && posR != NULL)
    {
        if (outLen >= inLen)
        {
            strcpy(outData, inData);
            return LE_NO_CHANGES_NEEDED;
        }
        return LE_CHANGES_FAILED;
    }
    // Case 2: the file contains UNIX line endings.
    else if (posN != NULL && posR == NULL)
    {
        int i = 0;
        int track = 0;
        for (i = 0; i < inLen; i++)
        {
            if (inData[i] != '\n')
            {
                outData[track] = inData[i];
                track++;
                if (track>outLen) return LE_CHANGES_FAILED;
            }
            else
            {
                outData[track] = '\r';
                track++;
                if (track > outLen) return LE_CHANGES_FAILED;
                outData[track] = '\n';
                track++;
                if (track > outLen) return LE_CHANGES_FAILED;
            }
            *bytesWritten = track;
        }
    }
    // Case 3: the file contains Mac-style line endings.
    else if (posN == NULL && posR != NULL)
    {
        int i = 0;
        int track = 0;
        for (i = 0; i < inLen; i++)
        {
            if (inData[i] != '\r')
            {
                outData[track] = inData[i];
                track++;
                if (track>outLen) return LE_CHANGES_FAILED;
            }
            else
            {
                outData[track] = '\r';
                track++;
                if (track > outLen) return LE_CHANGES_FAILED;
                outData[track] = '\n';
                track++;
                if (track > outLen) return LE_CHANGES_FAILED;
            }
            *bytesWritten = track;
        }
    }
    return LE_CHANGES_SUCCEEDED;
}

However, I feel like this function is very long (almost 70 lines) and could be reduced somehow. I've searched on Google but couldn't find anything useful; is there any function in either the C library or the Windows API that will allow me to perform a string-replace rather than manually searching the string byte-by-byte in O(n) time?


Answer 1:


Every character needs to be looked at exactly once, no more and no less. The very first lines of your code already make repeated comparisons, as both strstr calls start at the same position. You could have used something like

char *posR = strstr(inData, "\r");
if (posR && posR[1] == '\n')
   // Case 1: the file already contains DOS/Windows line endings.

If this check fails, you could continue from where strstr stopped (if it found an \r) or start again from the top (if posR == NULL). But in that second case strstr has already "looked at" every character up to the end!

Two additional notes:

  1. there was no need for strstr because you are looking for a single character; use strchr next time;
  2. the strXXX functions all assume your input is a properly formed C string: it should end with a terminating 0. However, you already provide the length in inLen, so you don't have to check for zeroes. If there may or may not be a 0 in your input before inLen bytes, you need to take appropriate action. Based on the purpose of this function, I'm assuming you don't need to check for zeroes at all.
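
As a rough illustration of the second note: when the length is known, the standard memchr function can stand in for strstr/strchr, because it takes an explicit byte count and does not rely on a terminating 0. The helper name first_cr_is_crlf is hypothetical and not part of the original code. Like the snippet above, it only inspects the first \r and assumes the line endings are not mixed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original post).
   Returns 1 if the first \r in data[0..len) is immediately followed
   by \n, and 0 otherwise (including when there is no \r at all). */
static int first_cr_is_crlf(const unsigned char *data, size_t len)
{
    /* memchr stops after len bytes, so embedded or missing 0 bytes
       are harmless, unlike with strstr or strchr. */
    const unsigned char *cr = memchr(data, '\r', len);

    if (cr == NULL)
        return 0;

    /* Only read cr[1] if it still lies inside the buffer */
    return (size_t)(cr - data) + 1 < len && cr[1] == '\n';
}
```

A caller could use this as a quick first test and fall back to a full conversion pass only when it returns 0.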

My proposal: look at every character from the start once, and only take action when it is either an \r or an \n. If the first of these you encounter is an \r and the next one is an \n, you're done. (This assumes the line endings are not "mixed".)

If you do not return in this first loop, there is something other than \r\n in the data, and you can continue from that point on. Even then, you only have to act on an \r or an \n. So I propose this shorter code (with an enum instead of your defines):

enum LEStatus_e { LE_CHANGES_FAILED=-1, LE_NO_CHANGES_NEEDED, LE_CHANGES_SUCCEEDED };

enum LEStatus_e ConvertLineEndings(BYTE *inData, INT inLen, BYTE *outData, INT outLen, INT *bytesWritten)
{
    INT sourceIndex = 0, destIndex;

    if (outLen < inLen)
        return LE_CHANGES_FAILED;

    /*  Find first occurrence of either \r or \n
        This will return immediately for No Change Needed */
    while (sourceIndex < inLen)
    {
        if (inData[sourceIndex] == '\r')
        {
            if (sourceIndex < inLen-1 && inData[sourceIndex+1] == '\n')
            {
                memcpy (outData, inData, inLen);
                *bytesWritten = inLen;
                return LE_NO_CHANGES_NEEDED;
            }
            break;
        }
        if (inData[sourceIndex] == '\n')
            break;
        sourceIndex++;
    }
    /* We processed this far already: */
    memcpy (outData, inData, sourceIndex);
    if (sourceIndex == inLen)
    {
        /* No line endings at all: the copy above is already complete */
        *bytesWritten = inLen;
        return LE_NO_CHANGES_NEEDED;
    }
    destIndex = sourceIndex;

    while (sourceIndex < inLen)
    {
        switch (inData[sourceIndex])
        {
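            case '\n':
            case '\r':
                sourceIndex++;
                /* Both bytes of \r\n must fit in the output buffer */
                if (destIndex+2 > outLen)
                    return LE_CHANGES_FAILED;
                outData[destIndex++] = '\r';
                outData[destIndex++] = '\n';
                break;
            default:
                /* The output can grow past inLen, so check here as well */
                if (destIndex >= outLen)
                    return LE_CHANGES_FAILED;
                outData[destIndex++] = inData[sourceIndex++];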
        }
    }
    *bytesWritten = destIndex;
    return LE_CHANGES_SUCCEEDED;
}

There are a few old and rare 'plain text' formats that use other constructions; from memory, something like \r\n\n. If you want to be able to sanitize anything, you can add a skip for all \rs after a single \n, and the same for the opposite case. This will also clean up any "mixed" line endings, as it will correctly treat \r\n as well.
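
The skip described in the paragraph above could look roughly like the sketch below. This is not the answerer's code: the function name normalize_line_endings, its parameters, and its 0/-1 return convention are assumptions made for illustration. It follows the paragraph literally, so after an \r it swallows every following \n (and after an \n every following \r), which means a sequence such as \r\n\n collapses into a single line ending.

```c
#include <stddef.h>

/* Hypothetical helper (not from the original post).
   Copies in[0..in_len) to out, turning every line ending into \r\n.
   Returns 0 and stores the output length in *out_len on success,
   or -1 if out_cap bytes are not enough. */
static int normalize_line_endings(const unsigned char *in, size_t in_len,
                                  unsigned char *out, size_t out_cap,
                                  size_t *out_len)
{
    size_t i = 0, o = 0;

    while (i < in_len)
    {
        unsigned char c = in[i++];

        if (c == '\r' || c == '\n')
        {
            /* The byte that may trail this ending:
               \n after \r, or \r after \n */
            unsigned char other = (c == '\r') ? '\n' : '\r';

            /* Skip all trailing "other" bytes */
            while (i < in_len && in[i] == other)
                i++;

            /* o never exceeds out_cap, so the subtraction is safe */
            if (out_cap - o < 2)
                return -1;
            out[o++] = '\r';
            out[o++] = '\n';
        }
        else
        {
            if (o >= out_cap)
                return -1;
            out[o++] = c;
        }
    }
    *out_len = o;
    return 0;
}
```

Because existing \r\n pairs pass through unchanged, this variant needs no separate "no change needed" scan, at the cost of always writing the output buffer.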




Answer 2:


Here's what I would consider somewhat simpler code, with about half as many lines. Of course, as Ben Voigt pointed out, you can't beat O(n) time, so I made no attempt to do so. I didn't use any library functions because it seems simpler this way, and I doubt that extra function calls would make the code faster.

enum lestatus {
  le_no_changes_needed = 0,
  le_changes_succeeded = 1,
  le_changes_failed = -1
};

enum lestatus ConvertLineEndings(char *indata, int inlen,
                                 char *outdata, int outlen)
{
  int outpos = 0, inpos;
  enum lestatus it_changed = le_no_changes_needed;
  for (inpos = 0; inpos<inlen;inpos++) {
    if (outpos + 1 > outlen) return le_changes_failed;
    if (indata[inpos] != '\r' && indata[inpos] != '\n') {
      /* it is an ordinary character, just copy it */
      outdata[outpos++] = indata[inpos];
    } else if (outpos + 2 > outlen) {
      return le_changes_failed;
    } else if (inpos + 1 < inlen
               && (indata[inpos+1] == '\r' || indata[inpos+1] == '\n')
               && indata[inpos] != indata[inpos+1]) {
      /* it is \r\n or \n\r, output it in canonical order;
         the inpos + 1 < inlen test avoids reading past the input */
      outdata[outpos++] = '\r';
      outdata[outpos++] = '\n';
      inpos++; /* skip the second character */
    } else {
      /* it is a mac or unix line ending, convert to dos */
      outdata[outpos++] = '\r';
      outdata[outpos++] = '\n';
      it_changed = le_changes_succeeded;
    }
  }
  return it_changed;
}

The biggest differences in my code are that

  1. I used the increment operator.
  2. I avoided library functions for simplicity.
  3. My function handles mixed-ending files correctly (in my interpretation).
  4. I prefer lowercase characters. This is obviously a stylistic preference.
  5. I prefer an enum over #defines. Also a stylistic preference.
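
One thing the function above leaves to the caller is choosing a large enough output buffer. Since each input byte becomes at most two output bytes, 2 * inlen is always sufficient. If an exact size is preferred, a counting pass along the following lines might work. The name converted_size is hypothetical, and the sketch assumes the same pairing rule as above (\r\n or \n\r counts as one line ending).

```c
#include <stddef.h>

/* Hypothetical helper (not from the original post).
   Returns the number of bytes the converted text would occupy when
   every line ending is written as \r\n, treating \r\n and \n\r as a
   single line ending. */
static size_t converted_size(const char *in, size_t len)
{
    size_t i, size = 0;

    for (i = 0; i < len; i++)
    {
        if (in[i] != '\r' && in[i] != '\n')
        {
            size++;             /* ordinary byte, copied as-is */
            continue;
        }
        /* Two different ending bytes in a row form one line ending */
        if (i + 1 < len
            && (in[i+1] == '\r' || in[i+1] == '\n')
            && in[i] != in[i+1])
            i++;
        size += 2;              /* every ending becomes \r\n */
    }
    return size;
}
```

This costs a second pass over the input, so allocating 2 * inlen up front is the simpler choice when memory is not a concern.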


Source: https://stackoverflow.com/questions/30540607/win32-c-convert-line-endings-to-dos-windows-format
