AWK: Recursive Descent CSV Parser

Question

In response to a Recursive Descent CSV parser in BASH, I (the original author of both posts) have made the following attempt to translate it into an AWK script, to compare the data-processing speed of the two scripting languages. The translation is not 1:1 for several reasons, but for those who are interested, this implementation is faster at string processing than the BASH one.

Originally we had a few questions that have all been quashed thanks to Jonathan Leffler. While the title says CSV, we've updated the code to DSV which means you can specify any single character as a field delimiter should you find it necessary.

This code is now ready for showdown.

Basic Features

No imposed limitations on input length, field length, or field count
Literal Quoted Fields via double quote "
ANSI C Escape Sequences as defined here in section 1.1.2^[1][2][3]
Custom Input Delimiter: The Art Of UNIX Programming(DSV)^[4]
Custom Output Delimiter^[5]
UCS-2 and UCS-4 Escape Sequences^[6]

^[1]Quoted fields are literal content, therefore no escape sequence interpretations are performed on quoted content. One can however concatenate quotes, plain text and interpreted sequences in a single field to achieve the desired effect. For example:

one,two,three:\t"Little Endians," and one Big Endian Chief

Is a three field line of CSV where the third field is equivalent to:

three:        Little Endians, and one Big Endian Chief

^[2]The examples described in the reference material as "implementation specific", or as having "undefined behavior", will not be supported, as they are not portable by definition or are too ambiguous to be reliable. If an escape sequence is not defined here or in the reference material, the backslash will be ignored and the single character that follows will be treated as a plain-text value. Integer value character escape sequences will not be supported: the method is unreliable, does not scale well across multiple platforms, and needlessly increases the complexity of parsing by requiring validation.

^[3]Octal character escapes must be in 3-digit octal format. If it is not a 3-digit octal escape sequence it is a single digit null escape sequence. Hexadecimal escape sequences must be in the 2-digit hexadecimal format. If the first two characters following the escape sequence identifier are invalid, no interpretation will take place and a message will be printed on standard error. Any remaining hexadecimal digits are ignored.

^[4]The custom input delimiter iDelimiter must be a single character. Multi-line records will not be supported, and usage of such a contradiction should always be frowned upon. Multi-line records decrease the portability of a data record, making it specific to a file whose location and origin (within that file) may be unknown. For instance, grepping a file for content may return an incomplete record because the content may begin on any previous line, limiting data acquisition to full top-down parsing of the database.

^[5]The custom output delimiter oDelimiter may be any desired string value. Script output is always terminated by a single newline. This is a feature of correct terminal application output. Otherwise, your parsed CSV output and terminal prompt would share the same line, creating a confusing situation. Also, most interpreters, like consoles, are line-based and expect a newline to signal the end of an I/O record. If you find the trailing newline undesirable, trim it off.

^[6]16-bit Unicode escape sequences are available via the following notation:

 \uHHHH Unicode character with hex value HHHH (4 digits)

and 32-bit Unicode escape sequences are supported via:

 \UHHHHHHHH Unicode character with hex value HHHHHHHH (8 digits)
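The concatenation rule from footnote 1 can be sketched with a small loop. This is not the dsv.awk listing: it is a cut-down illustration that assumes a comma delimiter and interprets only \t, and the function name split_demo is made up for the example.

```shell
# Hypothetical helper, not part of dsv.awk: splits one comma-delimited line,
# keeping quoted text literal and interpreting only \t outside quotes.
split_demo() {
    printf '%s\n' "$1" | awk '
    {
        n = 0; data = ""; inQuote = 0
        len = length($0)
        for (i = 1; i <= len; i++) {
            c = substr($0, i, 1)
            if (inQuote) {
                if (c == "\"") inQuote = 0      # closing quote, not stored
                else data = data c              # quoted content is literal
            } else if (c == "\"") {
                inQuote = 1                     # opening quote, not stored
            } else if (c == "\\") {
                i++
                c = substr($0, i, 1)
                data = data (c == "t" ? "\t" : c)   # only \t is interpreted here
            } else if (c == ",") {
                field[n++] = data; data = ""    # delimiter ends the field
            } else {
                data = data c
            }
        }
        field[n++] = data                       # last field on the line
        for (i = 0; i < n; i++) print field[i]
    }'
}

split_demo 'one,two,three:\t"Little Endians," and one Big Endian Chief'
```

The comma inside the quotes stays part of the third field, while the \t in front of the quotes becomes a real tab, so the call prints three lines.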

Special Thanks to all Members of the SO community whose experience, time and input led me to create such a wonderfully useful tool for information handling.

Code Listing: dsv.awk

#!/bin/awk -f
#
###############################################################
#
# ZERO LIABILITY OR WARRANTY LICENSE YOU MAY NOT OWN ANY
# COPYRIGHT TO THIS SOFTWARE OR DATA FORMAT IMPOSED HEREIN 
# THE AUTHOR PLACES IT IN THE PUBLIC DOMAIN FOR ALL USES 
# PUBLIC AND PRIVATE THE AUTHOR ASKS THAT YOU DO NOT REMOVE
# THE CREDIT OR LICENSE MATERIAL FROM THIS DOCUMENT.
#
###############################################################
#
# Special thanks to Jonathan Leffler, whose wisdom, and 
# knowledge defined the output logic of this script.
#
# Special thanks to GNU.org for the base conversion routines.
#
# Credits and recognition to the original Author:
# Triston J. Taylor whose countless hours of experience,
# research and rationalization have provided us with a
# more portable standard for parsing DSV records.
#
###############################################################
#
# This script accepts and parses a single line of DSV input
# from <STDIN>.
#
# Record fields are separated by the command line variable
# 'iDelimiter'; the default value is comma.
#
# Output is separated by the command line variable 'oDelimiter';
# the default value is line feed.
#
# To learn more about this tool visit StackOverflow.com:
#
# http://stackoverflow.com/questions/10578119/
#
# You will find there a wealth of information on its
# standards and development track.
#
###############################################################

function NextSymbol() {

    strIndex++;
    symbol = substr(input, strIndex, 1);

    return (strIndex < parseExtent);

}

function Accept(query) {

    #print "query: " query " symbol: " symbol
    if ( symbol == query ) {
        #print "matched!"        
        return NextSymbol();         
    }

    return 0;

}

function Expect(query) {

    # special case: empty query && symbol...
    if ( query == nothing && symbol == nothing ) return 1;

    # case: else
    if ( Accept(query) ) return 1;

    msg = "dsv parse error: expected '" query "': found '" symbol "'";
    print msg > "/dev/stderr";

    return 0;

}

function PushData() {

    field[fieldIndex++] = fieldData;
    fieldData = nothing;

}

function Quote() {

    while ( symbol != quote && symbol != nothing ) {
        fieldData = fieldData symbol;
        NextSymbol();
    }

    Expect(quote);

}

function GetOctalChar() {

    qOctalValue = substr(input, strIndex+1, 3);

    # This isn't really correct but it's the only way
    # to express 0-255. On unicode systems it won't
    # matter anyway so we don't restrict the value
    # any further than length validation.

    if ( qOctalValue ~ /^[0-7]{3}$/ ) {

        # convert octal to decimal so we can print the
        # desired character in POSIX awks...

        n = length(qOctalValue)
        ret = 0
        for (i = 1; i <= n; i++) {
            c = substr(qOctalValue, i, 1)
            if ((k = index("01234567", c)) > 0)
                k-- # adjust for 1-basing in awk
            ret = ret * 8 + k
        }

        strIndex+=3;
        return sprintf("%c", ret);

        # and people ask why posix gets me all upset..
        # Special thanks to gnu.org for this contrib..

    }

    return sprintf("\0"); # if it wasn't 3 digit octal just use zero

}

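function GetHexChar(qHexValue) {

    rHexValue = HexToDecimal(qHexValue);
    rHexLength = length(qHexValue);

    # HexToDecimal() returns the empty string for input that is not
    # entirely hexadecimal, so test the converted value, not the length.
    if ( rHexLength && rHexValue != nothing ) {

        strIndex += rHexLength;
        return sprintf("%c", rHexValue);

    }

    # accept no nonsense! (print, not printf, so a '%' in the
    # offending input is never treated as a format specifier)
    msg = "dsv parse error: expected " rHexLength "-digit hex value: found '";
    print msg qHexValue "'" > "/dev/stderr";

    return nothing;

}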

function HexToDecimal(hexValue) {

    if ( hexValue ~ /^[[:xdigit:]]+$/ ) {

        # convert hex to decimal so we can print the
        # desired character in POSIX awks...

        n = length(hexValue)
        ret = 0
        for (i = 1; i <= n; i++) {

            c = substr(hexValue, i, 1)
            c = tolower(c)

            if ((k = index("0123456789", c)) > 0)
                k-- # adjust for 1-basing in awk
            else if ((k = index("abcdef", c)) > 0)
                k += 9

            ret = ret * 16 + k
        }

        return ret;

        # and people ask why posix gets me all upset..
        # Special thanks to gnu.org for this contrib..

    }

    return nothing;

}

function BackSlash() {

    # This could be optimized with some constants.
    # but we generate the data here to assist in
    # translation to other programming languages.

    if (symbol == iDelimiter) { # separator precedes all sequences
        fieldData = fieldData symbol;
    } else if (symbol == "a") { # alert
        fieldData = sprintf("%s\a", fieldData);
    } else if (symbol == "b") { # backspace
        fieldData = sprintf("%s\b", fieldData);
    } else if (symbol == "f") { # form feed
        fieldData = sprintf("%s\f", fieldData);
    } else if (symbol == "n") { # line feed
        fieldData = sprintf("%s\n", fieldData);
    } else if (symbol == "r") { # carriage return
        fieldData = sprintf("%s\r", fieldData);
    } else if (symbol == "t") { # horizontal tab
        fieldData = sprintf("%s\t", fieldData);
    } else if (symbol == "v") { # vertical tab
        fieldData = sprintf("%s\v", fieldData);
    } else if (symbol == "0") { # null or 3-digit octal character
        fieldData = fieldData GetOctalChar();
    } else if (symbol == "x") { # 2-digit hexadecimal character 
        fieldData = fieldData GetHexChar( substr(input, strIndex+1, 2) );
    } else if (symbol == "u") { # 4-digit hexadecimal character 
        fieldData = fieldData GetHexChar( substr(input, strIndex+1, 4) );
    } else if (symbol == "U") { # 8-digit hexadecimal character 
        fieldData = fieldData GetHexChar( substr(input, strIndex+1, 8) );
    } else { # symbol didn't match the "interpreted escape scheme"
        fieldData = fieldData symbol; # just concatenate the symbol
    }

    NextSymbol();

}

function Line() {

    if ( Accept(quote) ) {
        Quote();
        Line();
    }

    if ( Accept(backslash) ) {
        BackSlash();
        Line();        
    }

    if ( Accept(iDelimiter) ) {
        PushData();
        Line();
    }

    if ( symbol != nothing ) {
        fieldData = fieldData symbol;
        NextSymbol();
        Line();
    } else if ( fieldData != nothing ) PushData();

}

BEGIN {

    # State Variables
    symbol = ""; fieldData = ""; strIndex = 0; fieldIndex = 0;

    # Output Variables
    field[fieldIndex] = "";

    # Control Variables
    parseExtent = 0;

    # Formatting Variables (optionally set on invocation line)
    if ( iDelimiter != "" ) {
        # the algorithm in place does not support multi-character delimiter
        if ( length(iDelimiter) > 1 ) { # we have a problem
            msg = "dsv parse: init error: multi-character delimiter detected:";
            printf("%s '%s'\n", msg, iDelimiter) > "/dev/stderr";
            exit 1;
        }
    } else {
        iDelimiter = ",";
    }
    if ( oDelimiter == "" ) oDelimiter = "\n";

    # Symbol Classes
    nothing = "";
    quote = "\"";
    backslash = "\\";

    getline input;

    parseExtent = (length(input) + 2);

    # parseExtent exceeds length because the loop would terminate
    # before parsing was complete otherwise.

    NextSymbol();
    Line();
    Expect(nothing);

}

END {

    if (fieldIndex) {

        fieldIndex--;

        for (i = 0; i < fieldIndex; i++)
        {
             printf("%s", field[i] oDelimiter);
        }

        print field[i];

    } 

}

How to Run The Script "Like a Pro"

# Spit out some CSV "newline" delimited:
echo 'one,two,three,AWK,CSV!' | awk -f dsv.awk

# Spit out some CSV "tab" delimited:
echo 'one,two,three,AWK,CSV!' | awk -v oDelimiter=$'\t' -f dsv.awk

# Spit out some CSV "ASCII Group Separator" delimited:
echo 'one,two,three,AWK,CSV!' | awk -v oDelimiter=$'\035' -f dsv.awk

If you need some custom output control separators but aren't sure what to use, you may consult an ASCII chart.
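To double-check which bytes a shell escape actually produces, here is a quick sketch using POSIX printf and od (the helper name byte_values is made up for this example). The ASCII Group Separator is decimal 29, which is octal 035. Octal escapes only consume the digits 0-7, so an escape written as \29 yields two bytes: the character with value 2, followed by a literal 9.

```shell
# Hypothetical helper: print the decimal value of each byte read from stdin.
byte_values() { od -An -tu1 | awk '{ $1 = $1; print }'; }

printf '\035' | byte_values    # octal 035: one byte, the Group Separator
printf '\29'  | byte_values    # octal 2, then the character "9": two bytes
```

The first command should report 29, and the second should report 2 and 57.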

Future Plans:

C library Implementation
C Console Application Implementation
Submission to The Internet Engineering Task Force for Possible Standardization

Philosophy

Escape sequences should always be used to create multi-line field data in a line based database, and quoting should always be used to preserve and concatenate record field content. This is the simplest (and therefore most efficient) way to implement a record parser of this type. I encourage all software developers and educational institutions to take up and profess this direction to ensure portability and exact acquisition of line based delimiter separated records.
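As an illustration of the point about line-based records, here is a sketch with made-up data and a hypothetical function name. When a line break inside a field is stored as the escape sequence \n, a plain grep hands back the complete record, even when the match falls in the text after the line break.

```shell
# Made-up sample data: the second field of the first record contains a line
# break, stored as the two characters backslash and n.
sample_records() {
    printf '%s\n' \
        'id1,first line\nsecond line,done' \
        'id2,single line,done'
}

# The whole record comes back because it occupies exactly one physical line.
sample_records | grep 'second line'
```

Had the record been split across two physical lines instead, the same grep would have returned only its second half.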

CSV has no official specification other than RFC 4180, which does not define any useful portable record types. As a developer with over 15 years of experience, I hope this will become the officially recognized standard for Portable CSV/DSV Records.

Answer 1:

There were way too many blank lines in the original version of the code, which made it hard to read. The revised code with reduced blank lines is much more easily read; related lines are in blocks that can be read together. Thanks.

awk is like C; it treats 0 as false and anything non-zero as true. So, anything greater than 0 is true, but so is anything less than 0.
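A quick way to see this for yourself (the helper name truthy is made up for this sketch; n + 0 forces a numeric comparison):

```shell
# Hypothetical helper: report how awk treats a numeric value in a condition.
truthy() { awk -v n="$1" 'BEGIN { if (n + 0) print "true"; else print "false" }'; }

truthy 0     # false
truthy 1     # true
truthy -1    # true
```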

There isn't a direct way to print to stderr in standard awk. GNU AWK documents the use of print "message" > "/dev/stderr" (name as string!) and implies that it might work even on systems without the actual device. It will work with standard awk too on systems with the /dev/stderr device.
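Two ways of doing this, as a sketch. The function names are made up, and the second variant assumes a cat utility is on the PATH; piping to "cat 1>&2" is the older workaround for awks or systems without /dev/stderr.

```shell
# Variant 1: the special file name, given as a string.
warn_special_file() {
    awk 'BEGIN { print "dsv parse error: demo" > "/dev/stderr" }'
}

# Variant 2: pipe the message through a command that writes to stderr.
warn_via_pipe() {
    awk 'BEGIN { print "dsv parse error: demo" | "cat 1>&2" }'
}

# Nothing reaches stdout in either case; the message arrives on stderr.
warn_via_pipe 2>&1 >/dev/null
```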

The awk idiom for processing each index in an array is for (i in array) { ... }. However, since you have an index, itmIndex, telling you how many items are in the array, you should use

for (i = 0; i < itmIndex; i++) { printf("%s%s", item[i], delim); }

and then output a newline at the end. That gets one delimiter too many to my way of thinking, but that's a transcription of what the bash code is doing. My usual trick for this is:

pad = ""
for (i = 0; i < itmIndex; i++)
{
     printf("%s%s", pad, item[i])
     pad = delim
}
print "";

You can pass variables into the script with -v var=value (or omit the -v). See the POSIX URL listed before.
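One caveat, shown as a sketch with made-up function names: awk applies -v assignments before BEGIN runs, whereas a bare var=value operand is applied only when awk reaches it in the argument list, which is after BEGIN. For a script that does its work in BEGIN, as dsv.awk does, only the -v form is visible there.

```shell
# With -v the value is already set inside BEGIN.
show_with_v() {
    echo x | awk -v delim=';' \
        'BEGIN { printf "BEGIN=[%s] ", delim } { print "main=[" delim "]" }'
}

# As an operand the value is set only once input processing starts.
# The trailing "-" names standard input explicitly.
show_operand() {
    echo x | awk \
        'BEGIN { printf "BEGIN=[%s] ", delim } { print "main=[" delim "]" }' \
        delim=';' -
}

show_with_v     # BEGIN=[;] main=[;]
show_operand    # BEGIN=[] main=[;]
```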

Source: https://stackoverflow.com/questions/10578119/awk-recursive-descent-csv-parser

Tags

parsing

csv

awk

recursive-descent