Parsing a CSV file using gawk

感动是毒 2020-11-29 12:12

How do you parse a CSV file using gawk? Simply setting FS="," is not enough, because a quoted field that contains a comma will be split into multiple fields.

9 Answers
  •  予麋鹿 (OP)
    2020-11-29 12:42

    {
      ColumnCount = 0
      $0 = $0 ","                           # Assures all fields end with comma
      while($0)                             # Get fields by pattern, not by delimiter
      {
        match($0, / *"[^"]*" *,|[^,]*,/)    # Find a field with its delimiter suffix
        Field = substr($0, RSTART, RLENGTH) # Get the located field with its delimiter
        gsub(/^ *"?|"? *,$/, "", Field)     # Strip delimiter text: comma/space/quote
        Column[++ColumnCount] = Field       # Save field without delimiter in an array
        $0 = substr($0, RLENGTH + 1)        # Remove processed text from the raw data
      }
    }
    

    Rules that follow this one can read the fields from Column[]. ColumnCount holds the number of fields found in the current record. Column[] is never cleared between records, so if rows differ in length, the entries after Column[ColumnCount] still hold values left over from an earlier, longer row.

    This implementation is slow, but it appears to emulate the FPAT/patsplit() feature found in gawk >= 4.0.0 mentioned in a previous answer.

    Reference
