Parsing a CSV file using gawk

后端未结

关注

 9  1256

感动是毒 2020-11-29 12:12

How do you parse a CSV file using gawk? Simply setting FS=\",\" is not enough, as a quoted field with a comma inside will be treated as multiple fields.

9条回答

予麋鹿 (楼主)

2020-11-29 12:42

{
  ColumnCount = 0
  $0 = $0 ","                           # Assures all fields end with comma
  while($0)                             # Get fields by pattern, not by delimiter
  {
    match($0, / *"[^"]*" *,|[^,]*,/)    # Find a field with its delimiter suffix
    Field = substr($0, RSTART, RLENGTH) # Get the located field with its delimiter
    gsub(/^ *"?|"? *,$/, "", Field)     # Strip delimiter text: comma/space/quote
    Column[++ColumnCount] = Field       # Save field without delimiter in an array
    $0 = substr($0, RLENGTH + 1)        # Remove processed text from the raw data
  }
}

Patterns that follow this one can access the fields in Column[]. ColumnCount indicates the number of elements in Column[] that were found. If not all rows contain the same number of columns, Column[] contains extra data after Column[ColumnCount] when processing the shorter rows.

This implementation is slow, but it appears to emulate the FPAT/patsplit() feature found in gawk >= 4.0.0 mentioned in a previous answer.

Reference

0 讨论(0)

查看其它9个回答