Parse a CSV using awk, ignoring commas inside a field

抹茶落季 2020-11-29 04:36

I have a CSV file where each row defines a room in a given building. Along with the room, each row has a floor field. What I want to extract is all floors in all buildings.

7 Answers
  •  不知归路
    2020-11-29 04:50

    Since the real problem is distinguishing a comma inside a CSV field from one that separates fields, we can replace the first kind of comma with something else so that the file is easier to parse afterwards, i.e., something like this:

    0,"00BDF","AIRPORT TEST            "
    0,0,"BRICKER HALL JOHN W    "
    

    This gawk script (replace-comma.awk) does that:

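    BEGIN { RS = "(.)" }            # every single character acts as a record separator
    RT == "\x22" { inside++ }       # count double quotes; an odd count means we are inside a quoted field
    { if (inside % 2 && RT == ",") printf(""); else printf("%s", RT) }   # "%s" keeps a literal % in the data safe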
    

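    This uses a gawk feature that stores the text actually matched by the record separator in a variable called RT. With RS = "(.)", every character is treated as a record separator, so the records themselves are empty and the script simply echoes RT character by character. While doing so, it counts double-quote characters to know whether it is inside a quoted field, and replaces each comma found there with a placeholder. As written, printf("") prints nothing at that point, so the comma is simply dropped.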
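    RT is specific to gawk. For other awk implementations, the same idea can be sketched with a plain character loop. The function name replace_inner_commas and the default placeholder ; are assumptions made for illustration, not part of the original answer. As in the gawk script, quote parity carries across lines.

```shell
# Hypothetical helper (not from the original answer): replace commas that sit
# inside double-quoted CSV fields with a placeholder, using only POSIX awk.
# The default placeholder ";" is an arbitrary choice for this sketch.
replace_inner_commas() {
    awk -v placeholder="${1:-;}" '
    {
        out = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "\"")
                inside = !inside          # a doubled "" flips twice, so parity is preserved
            if (inside && c == ",")
                out = out placeholder     # comma inside quotes: swap in the placeholder
            else
                out = out c               # everything else is copied unchanged
        }
        print out
    }'
}
```

    Usage, with the placeholder left at its default:

    $ echo '"Adams, John ""Big Foot""",1' | replace_inner_commas | awk -F, '{ print $1 }'
    "Adams; John ""Big Foot"""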

    The FPAT solution fails in one special case, where a field contains both escaped quotes and a comma inside quotes, whereas this solution handles that case as well, e.g.:

    $ echo '"Adams, John ""Big Foot""",1' | gawk -vFPAT='[^,]*|"[^"]*"' '{ print $1 }'
    "Adams, John "
    $ echo '"Adams, John ""Big Foot""",1' | gawk -f replace-comma.awk | gawk -F, '{ print $1; }'
    "Adams John ""Big Foot"""
    

    As a one-liner for easy copy-paste:

    gawk 'BEGIN { RS = "(.)" } RT == "\x22" { inside++ } { if (inside % 2 && RT == ",") printf(""); else printf("%s", RT) }'
    
