I have a CSV file where each row defines a room in a given building. Along with the room, each row has a floor field. What I want to extract is all floors in all buildings.
Since the real problem is distinguishing a comma inside a CSV field from one that separates fields, we can replace the first kind of comma with something else so that the result is easier to parse further, i.e., something like this:
0,"00BDF","AIRPORT TEST "
0,0,"BRICKER HALL JOHN W "
This gawk script (replace-comma.awk) does that:
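BEGIN { RS = "(.)" }          # every character becomes a record terminator, captured in RT
RT == "\x22" { inside++; }    # count double quotes; an odd count means we are inside a quoted field
{ if (inside % 2 && RT == ",") printf(""); else printf("%s", RT); }   # drop commas inside quotes, pass everything else through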
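This uses a gawk feature that captures the actual record separator into a variable called RT. Because RS is a regular expression that matches any single character, every character becomes its own record terminator. As we read through them we keep a running count of double quotes, and a comma seen while that count is odd (that is, inside a quoted field) is replaced with an empty string, which drops it.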
The FPAT solution fails in one special case, where a quoted field contains both escaped quotes and a comma, whereas this solution handles that case as well. For example:
$ echo '"Adams, John ""Big Foot""",1' | gawk -vFPAT='[^,]*|"[^"]*"' '{ print $1 }'
"Adams, John "
$ echo '"Adams, John ""Big Foot""",1' | gawk -f replace-comma.awk | gawk -F, '{ print $1; }'
"Adams John ""Big Foot"""
As a one-liner for easy copy-paste:
gawk 'BEGIN { RS = "(.)" } RT == "\x22" { inside++; } { if (inside % 2 && RT == ",") printf(""); else printf("%s", RT); }'