Parsing a CSV file using gawk

How do you parse a CSV file using gawk? Simply setting FS="," is not enough, as a quoted field with a comma inside will be treated as multiple fields.

9 answers

独厮守ぢ

2020-11-29 12:32

The gawk version 4 manual says to use FPAT = "([^,]*)|(\"[^\"]+\")"

When FPAT is defined, it disables FS and specifies fields by content instead of by separator.
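
As a minimal sketch of how this might be used (the wrapper name `print_csv_fields` and the sample input are illustrative, not from the manual), the function below sets FPAT and prints each field. FPAT leaves the surrounding quotes on a quoted field, so the sketch strips them for display. It assumes gawk 4.0 or later and does not handle escaped quotes or newlines inside fields.

```shell
# print_csv_fields is a hypothetical helper name; FPAT requires gawk 4.0 or later.
print_csv_fields() {
  gawk '
    # A field is either a run of non-comma characters or a double-quoted string.
    BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")" }
    {
      for (i = 1; i <= NF; i++) {
        field = $i
        # FPAT keeps the surrounding quotes, so remove them before printing.
        if (field ~ /^".*"$/) field = substr(field, 2, length(field) - 2)
        printf "field #%d: %s\n", i, field
      }
    }
  ' "$@"
}
```

For example, `printf '%s\n' 'one,two,"three, four",five' | print_csv_fields` should report four fields, with `three, four` kept together as field #3.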

佛祖请我去吃肉

2020-11-29 12:32

Here's what I came up with. Any comments and/or better solutions would be appreciated.

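BEGIN { FS="," }
{
  n = 0                                  # reset the field count for every record
  for (i=1; i<=NF; i++) {
    f[++n] = $i
    if (substr(f[n],1,1)=="\"") {
      # Keep appending pieces until the field ends with an unescaped quote.
      # The i<NF test stops the loop if a quote is never closed on this line.
      while (i<NF && (length(f[n])<2 || substr(f[n], length(f[n]))!="\"" || substr(f[n], length(f[n])-1, 1)=="\\")) {
        f[n] = sprintf("%s,%s", f[n], $(++i))
      }
    }
  }
  for (i=1; i<=n; i++) printf "field #%d: %s\n", i, f[i]
  print "----------------------------------\n"
}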

The basic idea is that I loop through the fields, and any field which starts with a quote but does not end with a quote gets the next field appended to it.

予麋鹿

2020-11-29 12:42

{
  ColumnCount = 0
  $0 = $0 ","                           # Assures all fields end with comma
  while($0)                             # Get fields by pattern, not by delimiter
  {
    match($0, / *"[^"]*" *,|[^,]*,/)    # Find a field with its delimiter suffix
    Field = substr($0, RSTART, RLENGTH) # Get the located field with its delimiter
    gsub(/^ *"?|"? *,$/, "", Field)     # Strip delimiter text: comma/space/quote
    Column[++ColumnCount] = Field       # Save field without delimiter in an array
    $0 = substr($0, RLENGTH + 1)        # Remove processed text from the raw data
  }
}

Patterns that follow this one can access the fields in Column[]. ColumnCount indicates the number of elements in Column[] that were found. If not all rows contain the same number of columns, Column[] contains extra data after Column[ColumnCount] when processing the shorter rows.

This implementation is slow, but it appears to emulate the FPAT/patsplit() feature found in gawk >= 4.0.0 mentioned in a previous answer.
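
To illustrate how a later rule can read Column[], here is a hedged sketch that moves the same match()-based loop into a function, so that $0 is left untouched. The names `list_columns` and `split_csv` are made up for this example. It uses only POSIX awk features and, like the code above, ignores escaped quotes.

```shell
# list_columns is a hypothetical wrapper name; the awk code needs no gawk extensions.
list_columns() {
  awk '
    # Fill the global Column[] from line and return the number of columns found.
    function split_csv(line,    count, field) {
      count = 0
      line = line ","                                # every field now ends with a comma
      while (line != "") {
        match(line, / *"[^"]*" *,|[^,]*,/)           # next field plus its trailing comma
        field = substr(line, RSTART, RLENGTH)
        gsub(/^ *"?|"? *,$/, "", field)              # strip spaces, quotes and the comma
        Column[++count] = field
        line = substr(line, RSTART + RLENGTH)        # drop the processed text
      }
      return count
    }
    {
      ColumnCount = split_csv($0)
      # Stop at ColumnCount: entries from an earlier, longer row may still be in Column[].
      for (i = 1; i <= ColumnCount; i++) printf "%d:%s\n", i, Column[i]
    }
  ' "$@"
}
```

Run it as `list_columns input.csv`, or pipe data into it. Looping only up to ColumnCount is what keeps stale entries from longer rows out of the output.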

猫巷女王i

2020-11-29 12:50
You can use a simple wrapper function called csvquote to sanitize the input and restore it after awk is done processing it. Pipe your data through it at the start and end, and everything should work out ok:

before:
```
gawk -f myprogram.awk input.csv
```
after:
```
csvquote input.csv | gawk -f myprogram.awk | csvquote -u
```
See https://github.com/dbro/csvquote for code and documentation.

名媛妹妹

2020-11-29 12:53

I am not exactly sure whether this is the right way to do things. I would rather work on a CSV file in which either all values are quoted or none are. By the way, awk allows a regular expression as the field separator. Check whether that is useful.
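
A rough sketch of that idea, assuming every value in the file is double-quoted and no value contains the sequence `","` or an escaped quote (the function name `split_all_quoted` is invented for this example): the separator becomes the three characters `","`, and the leftover outer quotes are trimmed from the first and last fields.

```shell
# split_all_quoted is a hypothetical helper name; works only when ALL values are quoted.
split_all_quoted() {
  awk '
    BEGIN { FS = "\",\"" }            # multi-character FS is a regex: quote, comma, quote
    NF > 0 {
      sub(/^"/, "", $1)               # drop the opening quote of the first field
      sub(/"$/, "", $NF)              # drop the closing quote of the last field
      for (i = 1; i <= NF; i++) printf "field #%d: %s\n", i, $i
    }
  ' "$@"
}
```

With the input line `"one","two","three, four"`, this should print three fields and keep `three, four` intact. It breaks as soon as a file mixes quoted and unquoted values, which is the limitation described above.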

刺人心

2020-11-29 12:54
Perl has the Text::CSV_XS module which is purpose-built to handle the quoted-comma weirdness.
Alternately try the Text::CSV module.

perl -MText::CSV_XS -ne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();for $n (0..$#f) {print "field #$n: $f[$n]\n"};print "---\n"}' file.csv

Produces this output:
```
field #0: one
field #1: two
field #2: three, four
field #3: five
---
field #0: six, seven
field #1: eight
field #2: nine
---
```
Here's a human-readable version.
Save it as parsecsv, chmod +x, and run it as "parsecsv file.csv"
```
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new();
open(my $data, '<', $ARGV[0]) or die "Could not open '$ARGV[0]' $!\n";
while (my $line = <$data>) {
    if ($csv->parse($line)) {
        my @f = $csv->fields();
        for my $n (0..$#f) {
            print "field #$n: $f[$n]\n";
        }
        print "---\n";
    }
}
```
You may need to point to a different version of perl on your machine, since the Text::CSV_XS module may not be installed for your default perl. If it is missing, you will see an error like this:
```
Can't locate Text/CSV_XS.pm in @INC (@INC contains: /home/gnu/lib/perl5/5.6.1/i686-linux /home/gnu/lib/perl5/5.6.1 /home/gnu/lib/perl5/site_perl/5.6.1/i686-linux /home/gnu/lib/perl5/site_perl/5.6.1 /home/gnu/lib/perl5/site_perl .).
BEGIN failed--compilation aborted.
```
If none of your versions of Perl have Text::CSV_XS installed, you'll need to install it:
sudo apt-get install cpanminus
sudo cpanm Text::CSV_XS