Using awk or perl to extract specific columns from CSV (parsing)

Question

Background - I want to extract specific columns from a csv file. The csv file is comma delimited, uses double quotes as the text-qualifier (optional, but when a field contains special characters, the qualifier will be there - see example), and uses backslashes as the escape character. It is also possible for some fields to be blank.

Example Input and Desired Output - For example, I only want columns 1, 3, and 4 to be in the output file. The final extract of the columns from the csv file should match the format of the original file. No escape characters should be removed or extra quotes added and such.

Input

"John \"Super\" Doe",25,"123 ABC Street",123-456-7890,"M",A
"Jane, Mary","",132 CBS Street,333-111-5332,"F",B
"Smith \"Jr.\", Jane",35,,555-876-1233,"F",
"Lee, Jack",22,123 Sesame St,"","M",D

Desired Output

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,""

Preliminary Script (awk) - The following is a preliminary script I found. It works for the most part, but it fails in one particular case that I noticed, and possibly in others that I have not seen or thought of yet.

#!/usr/xpg4/bin/awk -f

BEGIN{  OFS = FS = ","  }

/"/{
    for(i=1;i<=NF;i++){
        if($i ~ /^"[^"]+$/){
            for(x=i+1;x<=NF;x++){
                $i=$i","$x
                if($i ~ /"+$/){
                    z = x - (i + 1) + 1
                    for(y=i+1;y<=NF;y++)
                        $y = $(y + z)
                    break
                }
            }
            NF = NF - z
            i=x
        }
    }
print $1,$3,$4
}

The above seems to work well until it comes across a field that contains both escaped double quotes as well as a comma. In that case, the parsing will be off and the output will be incorrect.

Question/Comments - I have read that awk is not the best option for parsing through csv files, and perl is suggested. However, I do not know perl at all. I have found some examples of perl scripts, but they do not give the desired output I am looking for and I do not know how to edit the scripts easily for what I want.

As for awk, I am familiar with it and use its basic functionality occasionally, but I do not know much of the advanced functionality, such as some of the commands used in the script above. Is my desired output possible using awk alone? If so, would it be possible to edit the script above to fix the issue I am having with it? Could someone explain line by line what exactly the script is doing?

Any help would be appreciated, thanks!

Answer 1:

I'm not going to reinvent the wheel.

use Text::CSV_XS;

my $csv = Text::CSV_XS->new({
   binary      => 1,
   escape_char => '\\',
   eol         => "\n",
});

my $fh_in  = \*STDIN;
my $fh_out = \*STDOUT;

while (my $row = $csv->getline($fh_in)) {
   $csv->print($fh_out, [ @{$row}[0,2,3] ])
      or die("".$csv->error_diag());
}

$csv->eof()
   or die("".$csv->error_diag());

Output:

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary","132 CBS Street",333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack","123 Sesame St",

It adds quotes around addresses that didn't have any already, but since some addresses already have quotes around them, you obviously can handle that.

Reinventing the wheel:

my $field = qr/"(?:[^"\\]|\\.)*"|[^"\\,]*/s;
while (<>) {
   my @fields = /^($field),$field,($field),($field),/
      or die;
   print(join(',', @fields), "\n");
}

Output:

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,""

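For readers more comfortable with Python than Perl, the same idea (describe what a field looks like and keep its raw text untouched) might be sketched as follows. This is only an illustration under the question's assumptions: one record per line, backslash as the escape character, and no backslashes outside quoted fields. The names FIELD_RE, split_raw and extract_columns are made up for this example.

```python
import re

# A field is either a double-quoted string (backslash escapes allowed inside)
# or a possibly empty run of characters with no comma, quote or backslash.
FIELD_RE = re.compile(r'"(?:[^"\\]|\\.)*"|[^",\\]*')


def split_raw(line):
    """Split one CSV line into raw fields, keeping quotes and escapes as-is."""
    fields, pos = [], 0
    while True:
        match = FIELD_RE.match(line, pos)  # always matches, possibly empty
        fields.append(match.group())
        pos = match.end()
        if pos == len(line):
            return fields
        if line[pos] != ',':
            raise ValueError('malformed CSV line: %r' % line)
        pos += 1  # step over the separating comma


def extract_columns(line, indexes=(0, 2, 3)):
    """Return the chosen (0-based) columns of a line, joined by commas."""
    fields = split_raw(line)
    return ','.join(fields[i] for i in indexes)


sample = r'"Smith \"Jr.\", Jane",35,,555-876-1233,"F",'
print(extract_columns(sample))
```

Because the fields are never decoded and re-encoded, the sample line comes back exactly as in the desired output, with its escaped quotes and empty field intact. Unlike the Perl loop above, this sketch takes the column list as a parameter instead of baking it into the pattern.
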
Answer 2:

I'd suggest python csv module:

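#!/usr/bin/env python3
import csv

# newline='' lets the csv module handle line endings itself
with open('input.csv', newline='') as src, open('output.csv', 'w', newline='') as dst:
    rdr = csv.reader(src, escapechar='\\')
    wtr = csv.writer(dst, escapechar='\\', doublequote=False)
    for row in rdr:
        wtr.writerow(row[0:1] + row[2:4])  # columns 1, 3 and 4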

output.csv

John \"Super\" Doe,123 ABC Street,123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,

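The output above does not match the desired output exactly, most likely because the csv module hands back only the decoded field values, so whether a field was quoted in the input is no longer known when the row is written out again. A small sketch using two of the question's lines held in memory (SAMPLE and parse are names chosen just for this illustration) shows what the reader returns:

```python
import csv
import io

# Two lines from the question's input, kept verbatim in a raw string.
SAMPLE = r'''"Smith \"Jr.\", Jane",35,,555-876-1233,"F",
"Lee, Jack",22,123 Sesame St,"","M",D
'''


def parse(text):
    """Parse CSV text that uses backslash as the escape character."""
    return list(csv.reader(io.StringIO(text, newline=''), escapechar='\\'))


rows = parse(SAMPLE)
for row in rows:
    print(row)
```

In the parsed rows the escaped quotes are already unescaped, and both the empty unquoted field and the quoted "" come back as the same empty string, which is why the writer cannot reproduce the original quoting.
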
Answer 3:

The following command extracts the required fields (e.g., the first, third and fourth), separated by the delimiter ',', from sample.csv and prints them to the console:

cut -f1,3,4 -d',' sample.csv

To store the output in a new CSV file, redirect it to a file:

cut -f1,3,4 -d',' sample.csv > newSample.csv

Note that cut splits on every comma, including commas inside quoted fields, so it only gives the right result when no field contains the delimiter.
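To see how a plain delimiter split treats the question's data, here is a rough Python stand-in for the cut command. It is a sketch only: cut_fields is a hypothetical helper, and it assumes every line has at least four comma-separated pieces.

```python
def cut_fields(line, indexes=(0, 2, 3), delimiter=','):
    """Mimic a plain delimiter split: no awareness of quotes or escapes."""
    parts = line.split(delimiter)
    return delimiter.join(parts[i] for i in indexes)


line = r'"Jane, Mary","",132 CBS Street,333-111-5332,"F",B'
print(cut_fields(line))  # prints: "Jane,"",132 CBS Street
```

The comma inside "Jane, Mary" is treated as a field boundary, so every later column shifts by one and the selected columns are no longer the ones intended.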

Answer 4:

Before I post, I see now that this is an old question bumped by an already deleted answer. However, I thought I would still use the opportunity to show off Tie::Array::CSV, which makes CSV file manipulation as easy as working with Perl arrays. Full disclosure: I'm the author.

Anyway here is the script. The OP's data required changing the escape character and Perl indexes arrays starting at 0, but other than that this should be quite readable.

#!/usr/bin/env perl

use strict;
use warnings;

use Tie::Array::CSV;

my $opts = { text_csv => { escape_char => '\\' } };

tie my @input,  'Tie::Array::CSV', 'data', $opts or die "Cannot open file 'data': $!";
tie my @output, 'Tie::Array::CSV', 'out',  $opts or die "Cannot open file 'out': $!";

for my $row (@input) {
  my @slice = @{ $row }[0,2,3];
  push @output, \@slice;
}

That said, I think that last loop doesn't lose too much readability if I convert it to the (IMO) more impressive form:

push @output, [ @{$_}[0,2,3] ] for @input;

Answer 5:

csvkit is a tool that handles csv files and allows such operations (among other features).

See csvcut. Its command line interface is compact and it handles the multitude of CSV formats (TSV, other delimiters, encodings, escape chars etc.)

What you asked for can be done using (csvcut column numbers are 1-based by default):

csvcut --columns 1,3,4 input.csv

Answer 6:

GNU awk solution. Just using the wheel as a wheel. You can define what fields should look like using FPAT, like this:

$ awk -vFPAT='[^,]+|"[^"]*"' -vOFS=, '{print $1, $3, $4}' file

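which results in the output below. Note that the third line is still wrong: this pattern does not recognise backslash-escaped quotes, and [^,]+ cannot match an empty field, so the empty third column is skipped: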

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\",35,555-876-1233
"Lee, Jack",123 Sesame St,""

Explanation of the regex:

[^,]+           # 1 or more occurrences of anything that's not a comma, 
|               # OR
"[^"]*"         # 0 or more characters unequal to '"' enclosed by '"'

Read about FPAT in the gawk manual

Now, walking you through your script. Basically it tries to rewrite what your fields look like. At first, you split by ",", which obviously causes some problems. Next, it looks for fields that are not properly closed by a '"'.

BEGIN{OFS=FS =","}                        # set field sep (FS) and output field 
                                          #   sep to ,
/"/{                                      # for each line matching '"'
    for(i=1;i<=NF;i++){                   # loop through fields 1 to NF
        if($i ~ /^"[^"]+$/){              # IF field $i starts with '"', followed
                                          #   only by non-quotes (a quoted field
                                          #   that was opened but not closed)
            for(x=i+1;x<=NF;x++){         # loop through the following fields
                $i=$i","$x                # append field $x to field $i,
                                          #   separated by ","
                if($i ~ /"+$/){           # IF field $i now ends with '"'
                    z = x - (i + 1) + 1   # z = x - i: number of fields merged into $i
                    for(y=i+1;y<=NF;y++)  
                        $y = $(y + z)     # change contents of following fields to 
                                          #   contents of field, z steps further
                                          #   down the line
                    break                 # break out of for(x) loop
                }
            }
            NF = NF - z                   # reset number of fields
            i=x                           # continue loop for(i) at index x
        }
    }
 print $1,$3,$4
}

Your script fails on this input line:

"Smith \"Jr.\", Jane",35,,555-876-1233,"F",

simply because $i ~ /^"[^"]+$/ fails on $1.
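The failing test can be reproduced outside awk. The following Python sketch applies the same regular expression to the pieces that splitting on "," produces; OPENS_QUOTED_FIELD is just an illustrative name, and the split stands in for what FS="," does to the record.

```python
import re

# Python equivalent of the awk test  $i ~ /^"[^"]+$/
OPENS_QUOTED_FIELD = re.compile(r'^"[^"]+$')

line = r'"Smith \"Jr.\", Jane",35,,555-876-1233,"F",'
pieces = line.split(',')  # roughly what FS="," does to the record

print(pieces[0])                                    # "Smith \"Jr.\"
print(bool(OPENS_QUOTED_FIELD.search(pieces[0])))   # False
print(bool(OPENS_QUOTED_FIELD.search('"Jane')))     # True
```

The first piece contains escaped quotes after the opening one, so [^"]+ cannot reach the end of the string and the merge loop is never entered. The piece ' Jane"' therefore stays a separate field and shifts every later column by one.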

I hope you agree with me that rewriting the fields like this can be tricky. More than that, it's like "O, I like awk, but I'm going to use it like C/perl/python." Using FPAT is a shorter solution, to say the least.

Answer 7:

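I made some mistakes, hopefully corrected now. Note that the substitutions below are hard-coded against the values in the sample data; this is not a general CSV parser.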

awk '{sub(/y",""/,"y\42")sub(/,2.|,3./,"")sub(/,".",.*/,"")}1' file

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,""

Source: https://stackoverflow.com/questions/9287770/using-awk-or-perl-to-extract-specific-columns-from-csv-parsing

Tags: perl, parsing, csv, awk