How can I parse quoted CSV in Perl with a regex?

青春惊慌失措 2020-11-30 09:10

I'm having some issues parsing CSV data with quotes. My main problem is with quotes within a field. In the following example, lines 1 - 4 work correctly but 5, 6 and 7 d

7 Answers
  • 2020-11-30 09:35

    Please, Try Using CPAN

    There's no reason you couldn't download a copy of Text::CSV, or any other non-XS implementation of a CSV parser, and install it in your local directory or in a lib/ subdirectory of your project, so that it's installed along with your project's rollout.

    If you can't store text files in your project, then I'm wondering how it is you are coding your project.

    http://novosial.org/perl/life-with-cpan/non-root/

    Should be a good guide on how to get these into a working state locally.

    Not using CPAN is really a recipe for disaster.

    Please consider this before trying to write your own CSV implementation.

    Text::CSV is over a hundred lines of code, including fixed bugs and edge cases, and re-writing this from scratch will just make you learn how awful CSV can be the hard way.

    note: I learnt this the hard way. Took me a full day to get a working CSV parser in PHP before I discovered an inbuilt one had been added in a later version. It really is something awful.

  • 2020-11-30 09:43

    This works like a charm.

    The line is assumed to be comma-separated, with embedded commas inside quoted fields:

    use Text::ParseWords;
    my @columns = Text::ParseWords::parse_line(',', 0, $line);
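
    As a minimal sketch of what the second argument (`keep`) does: a false value strips the enclosing quotes from each field, and a true value leaves them in place. The sample line is illustrative only.

    ```perl
    use strict;
    use warnings;
    use Text::ParseWords qw(parse_line);

    # Illustrative sample line: the second field contains an embedded comma.
    my $line = 'S,"BELT,FAN",003541547';

    # keep = 0: enclosing quotes are removed from the returned fields.
    my @stripped = parse_line(',', 0, $line);

    # keep = 1: enclosing quotes are left in place.
    my @kept = parse_line(',', 1, $line);

    print join('|', @stripped), "\n";   # S|BELT,FAN|003541547
    print join('|', @kept), "\n";       # S|"BELT,FAN"|003541547
    ```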

  • 2020-11-30 09:48

    Finding matching pairs with regexes is a non-trivial and, in general, unsolvable task. There are plenty of examples in Jeffrey Friedl's book Mastering Regular Expressions. I don't have it at hand now, but I remember that he used CSV for some examples, too.

  • 2020-11-30 09:48

    You can (try to) use CPAN.pm to simply have your program install/update Text::CSV. As said before, you can even "install" it to a home or local directory, and add that directory to @INC (or, if you prefer not to use BEGIN blocks, you can use lib 'dir'; - it's probably better).
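
    A rough sketch of the `use lib` approach, assuming a pure-Perl copy of Text::CSV has been placed in a `lib` directory next to the script (that directory name is hypothetical). The module is loaded inside `eval` so the sketch still runs when it is not present.

    ```perl
    use strict;
    use warnings;
    use FindBin qw($Bin);

    # Assumption: modules were copied into a "lib" directory that sits next
    # to this script. "use lib" adds that directory to @INC at compile time.
    use lib "$Bin/lib";

    # Load Text::CSV only if it can be found, so the sketch runs either way.
    my $have_csv = eval { require Text::CSV; 1 };

    if ($have_csv) {
        my $csv = Text::CSV->new({ binary => 1 });
        if ($csv->parse('S,"BELT,FAN",003541547')) {
            print join('|', $csv->fields), "\n";
        }
    } else {
        print "Text::CSV not found; searched $Bin/lib and the default \@INC\n";
    }
    ```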

  • 2020-11-30 09:50

    You can parse CSV using Text::ParseWords which ships with Perl.

    use feature 'say';    # say() is not enabled by default
    use Text::ParseWords;
    
    while (<DATA>) {
        chomp;
        my @f = quotewords ',', 0, $_;
        say join ":" => @f;
    }
    
    __DATA__
    COLLOQ_TYPE,COLLOQ_NAME,COLLOQ_CODE,XDATA
    S,"BELT,FAN",003541547,
    S,"BELT V,FAN",000324244,
    S,SHROUD SPRING SCREW,000868265,
    S,"D" REL VALVE ASSY,000771881,
    S,"YBELT,"V"",000323030,
    S,"YBELT,'V'",000322933,
    

    which parses your CSV correctly....

    # => COLLOQ_TYPE:COLLOQ_NAME:COLLOQ_CODE:XDATA
    # => S:BELT,FAN:003541547:
    # => S:BELT V,FAN:000324244:
    # => S:SHROUD SPRING SCREW:000868265:
    # => S:D REL VALVE ASSY:000771881:
    # => S:YBELT,V:000323030:
    # => S:YBELT,'V':000322933:
    

    The only issue I've had with Text::ParseWords is when nested quotes in the data aren't escaped correctly. However, that is badly built CSV data and would cause problems with most CSV parsers ;-)

    So you may notice that

    # S,"YBELT,"V"",000323030,
    

    came out as follows (i.e. the quotes around "V" were dropped)

    # S:YBELT,V:000323030:
    

    However, if it's escaped like so

    # S,"YBELT,\"V\"",000323030,
    

    then the quotes will be retained

    # S:YBELT,"V":000323030:
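
    A small side-by-side check of the two forms above, sketched with the same `quotewords` call. Note that backslash escaping is a Text::ParseWords convention; standard CSV (RFC 4180) doubles the quote character instead, which `quotewords` does not interpret.

    ```perl
    use strict;
    use warnings;
    use Text::ParseWords qw(quotewords);

    # The same record twice: inner quotes unescaped, then backslash-escaped.
    my $unescaped = 'S,"YBELT,"V"",000323030';
    my $escaped   = 'S,"YBELT,\"V\"",000323030';

    my @plain = quotewords(',', 0, $unescaped);
    my @esc   = quotewords(',', 0, $escaped);

    print "$plain[1]\n";   # YBELT,V    (inner quotes lost)
    print "$esc[1]\n";     # YBELT,"V"  (inner quotes kept)
    ```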
    
  • 2020-11-30 09:51

    Tested:

    
    use Test::More tests => 2;
    
    use strict;
    
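    # Split one CSV line on commas that are not inside a double-quoted field.
    sub splitCommaNotQuote {
        my ( $line ) = @_;
    
        my @fields = ();
    
        # $1 = whole field, $2 = opening quote (set only for a fully quoted
        # field), $3 = text between the quotes, $4 = the separator (',' or '').
        while ( $line =~ m/((\")([^\"]*)\"|[^,]*)(,|$)/g ) {
            if ( $2 ) {
                push( @fields, $3 );    # fully quoted: drop the quotes
            } else {
                push( @fields, $1 );    # unquoted or partly quoted: keep as is
            }
            last if ( ! $4 );           # no trailing comma: end of line
        }
    
        return( @fields );
    }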
    
    is_deeply(
        +[splitCommaNotQuote('S,"D" REL VALVE ASSY,000771881,')],
        +['S', '"D" REL VALVE ASSY', '000771881', ''],
        "Quote in value"
    );
    is_deeply(
        +[splitCommaNotQuote('S,"BELT V,FAN",000324244,')],
        +['S', 'BELT V,FAN', '000324244', ''],
        "Strip quotes from entire value"
    );
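
    The regex above treats a quoted field as running up to the next `"`. If the data instead follows the RFC 4180 convention of doubling embedded quotes, one possible variation is sketched below. The sub name `split_csv_line` is hypothetical, and this is not a full CSV parser: it assumes a single chomped line and does not handle multi-line fields.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: split one chomped CSV line, treating "" inside a
    # quoted field as a literal double quote.
    sub split_csv_line {
        my ($line) = @_;
        my @fields;

        # $1 = contents of a quoted field, $2 = an unquoted field,
        # $3 = the separator (',' or '' at end of line).
        while ( $line =~ m/\G(?:"((?:[^"]|"")*)"|([^,]*))(,|\z)/g ) {
            my ( $quoted, $plain, $sep ) = ( $1, $2, $3 );
            if ( defined $quoted ) {
                $quoted =~ s/""/"/g;    # undo the quote doubling
                push @fields, $quoted;
            } else {
                push @fields, $plain;
            }
            last if $sep eq '';         # reached the end of the line
        }
        return @fields;
    }

    print join( '|', split_csv_line('S,"YBELT,""V""",000323030,') ), "\n";
    # S|YBELT,"V"|000323030|
    ```

    A field that is only partly quoted, such as `"D" REL VALVE ASSY`, falls through to the unquoted branch and is returned unchanged.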
    