Remove carriage returns from CSV data value

为君一笑 submitted on 2019-12-14 03:24:51

Question


I am importing data from a pipe-delimited CSV into MySQL using a LOAD DATA INFILE statement, with lines terminated by '\r\n'. My problem is that some of the values within a row also contain '\r\n', which causes the load to fail. I have similar files that use just '\n' within values to indicate line breaks, and those cause no issues.

Example GOOD CSV

School|City|State|Country\r
Harvard University|Cambridge|MA|USA\r
Princeton University|Princeton|New
Jersey
|USA\r

Example BAD CSV

School|City|State|Country\r
Harvard University|Cambridge|MA|USA\r
Princeton University|Princeton|New\r
Jersey\r
|USA\r

Is there a way to pre-process the CSV, using sed, awk, or perl, to remove the extra carriage returns from the column values?


Answer 1:


This is one possible solution in Perl. It reads a line and, if that line has fewer than 4 fields, keeps reading the next line and merging it in until there are 4 fields. Just change the value of $number_of_fields to the right number.

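#!/usr/bin/perl

use strict;
use warnings;

# Expected number of pipe-delimited fields per record.
my $number_of_fields=4;

while(<STDIN>)
    {
    s/[\r\n]//g;
    # A limit of -1 keeps trailing empty fields.
    my @fields=split(/\|/,$_,-1);
    next if($#fields==-1);   # skip blank lines

    # Too few fields: the record continues on the next physical line.
    while($#fields<$number_of_fields-1)
        {
        # Stop at end of input; defined() keeps a lone "0" line from ending the loop.
        defined(my $nextline=<STDIN>) or last;
        $nextline =~ s/[\r\n]//g;
        my @tmpfields=split(/\|/,$nextline,-1);
        next if($#tmpfields==-1);
        # Rejoin the broken value with a bare "\n".
        $fields[$#fields] .= "\n".$tmpfields[0];
        shift @tmpfields;
        push @fields,@tmpfields;
        }
    print join("|",@fields),"\r\n";
    }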
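
The same merge-until-complete idea can also be sketched in POSIX awk. This is an illustration rather than part of the original answer: merge_broken_rows is a hypothetical function name, the field count is passed as its first argument, and, like the Perl script, it assumes a broken value never falls in the last field of a row, since a record is emitted as soon as it has enough delimiters.

```shell
# Hypothetical helper: merge_broken_rows N < input > output
merge_broken_rows() {
    awk -v n="$1" '
    {
        sub(/\r$/, "")                         # drop the CR that ends each physical line
        if (!pending && $0 == "") next         # skip blank lines between records
        rec = (pending ? rec "\n" $0 : $0)     # rejoin pieces with a bare LF
        pending = 1
        tmp = rec
        if (gsub(/[|]/, "", tmp) >= n - 1) {   # n fields means n-1 delimiters
            printf "%s\r\n", rec               # complete record: terminate with CRLF
            pending = 0
        }
    }
    END { if (pending) printf "%s\r\n", rec }  # flush a short final record as-is
    '
}
```

A typical call would be merge_broken_rows 4 < bad.csv > fixed.csv, where both file names are placeholders. Piping the result through cat -v shows ^M only at the true record ends.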



Answer 2:


With GNU awk for multi-char RS and RT:

$ awk -v RS='([^|]+[|]){3}[^|]+\r\n' -v ORS= '{$0=RT; gsub(/\r/,""); sub(/\n$/,"\r\n")} 1' file | cat -v
School|City|State|Country^M
Harvard University|Cambridge|MA|USA^M
Princeton University|Princeton|New
Jersey
|USA^M

Note that it assumes the number of fields is 4, so if you have some other number of fields, change 3 to that number minus 1. Because each field is matched with [^|]+, it also assumes that no field is empty. The script could instead calculate the number of fields by reading the first line of your input, provided that first line cannot have your problem:

$ awk '
    BEGIN { RS="\r\n"; ORS=""; FS="|" }
    FNR==1 { RS="([^|]+[|]){"NF-1"}[^|]+\r\n"; RT=$0 RT }
    { $0=RT; gsub(/\r/,""); sub(/\n$/,"\r\n"); print }
' file | cat -v
School|City|State|Country^M
Harvard University|Cambridge|MA|USA^M
Princeton University|Princeton|New
Jersey
|USA^M


Source: https://stackoverflow.com/questions/46607729/remove-carriage-returns-from-csv-data-value
