Dealing with commas in a CSV file

傲寒 2020-11-21 06:53

I am looking for suggestions on how to handle a CSV file that our customers create and then upload, and that may contain a comma inside a value, such as a company name.

27 answers
  •  萌比男神i
    2020-11-21 07:07

    If you are on a *nix system with sed available, and the unwanted comma(s) can occur in only one specific field of your CSV, you can use the following one-liner to enclose that field in double quotes ("), as RFC 4180, Section 2 proposes:

    sed -r 's/([^,]*,[^,]*,[^,]*,)(.*)(,.*,.*)/\1"\2"\3/' inputfile
    

    Depending on which field the unwanted comma(s) may be in, you have to alter or extend the capturing groups of the regex (and the substitution) accordingly.
    The example above encloses the fourth field (out of six) in quotation marks.
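
As a quick check, the one-liner can be run against a single made-up row. The sample data below is hypothetical, and -E is used as the portable spelling of GNU sed's -r; this is a sketch, not part of the original answer.

```shell
# Hypothetical sample row: six fields, the fourth (a company name) has a stray comma.
line='1,2020-11-21,EUR,Acme, Inc.,42,paid'

# -E enables extended regular expressions (same meaning as -r in GNU sed).
quoted=$(printf '%s\n' "$line" | sed -E 's/([^,]*,[^,]*,[^,]*,)(.*)(,.*,.*)/\1"\2"\3/')

printf '%s\n' "$quoted"
```

The first group consumes exactly three comma-free fields, and the last group needs two more commas, so the greedy (.*) in the middle is left with everything that belongs to the fourth field.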

    In combination with the --in-place option (-i), you can apply these changes directly to the file.
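
A cautious way to edit in place is to give -i a backup suffix, so the original file survives if the regex turns out not to fit the data. The sketch below uses a temporary file with made-up content; the file and its row are assumptions for illustration only.

```shell
# Hypothetical input: a temporary file with one six-field row.
csv_file=$(mktemp)
printf '%s\n' 'a,b,c,Foo, Bar Ltd.,e,f' > "$csv_file"

# -i.bak rewrites the file in place and keeps the original as "$csv_file.bak".
sed -i.bak -E 's/([^,]*,[^,]*,[^,]*,)(.*)(,.*,.*)/\1"\2"\3/' "$csv_file"

cat "$csv_file"
```

Writing the suffix directly after -i (no space) is accepted by both GNU and BSD sed, which is why it is used here.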

    In order to "build" the right regex, there's a simple principle to follow:

    1. For every field in your CSV that comes before the field with the unwanted comma(s), write one [^,]*, (including the trailing comma) and put them all together in one capturing group.
    2. For the field that contains the unwanted comma(s), write (.*).
    3. For every field after the field with the unwanted comma(s), write one ,.* and put them all together in one capturing group.
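
The three steps above can be sketched as a small shell function that assembles the substitution from a field position and a field count. The name build_quote_cmd and its arguments are hypothetical, and the sketch only covers a field in the middle of the row; the first and last field need two-group forms and are rejected here.

```shell
# build_quote_cmd TARGET TOTAL
# Hypothetical helper: prints the sed substitution that quotes field TARGET
# in a CSV with TOTAL fields. Only middle fields (1 < TARGET < TOTAL) are handled.
build_quote_cmd() {
  target=$1
  total=$2
  if [ "$target" -le 1 ] || [ "$target" -ge "$total" ]; then
    echo "build_quote_cmd: TARGET must be a middle field" >&2
    return 1
  fi

  before=''
  after=''
  i=1
  # Step 1: one [^,]*, per field before the target.
  while [ "$i" -lt "$target" ]; do
    before="${before}[^,]*,"
    i=$((i + 1))
  done
  # Step 3: one ,.* per field after the target.
  while [ "$i" -lt "$total" ]; do
    after="${after},.*"
    i=$((i + 1))
  done
  # Step 2: the target itself is (.*), wrapped in quotes by the substitution.
  printf 's/(%s)(.*)(%s)/\\1"\\2"\\3/\n' "$before" "$after"
}

# Usage: quote the fourth of six fields in a made-up row.
cmd=$(build_quote_cmd 4 6)
printf '%s\n' 'a,b,c,Foo, Bar Ltd.,e,f' | sed -E "$cmd"
```

For the arguments 4 and 6 the function prints the same expression as the one-liner at the top of this answer, so it can be passed to sed -E unchanged.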

    Here is a short overview of possible regexes and substitutions, depending on which field has to be quoted. Where no substitution is given, it is \1"\2"\3.

    (.*)(,.*,.*)                     #first field (out of three fields), regex
    "\1"\2                           #first field, substitution
    
    ([^,]*,[^,]*,)(.*)               #last field (out of three fields), regex
    \1"\2"                           #last field, substitution
    
    
    ([^,]*,)(.*)(,.*,.*,.*)          #second field (out of five fields)
    ([^,]*,[^,]*,)(.*)(,.*)          #third field (out of four fields)
    ([^,]*,[^,]*,[^,]*,)(.*)(,.*,.*) #fourth field (out of six fields)
    

    If you want to remove the unwanted comma(s) with sed instead of enclosing them with quotation marks refer to this answer.
