Replace column comma separator in csv file and handle fields with single quotes around value

Submitted by 我与影子孤独终老i on 2021-02-08 07:27:46

Question


A system produces a CSV file over which I have no influence.

In two of the columns, the values may be enclosed in a pair of single quotes if the data itself contains commas.

Example Data - 4 columns

123,'abc,def,ghf',ajajaj,1 
345,abdf,'abc,def,ghi',2
556,abdf,def,3
999,'a,b,d','d,e,f',4

Result I want using PowerShell:

The commas that separate the fields (that is, those that are not part of the data) are replaced with a specified delimiter (pipe-star, |*, in the example below). Commas between a pair of single quotes remain as commas.

Result

123|*'abc,def,ghf'|*ajajaj|*1 
345|*abdf|*'abc,def,ghi'|*2
556|*abdf|*def|*3
999|*'a,b,d'|*'d,e,f'|*4

I would like to do this in PowerShell or C# (.NET), if possible using a regular expression, but I don't know how.


Answer 1:


Although I think this would create a strangely formatted CSV file, with PowerShell you can use the switch statement together with the -Regex and -File parameters. This is probably the fastest way to handle large files, and it takes just a few lines of code:

# create a regex that will find commas unless they are inside single quotes
$commaUnlessQuoted = ",(?=([^']*'[^']*')*[^']*$)"

$result = switch -Regex -File 'D:\test.csv' {
    # added -replace "'" to also remove the single quotes as commented
    default { $_ -replace "$commaUnlessQuoted", '|*' -replace "'" }
}

# output to console
$result

# output to new (sort-of) CSV file
$result | Set-Content -Path 'D:\testoutput.csv'


Update

As mklement0 pointed out, the code above does the job, but at the expense of building the complete updated data as an array in memory before writing it to the output file.
If this is a problem (file too large to fit in the available memory), you can change the code to read a line from the original, apply the replacement, and write that line to the output file immediately.

This next approach uses hardly any memory, but at the expense of performing many more write actions on disk.

# make sure this is an absolute path for .NET
$outputFile = 'D:\output.csv'
$inputFile  = 'D:\input.csv'

# create a regex that will find commas unless they are inside single quotes
$commaUnlessQuoted = ",(?=([^']*'[^']*')*[^']*$)"

# create a StreamWriter object. Uses UTF8Encoding without BOM (Byte Order Mark) by default.
# if you need a different encoding for the output file, use for instance
# $writer = [System.IO.StreamWriter]::new($outputFile, $false, [System.Text.Encoding]::Unicode)
$writer = [System.IO.StreamWriter]::new($outputFile)
try {
    switch -Regex -File $inputFile {
        default {
            # added -replace "'" to also remove the single quotes as commented
            $line = $_ -replace $commaUnlessQuoted, '|*' -replace "'"
            $writer.WriteLine($line)
            # if you want, uncomment the next line to show on console
            # $line
        }
    }
}
finally {
    # flush buffered output and close the file when done, even if an error occurred
    $writer.Dispose()
}

Result:

123|*abc,def,ghf|*ajajaj|*1 
345|*abdf|*abc,def,ghi|*2
556|*abdf|*def|*3
999|*a,b,d|*d,e,f|*4

Regex details:

,                 Match the character “,” literally
(?=               Assert that the regex below can be matched, starting at this position (positive lookahead)
   (              Match the regular expression below and capture its match into backreference number 1
      [^']        Match any character that is NOT a “'”
         *        Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      '           Match the character “'” literally
      [^']        Match any character that is NOT a “'”
         *        Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      '           Match the character “'” literally
   )*             Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   [^']           Match any character that is NOT a “'”
      *           Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   $              Assert position at the end of the string (or before the line break at the end of the string, if any)
)
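
As a quick check of the pattern explained above, the following sketch applies it to the question's sample rows held in memory instead of a file. The variable names $sampleRows and $converted are illustrative only. The trailing -replace "'" is left out here, so the single quotes are kept, which is the output the question originally asked for. Note the assumption built into the pattern: single quotes must come in balanced pairs on each line, so a stray apostrophe in unquoted data (for example O'Brien) would make it misjudge which commas are separators.

```powershell
# Same pattern as above: a comma that is followed by an even number
# of single quotes up to the end of the line
$commaUnlessQuoted = ",(?=([^']*'[^']*')*[^']*$)"

# Sample rows from the question (illustrative in-memory data, no file needed)
$sampleRows = @(
    "123,'abc,def,ghf',ajajaj,1"
    "345,abdf,'abc,def,ghi',2"
    "556,abdf,def,3"
    "999,'a,b,d','d,e,f',4"
)

# -replace is applied to each array element separately
$converted = $sampleRows -replace $commaUnlessQuoted, '|*'
$converted
```

With this sample data, the rows printed should match the "Result" listing in the question, quotes included.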



Answer 2:


Theo's helpful answer is concise and efficient.

Let me complement with the following solution, which:

  • shows how to parse each CSV row into an array of field values, based on recognizing the embedded '...' quoting (it could easily be adapted to "..." quoting), without including the ' chars. in the output (they aren't syntactically needed anymore if a delimiter such as | is used instead).

  • shows a faster way to write the output file, using System.IO.File.WriteAllLines

# Input and output file paths.
# IMPORTANT: To use file paths with .NET methods, as below, always use
#            FULL PATHS, because .NET's current directory differs from PowerShell's
$inPath = "$PWD/input.csv"
$outPath = "$PWD/output.csv"

[IO.File]::WriteAllLines(
  $outPath,
  # CAVEAT: Even though ReadLines() enumerates *lazily* itself,
  #         applying PowerShell's .ForEach() method to it causes the lines
  #         to all be collected in memory first.
  [IO.File]::ReadLines($inPath).ForEach({
    # Parse the row into field values, whether they're single-quoted or not.
    $fieldValues = $_ -split "(?x) ,? ( '[^']*' | [^,]* ) ,?" -ne '' -replace "'"
    # Join the field values - without single quotes - to form a row with the
    # new delimiter.
    $fieldValues -join '|'
  })
)

* For brevity I've omitted an important optimization: if (-not $_.Contains("'")) { $_.Replace(",", "|") } could be used to process lines that contain no ' chars. much more quickly.
* -split, the regex-based string splitting operator is used to split the lines into fields.
* Inline option (?x) is used to make the regex more readable, as explained in this answer.
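
The optimization mentioned in the first note could be combined with the -split approach roughly as follows. This is a sketch only: the function name Convert-QuotedCsvLine and its parameters are hypothetical and not part of the original answer. One difference worth checking against real data: the fast path keeps empty fields (a,,b becomes a||b), whereas the -split path drops them because of the -ne '' filter.

```powershell
# Hypothetical helper: converts one CSV line to the new delimiter,
# dropping the single quotes around quoted fields.
function Convert-QuotedCsvLine {
    param(
        [string] $Line,
        [string] $Delimiter = '|'
    )

    # Fast path: without any single quotes, every comma is a field separator.
    if (-not $Line.Contains("'")) {
        return $Line.Replace(',', $Delimiter)
    }

    # Slow path: split into fields, honoring '...' quoting, then strip the quotes.
    $fieldValues = $Line -split "(?x) ,? ( '[^']*' | [^,]* ) ,?" -ne '' -replace "'"
    $fieldValues -join $Delimiter
}

# Usage with two of the question's sample rows
Convert-QuotedCsvLine "556,abdf,def,3"
Convert-QuotedCsvLine "999,'a,b,d','d,e,f',4"
```

The function could then be called once per line inside either of the loops shown in this answer.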

As the code comments state, the solution above still loads the entire file into memory.

Streaming the lines through the pipeline avoids that, but it slows the solution down considerably:

& {
  foreach ($line in [IO.File]::ReadLines($inPath)) {
    $fieldValues = $line -split "(?x) ,? ( '[^']*' | [^,]* ) ,?" -ne '' -replace "'"
    $fieldValues -join '|'
  }
} | Set-Content -Encoding Utf8 $outPath

With either solution, the output file ends up containing the following (note the absence of the ' chars.):

123|abc,def,ghf|ajajaj|1
345|abdf|abc,def,ghi|2
556|abdf|def|3
999|a,b,d|d,e,f|4


Source: https://stackoverflow.com/questions/60006585/replace-column-comma-separator-in-csv-file-and-handle-fields-with-single-quotes
