How can I keep UNIX LF line endings?

Submitted by 百般思念 on 2021-02-08 07:25:41

Question


I have a large (9 GiB), ASCII encoded, pipe delimited file with UNIX-style line endings; 0x0A.

I want to sample the first 100 records into a file for investigation. The following produces 100 records (1 header record and 99 data records). However, it changes the line endings to DOS/Windows style: CRLF, 0x0D0A.

Get-Content -Path .\wellmed_hce_elig_20191223.txt |
    Select-Object -first 100 |
    Out-File -FilePath .\elig.txt -Encoding ascii

I know about iconv, recode, and dos2unix. Those programs are not on my system and are not permitted to be installed. I have searched and found a number of places on how to get to CRLF. I have not found anything on getting to or keeping LF.

How can I produce the file with LF line endings instead of CRLF?


Answer 1:


You could join the lines from the Get-Content cmdlet with the Unix "`n" newline and save that.

Something like

((Get-Content -Path .\wellmed_hce_elig_20191223.txt | 
        Select-Object -first 100) -join "`n") |
        Out-File -FilePath .\elig.txt -Encoding ascii -NoNewLine
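
To check that the result really contains only LF, one option is to inspect the raw bytes of the output. The following is a minimal sketch, assuming PowerShell 5.0 or later (for -NoNewline); the temp file names and the three-line sample input are hypothetical stand-ins for the real data.

```powershell
# Hypothetical sample input: three LF-terminated records in a temp file.
$src = Join-Path ([IO.Path]::GetTempPath()) 'lf_sample_in.txt'
$dst = Join-Path ([IO.Path]::GetTempPath()) 'lf_sample_out.txt'
[IO.File]::WriteAllText($src, "id|name`n1|a`n2|b`n")

# Join the first two lines with LF and write them without an appended newline.
((Get-Content -Path $src | Select-Object -First 2) -join "`n") |
    Out-File -FilePath $dst -Encoding ascii -NoNewline

# Count CR (0x0D) and LF (0x0A) bytes in the output file.
$bytes   = [IO.File]::ReadAllBytes($dst)
$crCount = @($bytes | Where-Object { $_ -eq 0x0D }).Count
$lfCount = @($bytes | Where-Object { $_ -eq 0x0A }).Count
"CR bytes: $crCount, LF bytes: $lfCount"
```

A CR count of zero indicates that no CRLF sequences were introduced. Because the lines are only joined, the file has no trailing newline after the last record.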



Answer 2:


To complement Theo's helpful answer (Answer 1 above) with a performance optimization based on the little-used -ReadCount parameter:

Set-Content -NoNewLine -Encoding ascii .\outfile.txt -Value (
  ((Get-Content -First 100 -ReadCount 100 .\file.txt) -join "`n") + "`n"
)
  • -First 100 instructs Get-Content to read (at most) 100 lines.

  • -ReadCount 100 causes these 100 lines to be read and emitted at once, as an array, which speeds up reading and subsequent processing.

    • Note: In PowerShell [Core] v7.0+ you can use shorthand -ReadCount 0 in combination with -First <n> to mean: read the requested <n> lines as a single array; due to a bug in earlier versions, including Windows PowerShell, -ReadCount 0 always reads the entire file, even in the presence of -First (aka -TotalCount aka -Head).
      Also, even as of PowerShell [Core] 7.0.0-rc.2 (current as of this writing), combining -ReadCount 0 with -Last <n> (aka -Tail) should be avoided (for now): while output produced is correct, behind the scenes it is again the whole file that is read; see this GitHub issue.
  • Note the + "`n", which ensures that the output file will have a trailing newline as well (which text files in the Unix world are expected to have).

While the above also works with -Last <n> (-Tail <n>) to extract from the end of the file, Theo's (slower) Select-Object solution offers more flexibility with respect to extracting arbitrary ranges of lines, thanks to available parameters -Skip, -SkipLast, and -Index; however, offering these parameters also directly on Get-Content for superior performance is being proposed in this GitHub feature request.
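
As an illustration of the range extraction mentioned above, the sketch below combines -TotalCount on Get-Content with Select-Object -Skip to pull out a block of lines from the middle of a file while keeping LF endings. It assumes PowerShell 5.0 or later; the file names, the sample content, and the chosen range (lines 3 and 4) are hypothetical.

```powershell
# Hypothetical sample input: a header plus five LF-terminated data records.
$src = Join-Path ([IO.Path]::GetTempPath()) 'range_in.txt'
$dst = Join-Path ([IO.Path]::GetTempPath()) 'range_out.txt'
[IO.File]::WriteAllText($src, "h`nr1`nr2`nr3`nr4`nr5`n")

# Read only the first 4 lines, then drop the first 2, leaving lines 3 and 4.
$lines = Get-Content -Path $src -TotalCount 4 | Select-Object -Skip 2

# Join with LF, append a trailing LF, and write without a platform newline.
Set-Content -NoNewline -Encoding ascii -Path $dst -Value (($lines -join "`n") + "`n")

# Read the result back as one string for inspection.
$text = [IO.File]::ReadAllText($dst)
```

Limiting the read with -TotalCount keeps Get-Content from scanning the rest of a large file; Select-Object then only has to discard the leading lines.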

Also note that I've used Set-Content instead of Out-File.
If you know you're writing text, Set-Content is sufficient and generally faster (though in this case this won't matter, given that the data to write is passed as a single value).

For a comprehensive overview of the differences between Set-Content and Out-File / >, see this answer.


Set-Content vs. Out-File benchmark:

Note: This benchmark compares the two cmdlets with respect to writing many input strings received via the pipeline to a file.

# Sample array of 100,000 lines.
$arr = (, 'foooooooooooooooooooooo') * 1e5
# Time writing the array lines to a file, first with Set-Content, then
# with Out-File.
$file = [IO.Path]::GetTempFileName()
{ $arr | Set-Content -Encoding Ascii $file }, 
{ $arr | Out-File -Encoding Ascii $file } | % { (Measure-Command $_).TotalSeconds }
Remove-Item $file

Sample timing in seconds from my Windows 10 VM with Windows PowerShell v5.1:

2.6637108 # Set-Content
5.1850954 # Out-File; took almost twice as long.


Source: https://stackoverflow.com/questions/60157755/how-can-i-keep-unix-lf-line-endings
