Powershell - Count number of carriage returns line feed in .txt file

Submitted by 无人久伴 on 2021-02-04 21:07:55

Question


I have a large text file (output from a SQL database) and I need to determine the row count. However, since the source SQL data itself contains carriage returns (\r) and line feeds (\n), which NEVER appear together, the data for some rows spans multiple lines in the output .txt file. The PowerShell I'm using below gives me the file's line count, which is greater than the actual SQL row count. So I need to modify the script to ignore the additional lines. One way of doing it might be to count only the number of times CRLF (\r\n, TOGETHER) occurs in the file, which should be the actual number of rows, but I'm not sure how to do it.

Get-ChildItem "." |% {$n = $_; $c = 0; Get-Content -Path $_ -ReadCount 1000 |% { $c += $_.Count }; "$n; $c"} > row_count.txt

Answer 1:


I just learned that Get-Content splits and streams the lines of a file on CR, CRLF, and LF alike, so that it can read data from different operating systems interchangeably:

"1`r2`n3`r`n4" | Out-File .\Test.txt
(Get-Content .\Test.txt).Count
4

Reading the question again, I might have misunderstood your question.
In any case, if you want to split (count) on only a specific character combination:

CR

((Get-Content -Raw .\Test.txt).Trim() -Split '\r').Count
3

LF

((Get-Content -Raw .\Test.txt).Trim() -Split '\n').Count
3

CRLF

((Get-Content -Raw .\Test.txt).Trim() -Split '\r\n').Count # or: -Split [Environment]::NewLine
2

Note the .Trim() call: it removes the trailing newline (white space) at the end of the content. That newline was appended to the file by Out-File, and Get-Content -Raw returns it as part of the string.
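
If only the number of CRLF sequences is needed, rather than the split pieces, the occurrences can also be counted directly. A minimal sketch, assuming the file is small enough to load with -Raw; the sample file name CrLfSample.txt and its location in the temp folder are illustrative choices, not part of the original question:

```powershell
# Hypothetical sample file, written without any implicit trailing newline
$Path = Join-Path ([System.IO.Path]::GetTempPath()) 'CrLfSample.txt'
[System.IO.File]::WriteAllText($Path, "1`r2`n3`r`n4`r`n")

# Count every occurrence of a CR immediately followed by an LF;
# lone CR and lone LF characters are ignored
$CrLfCount = [regex]::Matches((Get-Content -Raw -LiteralPath $Path), '\r\n').Count
$CrLfCount   # 2 for the sample content above
```

Because each CRLF-terminated row contributes exactly one match, no .Trim() is needed here. A last row without a terminating CRLF would not be counted.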


Addendum

(Update based on the comment on the memory exception)
I am afraid that there is currently no other option than building your own StreamReader loop using the ReadBlock method and splitting lines specifically on CRLF. I have opened a feature request for this issue: -NewLine Parameter to customize line separator for Get-Content

Get-Lines

A possible way to work around the memory exception errors:

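function Get-Lines {
    [CmdletBinding()][OutputType([string])] param(
        [Parameter(ValueFromPipeLine = $True)][string] $Filename,
        [String] $NewLine = [Environment]::NewLine
    )
    Begin {
        [Char[]] $Buffer = New-Object Char[] 4096
        $Pattern = [Regex]::Escape($NewLine) # -Split expects a regular expression
    }
    Process {
        # Open the file here rather than in Begin: pipeline input is not bound yet in Begin
        $FullName = (Get-Item $Filename).FullName
        $Reader = New-Object -TypeName System.IO.StreamReader -ArgumentList $FullName
        $Rest = '' # Note that a multiple character newline (as CRLF) could be split at the end of the buffer
        Try {
            While ($True) {
                $Length = $Reader.ReadBlock($Buffer, 0, $Buffer.Length)
                If (!$Length) { Break }
                $Split = ($Rest + [String]::new($Buffer, 0, $Length)) -Split $Pattern
                If ($Split.Count -gt 1) { $Split[0..($Split.Count - 2)] }
                $Rest = $Split[-1]
            }
            # A trailing newline terminates the last line; it does not start another (empty) one
            If ($Rest.Length) { $Rest }
        }
        Finally {
            $Reader.Dispose() # Release the file handle, also when the pipeline is stopped early
        }
    }
}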

Usage

To prevent the memory exceptions, it is important that you do not assign the results to a variable or wrap the call in parentheses, as this will stall the PowerShell pipeline and store everything in memory.

$Count = 0
Get-Lines .\Test.txt | ForEach-Object { $Count++ }
$Count
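
If only the count matters and the lines themselves are not needed, the same block-wise reading can be used without building any line strings at all. The following is a sketch under assumptions, not a tuned implementation: Measure-CrLf is a hypothetical helper name, the 65536-character buffer size is arbitrary, and the [type]::new() syntax requires PowerShell 5 or later.

```powershell
function Measure-CrLf {   # hypothetical helper name, not a built-in cmdlet
    param([Parameter(Mandatory)][string] $Path)

    # Resolve to a full file system path, as .NET does not know PowerShell's current location
    $Reader = [System.IO.StreamReader]::new((Convert-Path -LiteralPath $Path))
    try {
        $Buffer = [char[]]::new(65536)
        $Count = 0
        $PreviousWasCR = $false   # remembers a CR that ended the previous block
        while (($Length = $Reader.ReadBlock($Buffer, 0, $Buffer.Length)) -gt 0) {
            for ($i = 0; $i -lt $Length; $i++) {
                $Char = $Buffer[$i]
                # Count an LF only when it directly follows a CR
                if ($PreviousWasCR -and $Char -eq "`n") { $Count++ }
                $PreviousWasCR = ($Char -eq "`r")
            }
        }
        $Count
    }
    finally {
        $Reader.Dispose()
    }
}
```

Calling Measure-CrLf -Path .\Test.txt returns the number of CRLF sequences in the file. Memory use stays bounded by the buffer, but a character-by-character loop in PowerShell is slow, so on very large files this trades speed for simplicity.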



Answer 2:


  • The System.IO.StreamReader.ReadBlock solution that reads the file in fixed-size blocks and performs custom splitting into lines in iRon's helpful answer is the best choice, because it both avoids out-of-memory problems and performs well (by PowerShell standards).

  • If performance in terms of execution speed isn't paramount, you can take advantage of
    Get-Content's -Delimiter parameter, which accepts a custom string to split the file content by:

# Outputs the count of CRLF-terminated lines.
(Get-Content largeFile.txt -Delimiter "`r`n" | Measure-Object).Count

Note that -Delimiter employs optional-terminator logic when splitting: that is, if the file content ends in the given delimiter string, no extra, empty element is reported at the end.

This is consistent with the default behavior, where a trailing newline in a file is considered an optional terminator that does not result in an additional, empty line getting reported.

However, in case a -Delimiter string that is unrelated to newline characters is used, a trailing newline is considered a final "line" (element).

A quick example:

# Create a test file without a trailing newline.
# Note the CR-only newline (`r) after 'line1'
"line1`rrest of line1`r`nline2" | Set-Content -NoNewLine test1.txt

# Create another test file with the same content plus 
# a trailing CRLF newline.
"line1`rrest of line1`r`nline2`r`n" | Set-Content -NoNewLine test2.txt

'test1.txt', 'test2.txt' | ForEach-Object {
  "--- $_"
  # Split by CRLF only and enclose the resulting lines in [...]
  Get-Content $_ -Delimiter "`r`n" | 
    ForEach-Object { "[{0}]" -f ($_ -replace "`r", '`r') }
}

This yields:

--- test1.txt
[line1`rrest of line1]
[line2]
--- test2.txt
[line1`rrest of line1]
[line2]

As you can see, the two test files were processed identically, because the trailing CRLF newline was considered an optional terminator for the last line.
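
Applied to the per-file loop from the question, the -Delimiter approach could look roughly like this. It is a sketch only: Get-CrLfRowCount is a hypothetical helper name, and it assumes every file in the folder is a text file whose rows are terminated by CRLF.

```powershell
function Get-CrLfRowCount {   # hypothetical helper name, not a built-in cmdlet
    param([string] $Folder = '.')

    Get-ChildItem -LiteralPath $Folder -File | ForEach-Object {
        # Split on CRLF only, so lone CR or LF characters inside a row are not counted
        $Count = (Get-Content -LiteralPath $_.FullName -Delimiter "`r`n" | Measure-Object).Count
        "$($_.Name); $Count"
    }
}
```

Get-CrLfRowCount -Folder . emits one "name; count" string per file, in the same format as the original script. If the output is redirected to a file, it is safer to write it outside the scanned folder so that the report is not counted as one of the inputs.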



Source: https://stackoverflow.com/questions/65813430/powershell-count-number-of-carriage-returns-line-feed-in-txt-file
