Get newline stats for a text file in Python

Question

I had a nasty CRLF / LF conflict in a git file that was probably committed from a Windows machine. Is there a cross-platform way (preferably in Python) to detect which type of newline is dominant in the file?

I've got this code (based on an idea from https://stackoverflow.com/a/10562258/239247):

import sys
if not sys.argv[1:]:
  sys.exit('usage: %s <filename>' % sys.argv[0])

with open(sys.argv[1],"rb") as f:
  d = f.read()
  crlf, lfcr = d.count('\r\n'), d.count('\n\r')
  cr, lf = d.count('\r'), d.count('\n')
  print('crlf: %s' % crlf)
  print('lfcr: %s' % lfcr)
  print('cr: %s' % cr)
  print('lf: %s' % lf)
  print('\ncr-crlf-lfcr: %s' % (cr - crlf - lfcr))
  print('lf-crlf-lfcr: %s' % (lf - crlf - lfcr))
  print('\ntotal (lf+cr-2*crlf-2*lfcr): %s\n' % (lf + cr - 2*crlf - 2*lfcr))

But it gives the wrong stats (for this file):

crlf: 1123
lfcr: 58
cr: 1123
lf: 1123

cr-crlf-lfcr: -58
lf-crlf-lfcr: -58

total (lf+cr-2*crlf-2*lfcr): -116

Answer 1:

import sys


def calculate_line_endings(filename):
    cr = lf = crlf = lfcr = 0
    # Binary mode: bytes are read unchanged and lines are split on b'\n' only.
    with open(filename, "rb") as f:
        for line in f:
            # Compare against bytes literals, since each line is a bytes object.
            if line.endswith(b'\r\n'):
                crlf += 1
            elif line.endswith(b'\n\r'):
                lfcr += 1
            elif line.endswith(b'\r'):
                cr += 1
            elif line.endswith(b'\n'):
                lf += 1

    print('crlf: %s' % crlf)
    print('lfcr: %s' % lfcr)
    print('cr: %s' % cr)
    print('lf: %s' % lf)


if __name__ == '__main__':
    if len(sys.argv) == 1:
        sys.exit('usage: %s <filename>' % sys.argv[0])
    else:
        calculate_line_endings(sys.argv[1])

This gives the following output for your file:

crlf: 1123
lfcr: 0
cr: 0
lf: 0

Is it enough?
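
One caveat with the line-by-line approach: a file opened in binary mode is split on \n only, so a bare \r never ends a line except at the very end of the file. Here is a minimal sketch of a variant that reads the whole file content and uses bytes.splitlines(keepends=True), which treats \r\n, \r and \n as terminators. The name count_terminators is made up for illustration, and \n\r is not a terminator for splitlines, so it shows up as one \n plus one \r.

```python
from collections import Counter


def count_terminators(data):
    """Count line terminators in a bytes object.

    bytes.splitlines() treats b'\\r\\n', b'\\r' and b'\\n' as line breaks;
    keepends=True leaves the terminator attached to each line.
    """
    counts = Counter({'crlf': 0, 'cr': 0, 'lf': 0})
    for line in data.splitlines(keepends=True):
        if line.endswith(b'\r\n'):
            counts['crlf'] += 1
        elif line.endswith(b'\r'):
            counts['cr'] += 1
        elif line.endswith(b'\n'):
            counts['lf'] += 1
        # A final line without a terminator is not counted.
    return dict(counts)


if __name__ == '__main__':
    sample = b'one\r\ntwo\nthree\rfour'
    print(count_terminators(sample))
```

For a real file, pass it the result of open(filename, 'rb').read().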

Answer 2:

The best way to deal with line endings in git is to use git configuration. You can define exactly what must be done to line endings globally, in a particular repository, or for specific files. In a .gitattributes file, you can specify that certain files must be converted to the native line endings of your system on each checkout and converted back on check-in. See the GitHub line endings help for a detailed description.
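
As a rough illustration, a .gitattributes file along these lines asks Git to normalize text files. The file patterns are examples only and are not taken from the repository in question. How text files are written on checkout also depends on your core.autocrlf and core.eol settings.

```
# Let Git detect text files and store them with LF in the repository
* text=auto

# Force a specific ending in the working tree for some file types
*.sh  text eol=lf
*.bat text eol=crlf

# Never convert binary files
*.png binary
```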

Answer 3:

I would recommend checking whether you have the following case: \r\n\r\n\r\n. Your code counts it as follows:

crlf: 3 -- [\r\n][\r\n][\r\n]
lfcr: 2 -- \r[\n\r][\n\r]\n
cr: 3   -- [\r]\n[\r]\n[\r]\n
lf: 3   -- \r[\n]\r[\n]\r[\n]

cr-crlf-lfcr: -2
lf-crlf-lfcr: -2

total (lf+cr-2*crlf-2*lfcr): -4

As you can see, some \n's and some \r's are counted twice, once for crlf and once for lfcr. Instead, you can read the file line by line and check each line ending with line.endswith(). To get exact statistics for cr and lf, count each \r\n and each \n\r as cr+1 and lf+1.
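
The double counting can be reproduced in a few lines. This is only a sketch: scan_terminators is a hypothetical helper, not code from the question. It walks the data from left to right so that every byte belongs to at most one terminator.

```python
sample = b'\r\n\r\n\r\n'

# Independent substring counts overlap: each b'\n\r' found here is made of
# bytes that also belong to two neighbouring b'\r\n' pairs.
naive = {
    'crlf': sample.count(b'\r\n'),
    'lfcr': sample.count(b'\n\r'),
    'cr': sample.count(b'\r'),
    'lf': sample.count(b'\n'),
}


def scan_terminators(data):
    """Count terminators left to right, consuming pairs before singles."""
    counts = {'crlf': 0, 'lfcr': 0, 'cr': 0, 'lf': 0}
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        single = data[i:i + 1]
        if pair == b'\r\n':
            counts['crlf'] += 1
            i += 2
        elif pair == b'\n\r':
            counts['lfcr'] += 1
            i += 2
        elif single == b'\r':
            counts['cr'] += 1
            i += 1
        elif single == b'\n':
            counts['lf'] += 1
            i += 1
        else:
            i += 1  # ordinary byte, not part of a terminator
    return counts


if __name__ == '__main__':
    print(naive)
    exact = scan_terminators(sample)
    print(exact)
    # Total number of \r bytes: every pair contributes one \r as well.
    print(exact['cr'] + exact['crlf'] + exact['lfcr'])
```

With the non-overlapping scan, the total number of \r bytes is cr + crlf + lfcr, and likewise for \n.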

Answer 4:

The posted code doesn't work properly because Counter is counting characters in the file; it doesn't look for character pairs like \r\n and \n\r.

Here's some Python 2.6 code that finds each occurrence of the 4 EOL markers \r\n, \n\r, \r and \n using a regex. The trick is to look for the \r\n and \n\r pairs before looking for the single char EOL markers.

For testing purposes it creates some random text data; I wrote this before I noticed your link to a test file.

#!/usr/bin/env python

''' Find and count various line ending character combinations

    From http://stackoverflow.com/q/29695861/4014959

    Written by PM 2Ring 2015.04.17
'''

import random
import re
from itertools import groupby

random.seed(42)

#Make a random text string containing various EOL combinations
tokens = list(2*'ABCDEFGHIJK ' + '\r\n') + ['\r\n', '\n\r']
datasize = 300
data = ''.join([random.choice(tokens) for _ in range(datasize)])
print repr(data), '\n'

#regex to find various EOL combinations
pat = re.compile(r'\r\n|\n\r|\r|\n')

eols = pat.findall(data)
print eols, '\n'

grouped = [(len(list(group)), key) for key, group in groupby(sorted(eols))]
print sorted(grouped, reverse=True)

output

'FAHGIG\rC AGCAFGDGEKAKHJE\r\nJCC EKID\n\rKD F\rEHBGICGCHFKKFH\r\nGFEIEK\n\rFDH JGAIHF\r\n\rIG \nAHGDHE\n G\n\rCCBDFK BK\n\rC\n\r\rAIHDHFDAA\r\n\rHCF\n\rIFFEJDJCAJA\r\n\r IB\r\r\nCBBJJDBDH\r FDIFI\n\rGACDGJEGGBFG\n\rBGGFD\r\nDBJKFCA BIG\n\rC J\rGFA HG\nA\rDB\n\r \n\r\n EBF BK\n\rHJA \r\n\n\rDIEI\n\rEDIBEC E\r\nCFEGGD\rGEF EC\r\nFIG GIIJCA\n\r\n\rCFH\r\n\r\rKE HF\n\rGAKIG\r\nDDCDHEIFFHB\n C HAJFHID AC\r' 

['\r', '\r\n', '\n\r', '\r', '\r\n', '\n\r', '\r\n', '\r', '\n', '\n', '\n\r', '\n\r', '\n\r', '\r', '\r\n', '\r', '\n\r', '\r\n', '\r', '\r', '\r\n', '\r', '\n\r', '\n\r', '\r\n', '\n\r', '\r', '\n', '\r', '\n\r', '\n\r', '\n', '\n\r', '\r\n', '\n\r', '\n\r', '\r\n', '\r', '\r\n', '\n\r', '\n\r', '\r\n', '\r', '\r', '\n\r', '\r\n', '\n', '\r'] 

[(17, '\n\r'), (14, '\r'), (12, '\r\n'), (5, '\n')]

Here's a version that reads the data from a named file, following the pattern of the code in the question.

import re
from itertools import groupby
import sys

if not sys.argv[1:]:
    sys.exit('usage: %s <filename>' % sys.argv[0])

with open(sys.argv[1], 'rb') as f:
    data = f.read()

print repr(data), '\n'

#regex to find various EOL combinations
pat = re.compile(r'\r\n|\n\r|\r|\n')

eols = pat.findall(data)
print eols, '\n'

grouped = [(len(list(group)), key) for key, group in groupby(sorted(eols))]
print sorted(grouped, reverse=True)
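
For Python 3, the same regex idea might look like the sketch below. It is an adaptation, not the original answer's code: the names EOL_PATTERN, eol_stats and eol_stats_for_file are invented here, the pattern is a bytes pattern because the file is read in binary mode, and collections.Counter replaces the sort and groupby step.

```python
import re
from collections import Counter

# Two-byte markers come first so that a pair is never split into singles.
EOL_PATTERN = re.compile(br'\r\n|\n\r|\r|\n')


def eol_stats(data):
    """Return a Counter mapping each EOL marker (as bytes) to its count."""
    return Counter(EOL_PATTERN.findall(data))


def eol_stats_for_file(path):
    """Read a file in binary mode so no newline translation takes place."""
    with open(path, 'rb') as f:
        return eol_stats(f.read())


if __name__ == '__main__':
    sample = b'a\r\nb\r\nc\nd\n\re\r'
    # most_common() lists markers from most to least frequent, so the
    # first entry is the dominant line ending.
    for marker, count in eol_stats(sample).most_common():
        print('%r: %d' % (marker, count))
```

Calling eol_stats_for_file(filename).most_common(1) then answers the original question of which newline type dominates.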

Source: https://stackoverflow.com/questions/29695861/get-newline-stats-for-a-text-file-in-python

Tags

python

newline