How to count differences between two files on Linux?

一整个雨季 2020-12-23 13:45

I need to work with large files and must find the differences between two of them. I don't need the differing bits themselves, only the number of differences.

To find the number of dif

7 answers
  •  粉色の甜心
    2020-12-23 14:13

    Here is a way to count any kind of difference between two files, using a regex to define what counts as one unit of difference. Here the regex is . (any character except newline):

    git diff --patience --word-diff=porcelain --word-diff-regex=. file1 file2 | pcre2grep -M "^@[\s\S]*" | pcre2grep -M --file-offsets "(^-.*\n)(^\+.*\n)?|(^\+.*\n)" | wc -l
    

    An excerpt from man git-diff:

    --patience
               Generate a diff using the "patience diff" algorithm.
    --word-diff[=<mode>]
               Show a word diff, using the <mode> to delimit changed words. By default, words are delimited by whitespace; see --word-diff-regex below.
               porcelain
                   Use a special line-based format intended for script consumption. Added/removed/unchanged runs are printed in the usual unified diff
                   format, starting with a +/-/` ` character at the beginning of the line and extending to the end of the line. Newlines in the input
                   are represented by a tilde ~ on a line of its own.
    --word-diff-regex=<regex>
               Use <regex> to decide what a word is, instead of considering runs of non-whitespace to be a word. Also implies --word-diff unless it
               was already enabled.
               Every non-overlapping match of the <regex> is considered a word. Anything between these matches is considered whitespace and ignored(!)
               for the purposes of finding differences. You may want to append |[^[:space:]] to your regular expression to make sure that it matches
               all non-whitespace characters. A match that contains a newline is silently truncated(!) at the newline.
               For example, --word-diff-regex=.  will treat each character as a word and, correspondingly, show differences character by character.
    

    pcre2grep is part of the pcre2-utils package on Ubuntu 20.04.
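
    If pcre2grep is not installed, the counting step could probably be done with awk instead. The following is only a sketch, not a drop-in replacement: the function names count_porcelain_diffs and count_char_diffs are made up for illustration, and --no-index is an addition that is not in the command above (it is assumed here so that git compares the two paths directly even when run inside a repository). The counting rule mirrors the second pcre2grep pattern: a removed run directly followed by an added run is one difference, and any other added or removed run is one difference.

```shell
#!/bin/sh

# Count changed runs in `git diff --word-diff=porcelain` output read from stdin.
count_porcelain_diffs() {
    awk '
        # Hunk header: everything before the first one is file header, not content.
        /^@@/    { in_hunk = 1; prev_minus = 0; next }
        # Skip the "diff --git", "---" and "+++" header lines.
        !in_hunk { next }
        # Removed run: always counts as one difference.
        /^-/     { n++; prev_minus = 1; next }
        # Added run: counts only if it does not directly follow a removed run.
        /^\+/    { if (!prev_minus) n++; prev_minus = 0; next }
        # Unchanged run or "~" newline marker: breaks any removed/added pairing.
                 { prev_minus = 0 }
        END      { print n + 0 }
    '
}

# Hypothetical wrapper: character-level difference count for two files.
# --no-index is an assumption added here, not part of the original command.
count_char_diffs() {
    git diff --no-index --patience --word-diff=porcelain --word-diff-regex=. -- "$1" "$2" |
        count_porcelain_diffs
}
```

    Calling count_char_diffs file1 file2 prints a single number. Keeping the awk part separate makes it possible to check the counting rule against saved porcelain output without running git again.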
