Git pre-commit hook that searches for non-UTF-8 encodings among modified/added files (and rejects the commit if it finds any)

Posted by 南笙酒味 on 2019-12-24 10:24:53

Question


I'm using Git for Windows (and TortoiseGit).

My goal is to prevent any commit that has at least one non-UTF-8 file among its modified/added files.

  • Enumerating modified/added files: I've found the following code

    { git diff --name-only ; git diff --name-only --staged ; }
    

    Is this the best (correct and most concise) approach?

  • Searching for non-UTF-8 files: I've found the following code

    { git diff --name-only ; git diff --name-only --staged ; } | xargs -I {} bash -c "iconv -f utf-8 -t utf-16 {} &>/dev/null || echo {} - is non-UTF8!"
    

    If I start Git Bash in my repository root folder, it works (each non-UTF-8 file is displayed). So I renamed .git/hooks/pre-commit.sample to .git/hooks/pre-commit and copy-pasted the code above into it. After committing changes, nothing special is displayed in the TortoiseGit commit GUI window, so it looks like the pre-commit hook is not working correctly.

  • Rejecting commit if there is any non-UTF-8 file: After displaying all non-UTF-8 files, the commit should be rejected. But I have no idea how to do this (return some exit code, but how?).

So any help is appreciated.


Answer 1:


So the answer is (thx to phd and great thx to torek for his useful notes):

    git diff --name-only --staged --diff-filter d | xargs -I {} bash -c "iconv -f utf-8 -t utf-16 {} &>/dev/null || { echo {} - is non-UTF8!; exit 1; }"

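This code iterates over all files changed in the commit except deleted ones (i.e. added, modified, copied and renamed files) and checks whether any of them is not valid UTF-8. Every offending file is listed. When any of the bash invocations exits with 1, xargs itself exits with a non-zero status, so as long as this pipeline is the last command in the hook, the hook fails and the commit is aborted.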
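
For the "how do I return an exit code" part of the question: Git aborts the commit whenever the pre-commit hook exits with a non-zero status. Below is a minimal sketch of the same check written as a small script rather than a one-liner. The function names (`is_utf8_file`, `check_paths`) and the `$0` guard are illustrative choices, not anything from the answers here, and the sketch assumes file names contain no newlines and are not printed in Git's quoted form.

```shell
#!/bin/sh
# Sketch of .git/hooks/pre-commit. Function names are illustrative only.

# Succeed if the work-tree file "$1" decodes as UTF-8.
is_utf8_file() {
    iconv -f utf-8 -t utf-16 "$1" >/dev/null 2>&1
}

# Read paths from stdin (one per line), report every non-UTF-8 file,
# and return 1 if at least one was found.
check_paths() {
    status=0
    while IFS= read -r path; do
        if ! is_utf8_file "$path"; then
            echo "$path - is non-UTF8!" >&2
            status=1
        fi
    done
    return "$status"
}

# Only run the check when this file is executed as the hook itself.
case "$0" in
    *pre-commit)
        # A non-zero exit status from the hook makes Git abort the commit.
        git diff --name-only --staged --diff-filter=d | check_paths || exit 1
        ;;
esac
```

Because each path is passed as a quoted argument instead of being substituted into a command string, file names with spaces should not break the check. Like the one-liner, this still reads the work-tree copy of each file.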




Answer 2:


Your existing solution is probably sufficient. It's not 100% correct though: here are the remaining issues, all of which are minor ones that you can fix later (if ever) at your leisure:

  • You need only the git diff ... --staged (or --cached), as what Git will commit is whatever files are in the index/staging-area, and git diff compares that with what's in the HEAD commit and tells you what's different there. If a copy of a file in the index differs from the copy of the file in HEAD, you should examine the index copy.

  • Technically it would be better to use git diff-index --cached here so as to not obey any of the user's git diff configuration. That is, git diff-index is a plumbing command in Git, which means it's aimed at being used from other computer programs: it runs in a completely predictable manner based on arguments only, not on any git config settings. But if you're doing this for yourself, and you configure git diff such that it breaks your own use of git diff, well, that's your own fault. :-)

  • You might also consider using a --diff-filter to exclude deleted files here. Otherwise your checker will always fail on deletion (as iconv won't be able to read the deleted file).

  • Most significant: iconv will be reading the file from the work-tree. As I noted in the first bullet point, Git is going to commit what's staged, not what's in the work-tree.

As an example—which may or may not be possible from within TortoiseGit—consider what happens if you do this:

    $ git checkout master
    $ printf '\300\300\300' > badfile    # put bad non-UTF-8 crud into file
    $ git add badfile                    # copy file into index
    $ echo 'good data' > badfile         # replace work-tree contents
    $ git commit

This commit is going to commit the bad contents (the three bytes of \300 with no newline) that are in the index, but your pre-commit hook is going to run iconv -f utf-8 -t utf-16 over the work-tree copy of the file, which contains good data and therefore passes the check.
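
As a rough illustration of that mismatch, a script can ask Git which staged paths also have unstaged changes; for those paths, a check that reads the work-tree file is looking at the wrong bytes. This is only a sketch: the function names are made up for the example, and it uses plain `git diff --quiet`, which exits with 1 when the work-tree copy differs from the index copy.

```shell
#!/bin/sh
# Sketch: list staged paths whose work-tree copy differs from the staged copy.
# Function names are illustrative only.

# Succeed if the work-tree copy of "$1" is identical to its index copy.
worktree_matches_index() {
    git diff --quiet -- "$1"
}

# Read paths from stdin (one per line) and print those with unstaged changes.
list_mismatched() {
    while IFS= read -r path; do
        worktree_matches_index "$path" || echo "$path"
    done
}
```

Run as `git diff --name-only --staged --diff-filter=d | list_mismatched` after the commands above, this would list badfile, since its work-tree contents no longer match what was staged.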

To fix this, your pre-commit hook must extract the data from the index for each file that is to be committed. How you go about doing that is up to you. The simplest (but perhaps slowest) method is to extract the entire index contents to a temporary work area using git checkout-index. A better method might be to turn each in-index (in-staging-area) path name into a valid index specifier (that is, path/to/file becomes :path/to/file) and use git cat-file -p $specifier | iconv ... to scan each one. But all of these will be fairly inefficient, especially on Windows. For efficiency, you might want to write a Python script that uses git cat-file --batch to extract them all in one pass, and do the format-checking there.
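
A minimal sketch of the `:path/to/file` approach, combined with git diff-index --cached and a --diff-filter as suggested in the bullets above, might look like the following. The function names and the `$0` guard are hypothetical, the empty-tree fallback for a repository without commits is borrowed from Git's own pre-commit.sample, and the sketch assumes path names contain no newlines or characters that Git would print in quoted form. It spawns one git cat-file and one iconv per file, so it is the simple variant, not the efficient one.

```shell
#!/bin/sh
# Sketch of .git/hooks/pre-commit that validates staged blobs rather than
# work-tree files. Function names are illustrative only.

# Print staged paths that are not deletions, one per line.
staged_paths() {
    if git rev-parse --verify HEAD >/dev/null 2>&1; then
        against=HEAD
    else
        # No commits yet: compare the index against the empty tree.
        against=$(git hash-object -t tree /dev/null)
    fi
    git diff-index --cached --name-only --diff-filter=d "$against"
}

# Read paths from stdin; report each one whose staged copy is not UTF-8.
# Returns 1 if at least one such path was found.
check_staged() {
    status=0
    while IFS= read -r path; do
        # ":$path" names the copy of the file stored in the index.
        if ! git cat-file -p ":$path" | iconv -f utf-8 -t utf-16 >/dev/null 2>&1; then
            echo "$path - is non-UTF8!" >&2
            status=1
        fi
    done
    return "$status"
}

# Only run the check when this file is executed as the hook itself.
case "$0" in
    *pre-commit)
        staged_paths | check_staged || exit 1
        ;;
esac
```

With the badfile example above, this version reads the three \300 bytes from the index, so it reports badfile even though the work-tree copy is fine.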



Source: https://stackoverflow.com/questions/55645733/git-pre-commit-hook-which-searches-non-utf-8-encodings-among-modified-added-file
