How can I remove the BOM from a UTF-8 file?

强颜欢笑 submitted on 2019-11-28 09:51:51

A BOM is Unicode codepoint U+FEFF; its UTF-8 encoding consists of the three bytes 0xEF, 0xBB, 0xBF.
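Before removing anything, it can help to confirm that a file really starts with those three bytes. The following is a minimal sketch, not a definitive implementation: the has_bom function and the bom_sample.txt file name are made up for illustration, and it assumes a head that supports -c.

```shell
# Create a small sample file that starts with a UTF-8 BOM
# (bom_sample.txt is a hypothetical name used only for this demo).
printf '\357\273\277hello\n' > bom_sample.txt

# has_bom FILE: succeed if FILE starts with the bytes EF BB BF.
# Hypothetical helper; dumps the first three bytes as hex and compares.
has_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

if has_bom bom_sample.txt; then
    echo "BOM present"
else
    echo "no BOM"
fi
```

The octal escapes \357\273\277 are the same bytes as 0xEF 0xBB 0xBF, written in the form that a POSIX printf understands.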

With bash, you can create a UTF-8 BOM with the $'' special quoting form, which implements Unicode escapes: $'\uFEFF'. So with bash, a reliable way of removing a UTF-8 BOM from the beginning of a text file would be:

sed -i $'1s/^\uFEFF//' file.txt

This will leave the file unchanged if it does not start with a UTF-8 BOM, and otherwise remove the BOM.

If you are using some other shell, you might find that "$(printf '\ufeff')" produces the BOM character (that works with zsh, as well as with any shell without a printf builtin, provided that /usr/bin/printf is the GNU version), but if you want a POSIX-compatible version you could use:

sed "$(printf '1s/^\357\273\277//')" file.txt

(The -i in-place edit flag is also a GNU extension; this version writes the possibly-modified file to stdout.)
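If the file should be modified in place without relying on sed -i, one option is to go through a temporary file. This is only a sketch under some assumptions: strip_bom and strip_demo.txt are hypothetical names, and mktemp is assumed to be installed (it is common but was not historically required by POSIX).

```shell
# strip_bom FILE: remove a leading UTF-8 BOM from FILE, if present.
# Hypothetical helper name; assumes mktemp is available.
strip_bom() {
    tmp=$(mktemp) || return 1
    # \357\273\277 are the octal forms of 0xEF 0xBB 0xBF.
    # Write the filtered text to the temp file, then copy it back over
    # the original so that the file keeps its permissions.
    sed "$(printf '1s/^\357\273\277//')" "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Demo on a throwaway file (hypothetical name).
printf '\357\273\277hello\n' > strip_demo.txt
strip_bom strip_demo.txt
```

Copying the result back with cat, rather than moving the temporary file over the original, keeps the original file's mode and ownership at the cost of a second write.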

Using VIM

  1. Open file in VIM:

    vi text.xml
    
  2. Remove BOM encoding:

    :set nobomb
    
  3. Save and quit:

    :wq
    

It is possible to remove the BOM from a file with the tail command, which here prints the file starting from its fourth byte:

tail --bytes=+4 withBOM.txt > withoutBOM.txt
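Because tail drops the first three bytes whatever they are, running it on a file without a BOM would cut off real content. A guarded variant might check for the BOM first. The sketch below uses a hypothetical function name (strip_bom_if_present), a hypothetical demo file, and the short -c option in place of --bytes; it assumes a head that supports -c.

```shell
# strip_bom_if_present FILE: print FILE to stdout, skipping a leading
# UTF-8 BOM only when one is actually there. Hypothetical helper name.
strip_bom_if_present() {
    if [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
        # The file starts with EF BB BF: output from the 4th byte onward.
        tail -c +4 "$1"
    else
        # No BOM: pass the file through unchanged.
        cat "$1"
    fi
}

# Demo on a throwaway file (hypothetical names).
printf '\357\273\277hello\n' > tail_demo.txt
strip_bom_if_present tail_demo.txt > tail_demo_clean.txt
```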

Well, just dealt with this today and my preferred way was dos2unix:

dos2unix will remove the BOM and also take care of other idiosyncrasies from other operating systems, such as CRLF line endings:

$ sudo apt install dos2unix
$ dos2unix test.xml

It's also possible to remove BOM only (-r, --remove-bom):

$ dos2unix -r test.xml

Note: tested with dos2unix 7.3.4
