Removing binary control characters from a text file

Posted by 限于喜欢 on 2019-12-08 01:55:01

Question


I have a text file that contains binary control characters, such as "^@" and "^M". When I try to perform string operations directly on the text file, the control characters crash the script.

Through trial and error, I discovered that the more command will strip the control characters so that I can process the file properly.

more file_with_control_characters.not_txt > file_without_control_characters.txt

Is this considered a good method, or is there a better way to remove control characters from a text file? Does more have this behavior in OSes earlier than Windows 8?


Answer 1:


Certainly you do not want to simply remove all control characters. Newline and Tab characters are control characters as well, and you don't want to remove those.
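To illustrate the point about selective removal, here is a minimal Python sketch (Python is my choice for illustration; the question does not say what language the script uses). It deletes the C0 control characters and DEL from already-decoded text while keeping Tab, line feed, and carriage return. The names KEEP, DELETE_TABLE, and strip_control_chars are hypothetical.

```python
# Control characters to keep: tab (0x09), line feed (0x0A), carriage return (0x0D).
KEEP = {0x09, 0x0A, 0x0D}

# Map every other C0 control code (0x00-0x1F) and DEL (0x7F) to None,
# which tells str.translate() to delete that character.
DELETE_TABLE = {
    code: None
    for code in list(range(0x20)) + [0x7F]
    if code not in KEEP
}


def strip_control_chars(text: str) -> str:
    """Return text with control characters removed, keeping tab, LF and CR."""
    return text.translate(DELETE_TABLE)


sample = "a\x00b\tc\r\nd\x07e"
print(repr(strip_control_chars(sample)))  # 'ab\tc\r\nde'
```

Note that this works on decoded text. Stripping NUL bytes from the raw bytes of a UTF-16 file is only safe when the content is pure ASCII.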

I'm assuming your ^M is a carriage return (0x0D) and ^@ is a NUL byte (0x00). The carriage returns are not causing you problems, and MORE does not remove them. But NUL bytes can cause problems if your utility is expecting ASCII text files.

Your input file is most likely UTF-16. MORE is converting the UTF-16 into ANSI (extended ASCII) format, which does effectively remove the NULL bytes. It also converts non-ASCII values into extended ASCII characters in the decimal 128 - 255 byte value range. I believe it uses your active code page (CHCP) value to figure out what characters map where, but I'm not positive.
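A short Python sketch of why a UTF-16 file looks like it is full of NUL bytes, and why converting it to a single-byte encoding makes them disappear. The code page cp1252 is an assumption standing in for "ANSI"; the code page that MORE actually uses on your machine may differ.

```python
text = "Hi\r\n"

# Build the bytes of a little-endian UTF-16 file: a 2-byte BOM (FF FE)
# followed by two bytes per character.
utf16_bytes = b"\xff\xfe" + text.encode("utf-16-le")
print(utf16_bytes)  # b'\xff\xfeH\x00i\x00\r\x00\n\x00'

# Every ASCII character is followed by a 0x00 byte; a tool that expects
# ASCII sees those as NUL control characters (^@).

# Decoding as "utf-16" consumes the BOM and pairs the bytes back up.
decoded = utf16_bytes.decode("utf-16")

# Re-encoding to a single-byte code page yields one byte per character,
# so the NUL bytes are gone. cp1252 is an assumed code page.
ansi_bytes = decoded.encode("cp1252", errors="replace")
print(ansi_bytes)  # b'Hi\r\n'
```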

You should be aware of some additional issues.

  • MORE will convert all Tab characters into a series of spaces, and you cannot control how many spaces (it varies depending on the current position in the line).

  • MORE will always terminate each line with \r\n (carriage return and line feed).

  • MORE also removes the two-byte BOM at the beginning of the file, if it exists. The BOM indicates the UTF-16 format, but MORE does not require it: it will convert the UTF-16 to ANSI regardless.

  • Lastly, MORE can hang indefinitely if your file exceeds 64K lines.

If MORE works for you, then by all means use it.

One other option is to use TYPE, which will also convert UTF-16 to ANSI:

type "yourFile.txt" >"newFile.txt"

TYPE definitely maps non-ASCII codes based on the active code page.

TYPE and MORE differ in how they convert:

  • One advantage of TYPE is it does not convert Tab characters to spaces.

  • Another advantage is it will not hang with large files.

  • Another difference (maybe good, maybe bad) is it will not add a line terminator to a line that does not already have one.

  • A potential disadvantage of TYPE is it will not convert UTF-16 to ANSI if the input is missing the BOM.
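If neither tool fits, the same conversion can be sketched in Python. This is only an illustration under stated assumptions, not what MORE or TYPE do internally: the function name utf16_to_ansi is hypothetical, cp1252 is an assumed target code page, and input without a BOM is assumed to be little-endian UTF-16. Reading and writing raw bytes leaves Tab characters and existing line terminators untouched.

```python
import codecs
from pathlib import Path


def utf16_to_ansi(src: str, dst: str, codepage: str = "cp1252") -> None:
    """Convert a UTF-16 file to a single-byte code page.

    Tabs and line terminators are written exactly as they were read,
    because the files are handled as bytes (no newline translation).
    """
    raw = Path(src).read_bytes()

    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        text = raw.decode("utf-16")
    else:
        # Assumption: a file without a BOM is little-endian UTF-16.
        text = raw.decode("utf-16-le")

    # Characters that have no mapping in the target code page become "?".
    Path(dst).write_bytes(text.encode(codepage, errors="replace"))
```

For example, utf16_to_ansi("yourFile.txt", "newFile.txt") would mirror the TYPE command above. If the input is not actually UTF-16, the result will be garbled or the decode will fail, so check the encoding first.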



Source: https://stackoverflow.com/questions/34378907/removing-binary-control-characters-from-a-text-file
