How to convert “binary text” to “visible text”?

别来无恙 posted on 2021-02-05 06:59:45

Question


I have a text file full of non-ASCII characters. I cannot detect its encoding with either file or enca.

file non_ascii.txt
non_ascii.txt: Non-ISO extended-ASCII text

enca non_ascii.txt
Unrecognized encoding

But I can open it normally in Notepad++ on Windows.

Edit: The statement above is misleading, sorry. In fact, I copied some parts of the original file into a new text file and opened that file in Notepad++.

The two parts are shown below. Notepad++ decodes them in two different ways.

Question:

  1. How can I detect the file's encoding under Linux?
  2. How do I recover the characters represented by <F1><EE><E9><E4><FF>? Running "grep 'сойдя' win.txt" returns nothing, even though "сойдя" is encoded as <F1><EE><E9><E4><FF>.

A slice of the file content follows:

less non_ascii.txt
"non_ascii.txt" may be a binary file.  See it anyway?
<F1><EE><E9><E4><FF>
<F2><F0><E0><EA><F2><EE><E2><E0><F2><FC><F1><FF>
<D0><F2><E9><E4><D7><E9><E7><E1><EC><E1><F3><F8>
<D1><E5><EA><F3><ED><E4>
<F0><E0><E7><E3><F0><F3><E7><EA><E8>
<EF><EE><E4><F1><F2><E0><E2><EB><FF><F2><FC>
<F0><E0><E7><E3><F0><F3><E7><EA><E5>
<F1><EE><E9><E4><F3>
<F0><E0><E7><E3><F0><F3><E7><EA><E0>
<F1><EE><E2><EB><E0><E4><E0><EB><E8>
<C1><D7><E9><E1><F0><EF><FE><F4><E1>
<CB><C1><D3><D3><C9><D4><C5><D2><C9><D4>
<F1><EE><E2><EB><E0><E4><E0><EB><EE>
<F1><EE><E9><E4><E8>
<F1><EE><E2><EB><E0><E4><E0><EB><E0>

Answer 1:


Your question really has two parts: (1) how do I identify an unknown encoding and (2) how do I convert that to something useful?

The first part is the real challenge, and really cannot be answered in universal terms -- in the general case, there is no reliable way to identify an unknown 8-bit encoding. Some encodings give you good hints (UTF-8 is an excellent example) and in many cases, if you have a good idea what the text is supposed to represent, the problem can be solved.
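As a rough illustration of why UTF-8 gives good hints: its multi-byte sequences follow strict rules, so text in a legacy 8-bit encoding almost never decodes cleanly as UTF-8. The helper below is a minimal sketch; the name looks_like_utf8 is made up for this example. Note that pure ASCII input also passes, because ASCII is a subset of UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# First line of the excerpt in the question: <F1><EE><E9><E4><FF>
SAMPLE = bytes.fromhex("f1eee9e4ff")
```

Here looks_like_utf8(SAMPLE) is False, which rules out UTF-8 for the excerpt but says nothing about which 8-bit encoding it actually is.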

A mapping of 8-bit character meanings can be helpful (cough, the link is to mine) and in this case quickly hints at Windows code page 1251. Kudos for the hex dumps and the picture with the representation you expect!
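The same idea as a lookup table can be tried mechanically: decode a short sample under several candidate encodings and see which result reads as real words. This is only a sketch. The candidate list is an assumption (common single-byte Cyrillic encodings known to Python), and decode_candidates is a name invented for this example.

```python
# Assumed candidate list: single-byte Cyrillic encodings shipped with Python.
CANDIDATES = ("cp1251", "koi8_r", "cp866", "iso8859_5", "mac_cyrillic")


def decode_candidates(raw: bytes, encodings=CANDIDATES) -> dict:
    """Map each candidate encoding name to the text it yields for raw."""
    # errors="replace" keeps going when a byte is undefined in an encoding.
    return {enc: raw.decode(enc, errors="replace") for enc in encodings}


# First line of the excerpt in the question: <F1><EE><E9><E4><FF>
SAMPLE = bytes.fromhex("f1eee9e4ff")
```

For SAMPLE, the "cp1251" entry is the word the question expects, "сойдя". A human still has to judge which candidate looks right for each line.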

With that out of the way, converting is easy.

iconv -f cp1251 -t utf-8 non_ascii.txt >utf8.txt

Provided your Linux system is set up to use UTF-8 at the terminal, your grep command should work on utf8.txt now.
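If iconv is not at hand, the same conversion and search can be sketched in Python. The function names and the default source encoding below are assumptions made for illustration, and the sketch reads the whole file into memory, so it suits small files only.

```python
from pathlib import Path


def convert_to_utf8(src, dst, src_encoding="cp1251"):
    """Decode src with src_encoding and write the text to dst as UTF-8."""
    text = Path(src).read_bytes().decode(src_encoding)
    # Write bytes directly so line endings are preserved unchanged.
    Path(dst).write_bytes(text.encode("utf-8"))
    return text


def grep_lines(path, needle):
    """Return the lines of a UTF-8 file that contain needle."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if needle in line]
```

For example, convert_to_utf8("non_ascii.txt", "utf8.txt") followed by grep_lines("utf8.txt", "сойдя") mirrors the iconv and grep steps above.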

The indication that some of the text is "ANSI" (which is a bogus term anyway) is probably just a red herring -- as far as I can tell, everything in your excerpt looks like well-formed CP1251.

Some tools like chardet do a reasonable job of at least steering you in the right direction, though you have to understand that, like a human expert, they have to guess what the text is supposed to represent. There are corner cases where they just don't have enough information to guess correctly, either because there are several candidate encodings with very few differences (for example, Latin-1 vs Latin-9 vs Windows-1252, all of which also overlap with plain 7-bit US-ASCII in the first 128 positions) or because the input doesn't contain enough information to establish any common patterns.
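To make the Latin-1 vs Latin-9 vs Windows-1252 corner case concrete, the sketch below lists the byte values at which those three encodings disagree, using Python's codec names for them. The function name is invented for this example. Input that contains none of these byte values decodes identically under all three, so no detector can tell them apart.

```python
# Python codec names for Latin-1, Latin-9 and Windows-1252.
LOOKALIKES = ("latin_1", "iso8859_15", "cp1252")


def differing_bytes(encodings=LOOKALIKES):
    """Return the byte values that the given encodings decode differently."""
    differing = []
    for value in range(256):
        raw = bytes([value])
        # Bytes undefined in an encoding become U+FFFD via errors="replace".
        decoded = {raw.decode(enc, errors="replace") for enc in encodings}
        if len(decoded) > 1:
            differing.append(value)
    return differing
```

Every value returned is 0x80 or above, which matches the point that these encodings all agree with 7-bit US-ASCII in the first 128 positions.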



Source: https://stackoverflow.com/questions/33558075/how-to-convert-binary-text-to-visible-text
