Searching text files' contents with various encodings with Python?

谁都会走 提交于 2019-12-06 09:43:50

The chardet package exists for a reason (and was ported from some older Netscape code, for a similar reason) : detecting the encoding of an arbitrary text file is tricky.

There are two basic alternatives :

  1. Use some hard-coded rules to determine whether a file has a certain encoding. For example, you could look for the UTF byte-order marker at the beginning of the file. This breaks for encodings that overlap significantly in their use of different bytes, or for files that don't happen to use the "marker" bytes that your detection rules use.

  2. Take a database of files in known encodings and count up the distributions of different bytes (and byte pairs, triplets, etc.) in each of the encodings. Then, when you have a file of unknown encoding, take a sample of its bytes and see which pattern of byte usage is the best match. This breaks when you have short test files (which makes the frequency estimates inaccurate), or when the usage of the bytes in your test file doesn't match the usage in the file database you used to build up your frequency data.

The reason notepad++ can do character detection (as well as web browsers, word processors, etc.) is that these programs all have one or both of these methods built in to the program. Python doesn't build this into its interpreter -- it's a general-purpose programming language, not a text editor -- but that's just what the chardet package does.

I would say that because you know some things about the text files that you're handling, you might be able to take a few shortcuts. For example, are your log files all in one of either encoding A or encoding B ? If so, then your decision is much simpler, and probably either the frequency-based or the rule-based approach above would be pretty straightforward to implement on your own. But if you need to detect arbitrary character sets, I'd highly recommend building on the shoulders of giants.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!