The proper way of encoding detection in perl

Submitted by 你说的曾经没有我的故事 on 2019-12-07 09:25:02

Question


I've got these two strings:

%EC%E0%EC%E0+%EC%FB%EB%E0+%F0%E0%EC%F3
%D0%BC%D0%B0%D0%BC%D0%B0%20%D0%BC%D1%8B%D0%BB%D0%B0%20%D1%80%D0%B0%D0%BC%D1%83

These are the same Russian phrase, URL-encoded in cp1251 and UTF-8 respectively. I want to see them in Russian in my UTF-8 terminal using Perl. Unfortunately, after URL-decoding, the Perl module Encode::Detect fails to detect cp1251 for the first example. Instead, it reports "x-euc-tw".

The question is, what is the proper way of detecting the right encoding in this case (specifying locale parameters, using other modules...)?


Answer 1:


Are UTF-8 and cp1251 the only two options? The odds of cp1251 text also being valid UTF-8 are extremely small. (It would be gibberish.) So you can try a strict UTF-8 decode first and fall back to cp1251 when it fails:

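use Encode qw( decode );
# Try strict UTF-8 first; fall back to cp1251 if the bytes are not valid UTF-8.
# LEAVE_SRC keeps decode() from modifying $encoded in place.
my $decoded = eval { decode('UTF-8', $encoded, Encode::FB_CROAK | Encode::LEAVE_SRC) }
    // decode('cp1251', $encoded);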

This will be far more accurate than an encoding guesser.
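
For completeness, here is a minimal sketch of the whole round trip applied to the two strings from the question: URL-decode to raw bytes, decode with the UTF-8-then-cp1251 fallback, and re-encode as UTF-8 for the terminal. The helper names `url_decode` and `decode_utf8_or_cp1251` are made up for this illustration, and the sketch assumes UTF-8 and cp1251 really are the only candidate encodings.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: turn a URL-encoded string into raw bytes.
sub url_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;                               # '+' stands for a space in form data
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # %XX becomes the byte 0xXX
    return $s;
}

# Hypothetical helper: strict UTF-8 first, cp1251 as the fallback.
sub decode_utf8_or_cp1251 {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $text ? $text : decode('cp1251', $bytes);
}

my @samples = (
    '%EC%E0%EC%E0+%EC%FB%EB%E0+%F0%E0%EC%F3',
    '%D0%BC%D0%B0%D0%BC%D0%B0%20%D0%BC%D1%8B%D0%BB%D0%B0%20%D1%80%D0%B0%D0%BC%D1%83',
);

my @decoded = map { decode_utf8_or_cp1251(url_decode($_)) } @samples;

# Encode the character strings back to UTF-8 bytes for a UTF-8 terminal.
print encode('UTF-8', $_), "\n" for @decoded;
```

Both inputs should come out as the same character string, since they carry the same phrase in two encodings. If a third encoding can show up in the data, this fallback will silently mis-decode it as cp1251, so the approach only fits when the set of candidates is known in advance.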




Answer 2:


Encode::Detect, which uses the Mozilla universal character set detector, works by letting different character set probers look at the data. Each prober reports a confidence level, and the prober with the highest confidence wins. This process depends on the input only; it is not affected by locale or other external settings. In this case, for whatever reason, the prober for euc-tw is reporting a higher confidence than the prober for windows-1251, and there's nothing you can do short of changing the data or modifying the source code.

You could try using Encode::Guess, which lets you specify a list of encodings to choose from.
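
A rough sketch of what that might look like follows. `guess_encoding` is the function Encode::Guess exports; it returns an encoding object on success and a plain error string otherwise. The wrapper `guess_name` is a hypothetical helper for this illustration, and the byte strings are the question's two samples after URL-decoding.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

# Hypothetical helper: return the guessed encoding's name, or undef when
# the guess fails or is ambiguous.
sub guess_name {
    my ($bytes, @suspects) = @_;
    my $enc = guess_encoding($bytes, @suspects);
    return ref($enc) ? $enc->name : undef;   # a non-reference result is an error message
}

# The phrase from the question as raw bytes in each encoding.
my $cp1251_bytes = "\xEC\xE0\xEC\xE0 \xEC\xFB\xEB\xE0 \xF0\xE0\xEC\xF3";
my $utf8_bytes   = "\xD0\xBC\xD0\xB0\xD0\xBC\xD0\xB0 \xD0\xBC\xD1\x8B\xD0\xBB\xD0\xB0"
                 . " \xD1\x80\xD0\xB0\xD0\xBC\xD1\x83";

for my $bytes ($cp1251_bytes, $utf8_bytes) {
    my $name = guess_name($bytes, 'cp1251');
    print defined $name ? "guessed: $name\n" : "no unambiguous guess\n";
}
```

One caveat: Encode::Guess only checks which suspects can decode the bytes without error. A single-byte encoding such as cp1251 accepts almost any byte sequence, so valid UTF-8 input is likely to come back as ambiguous when cp1251 is among the suspects. For that reason the result has to be checked with `ref` before use, and a strict UTF-8 attempt is still needed as a tie-breaker.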



Source: https://stackoverflow.com/questions/11691554/the-proper-way-of-encoding-detection-in-perl
