Unable to encode to iso-8859-1 encoding for some chars using Perl Encode module

Posted by 烂漫一生 on 2020-01-05 17:31:21

Question


I have an HTML string in ISO-8859-1 encoding. I need to pass this string to HTML::Entities::decode_entities() to convert some of the HTML entities to their respective characters. To do so I am using HTML::Entities (from HTML::Parser 3.65), but after the decode_entities() operation my whole string changes to a UTF-8 string. This behavior appears consistent with the HTML::Parser documentation. Because I need the string back in ISO-8859-1 for further processing, I used Encode::encode("iso-8859-1", $str) to convert it back. The results are fine except for some characters, which come out as a question mark instead. One example is the entity for the curly single quote (’).

Can anybody tell me whether this is a limitation of the Encode module? Any other pointers toward solving the problem would also be helpful. Here is sample text containing the characters that cause the issue:

my $str = "This is a test string to test the encoding of some chars like ’ “ ” etc these are failing to encode; some of them which encode correctly are é « etc.";

Thanks


Answer 1:


The fundamental problem is that the characters represented by ’, “, and ” do not exist in ISO-8859-1. You'll have to decide what it is that you want to do with them.
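
To make this concrete, here is a minimal sketch (not from the original answer) that checks which characters fall inside the ISO-8859-1 range and shows what encode does with the rest. The sample string and variable names are made up for illustration, and the characters are written as code-point escapes so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: e-acute is U+00E9, the curly quote is U+2019.
my $str = "caf\x{E9} \x{2019}quoted\x{2019}";

# ISO-8859-1 covers exactly the code points U+0000 to U+00FF.
for my $ch (split //, $str) {
    my $cp = ord($ch);
    next if $cp < 0x80;    # skip plain ASCII to keep the output short
    printf "U+%04X %s\n", $cp,
        $cp <= 0xFF ? "fits in ISO-8859-1" : "not in ISO-8859-1";
}

# With the default CHECK value, encode() silently replaces every
# unmappable character with a substitution character.
my $latin1 = encode("iso-8859-1", $str);

# Show the resulting bytes in hex.
print join(" ", map { sprintf("%02X", ord($_)) } split(//, $latin1)), "\n";
```

In the byte dump, the e-acute survives as E9, while both occurrences of U+2019 come out as 3F, the question mark described in the question.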

Some possibilities:

Use cp1252, Microsoft's "extended" version of ISO-8859-1, instead of the real thing. It does include those characters.

Re-encode the characters outside the ISO-8859-1 range (plus &) as entities before converting from UTF-8 to ISO-8859-1:

use HTML::Entities ();
# Character set to escape: "&" plus every code point above U+00FF.
my $toEncode = do { no warnings 'utf8'; "&\x{0100}-\x{10FFFF}" };
$string = HTML::Entities::encode_entities($string, $toEncode);

(The no warnings bit is needed because U+10FFFF is a Unicode noncharacter, which Perl can warn about.)

There are other possibilities. It really depends on what you're trying to accomplish.
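
The cp1252 possibility above can be sketched as follows. This is only an illustration under the assumption that whatever consumes the bytes is prepared to treat them as cp1252 (windows-1252) rather than strict ISO-8859-1; the sample string is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: curly quotes (U+2019, U+201C, U+201D) plus two
# characters that ISO-8859-1 also has (U+00E9 and U+00AB).
my $str = "\x{2019} \x{201C} \x{201D} \x{E9} \x{AB}";

# cp1252 assigns printable characters to the 0x80-0x9F range, which
# ISO-8859-1 reserves for control codes.
my $bytes = encode("cp1252", $str);

# Show the resulting bytes in hex.
print join(" ", map { sprintf("%02X", ord($_)) } split(//, $bytes)), "\n";
```

The curly quotes land on bytes 92, 93 and 94, and the other two characters keep the same byte values they have in ISO-8859-1 (E9 and AB). The catch is that a consumer which interprets the data as real ISO-8859-1 will see control characters in place of the quotes.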




Answer 2:


There's a third argument to encode, which controls the checking it does. The default is to use a substitution character, but you can set it to FB_CROAK to get an error message.
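
A minimal sketch of that third argument, with a hypothetical sample string. The FB_HTMLCREF variant at the end goes slightly beyond what this answer mentions: it is another CHECK value provided by Encode that writes unmappable characters as decimal character references, and it is shown here only as a possible alternative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $str = "quote: \x{2019}";   # hypothetical sample containing U+2019

# With a CHECK value other than the default, encode() may modify the
# string it is given, so work on a copy.
my $copy  = $str;
my $bytes = eval { encode("iso-8859-1", $copy, Encode::FB_CROAK) };
my $err   = $@;

if (!defined $bytes) {
    # encode() died because U+2019 has no ISO-8859-1 equivalent.
    print "encode failed: $err";
}

# Alternative: replace unmappable characters with &#NNNN; references
# instead of dying.
my $copy2     = $str;
my $with_refs = encode("iso-8859-1", $copy2, Encode::FB_HTMLCREF);
print "$with_refs\n";
```

FB_CROAK is useful for finding out exactly which characters are being lost, while FB_HTMLCREF may suit the case where the output is still HTML and the reference will be rendered by a browser.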



Source: https://stackoverflow.com/questions/2963510/unable-to-encode-to-iso-8859-1-encoding-for-some-chars-using-perl-encode-module
