How to detect Arabic chars using perl regex?

Posted by 爱⌒轻易说出口 on 2019-12-06 07:38:27
mob

EDIT (as I have obviously wandered into tchrist's area of expertise): don't use $response->content, which always returns a raw byte string. Use $response->decoded_content instead, which applies any decoding hints it gets from the response headers.
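To make the difference concrete, here is a minimal sketch. It assumes HTTP::Response (the class LWP hands back) is installed, and it builds a hypothetical response by hand so that nothing is fetched over the network. The body and header values are made up for illustration.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical response built by hand so the example needs no network access.
# The body is the UTF-8 byte sequence for the Arabic word "salam".
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=utf-8' ],
    "\xD8\xB3\xD9\x84\xD8\xA7\xD9\x85",
);

my $bytes = $response->content;          # raw bytes, never decoded
my $chars = $response->decoded_content;  # decoded using the charset header

printf "content:         length %d, Arabic? %s\n",
    length($bytes), ($bytes =~ /\p{Arabic}/ ? 'yes' : 'no');
printf "decoded_content: length %d, Arabic? %s\n",
    length($chars), ($chars =~ /\p{Arabic}/ ? 'yes' : 'no');
```

With a real request you would get $response from your user agent instead of constructing it, but the two accessors should behave the same way.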


The page you are downloading is UTF-8 encoded, but you are not reading it as UTF-8 (in fairness, there are no hints on the page about what the encoding is [update: the server does return the header Content-Type: text/html; charset=utf-8, though]).

You can see this if you examine $response->content:

use List::Util 'max';
# highest code point among the characters of the raw response body
my $max_ord = max map { ord } split //, $response->content;
print "max ord of response content is $max_ord\n";

If you get a value less than 256, then you are reading this content in as raw bytes, and your strings will never match /\p{Arabic}/. You must decode the input as UTF-8 before you apply the regex:

use Encode;
my $content = decode('utf-8', $response->content);
# now check  $content =~ /\p{Arabic}/
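
As a self-contained illustration of the same point, the sketch below uses only the core Encode module. The byte string is a hypothetical stand-in for what $response->content would give you for a UTF-8 page containing Arabic text.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for $response->content:
# the Arabic word "salam" as raw UTF-8 bytes.
my $raw_bytes = "\xD8\xB3\xD9\x84\xD8\xA7\xD9\x85";

# Undecoded, each byte is treated as a separate character below 256,
# so nothing in the string has the Arabic script property.
my $raw_matches = $raw_bytes =~ /\p{Arabic}/ ? 1 : 0;

# After decoding, the string holds the real code points U+0633 etc.
my $decoded         = decode('UTF-8', $raw_bytes);
my $decoded_matches = $decoded =~ /\p{Arabic}/ ? 1 : 0;

print "raw bytes match:    $raw_matches\n";
print "decoded text match: $decoded_matches\n";
```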

Sometimes (and now I am wading well outside my area of expertise) the page you are loading contains hints about how it is encoded, and the string you are holding may already be decoded correctly (for example, if it came from $response->decoded_content). In that case, the decode call above is unnecessary and may be harmful. See other SO posts on detecting the encoding of an arbitrary string.

If you're using Perl, you should be able to match on the Unicode script property: /\p{Arabic}/

If that doesn't work, you'll have to look up the range of Unicode characters for Arabic and test for them with a character class, something like this: /[\x{0600}\x{0601}...\x{06FF}]/.
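
A rough sketch of that fallback, written as a range instead of listing every code point. It assumes only the basic Arabic block (U+0600 to U+06FF) matters to you. The helper name has_arabic_block_char is made up for this example, and the input is assumed to be an already decoded character string.

```perl
use strict;
use warnings;

# Assumption: only the basic Arabic block U+0600..U+06FF is of interest.
my $arabic_block = qr/[\x{0600}-\x{06FF}]/;

# Hypothetical helper: returns 1 if the decoded string contains
# at least one character from that block, 0 otherwise.
sub has_arabic_block_char {
    my ($text) = @_;
    return $text =~ $arabic_block ? 1 : 0;
}

print has_arabic_block_char("abc \x{0633}\x{0644}\x{0627}\x{0645}"), "\n";
print has_arabic_block_char("plain ASCII"), "\n";
```

Note that a block range and the \p{Arabic} script property are not the same set: the script also covers characters outside this block, and some characters inside the block (such as the Arabic comma, U+060C) are not assigned to the Arabic script. Prefer \p{Arabic} when it is available.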

msoutopico

Just for the record, at least in .NET regexps, you need to use \p{IsArabic}.
