Perl print matched content only

Submitted by 那年仲夏 on 2019-12-12 01:28:00

Question


I am developing a web crawler in Perl. It extracts contents from the page and then a pattern match is done to check the language of the content. Unicode values are used to match the content.

Sometimes the extracted content contains text in multiple languages. The pattern match I used here prints all the text, but I want to print only the text that matches the Unicode values specified in the pattern.

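use LWP::UserAgent;
use HTML::ContentExtractor;

# $url is assumed to be set earlier in the crawler
my $uu         = LWP::UserAgent->new(agent => 'Mozilla 1.3');
my $extractorr = HTML::ContentExtractor->new();

# fetch the URL and decode the response body into Perl characters
my $responsee = $uu->get($url);
my $contentss = $responsee->decoded_content();

my $range = qr/([\x{0C00}-\x{0C7F}]+)/;    # match particular language

if ($contentss =~ m/$range/) {
  binmode(STDOUT, ":utf8");    # set the output layer before printing anything
  $extractorr->extract($url, $contentss);
  print "$url\n";
  print $extractorr->as_text;  # prints all extracted text, not only the matched part
}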

Answer 1:


It would be better to match characters with a particular Unicode property, rather than trying to formulate an appropriate character class.

The code points in the range 0x0C00 to 0x0C7F correspond to characters in Telugu (one of the Indian languages), which you can match using the regex /\p{Telugu}/.

The other properties you will probably need are /\p{Kannada}/, /\p{Malayalam}/, /\p{Devanagari}/, and /\p{Tamil}/.
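To print only the matching text rather than the whole page, one option is to collect every run of characters that has the property and print just those runs. The following is a minimal sketch, not the asker's code: telugu_runs() is a hypothetical helper name, and the sample string is made up for illustration (it is written with \x{...} escapes so the source file stays ASCII). It relies on the fact that a match with /g and a capture group returns all captures in list context.

```perl
use strict;
use warnings;

# Hypothetical helper: return each run of Telugu characters found in $text.
sub telugu_runs {
    my ($text) = @_;
    # With /g in list context, the match returns every captured substring.
    my @runs = $text =~ /(\p{Telugu}+)/g;
    return @runs;
}

# Illustrative mixed-script sample: two Telugu words inside English text.
my $sample = "Hello \x{0C24}\x{0C46}\x{0C32}\x{0C41}\x{0C17}\x{0C41} world "
           . "\x{0C2D}\x{0C3E}\x{0C37} 123";

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for telugu_runs($sample);   # one Telugu run per line
```

In the crawler from the question, the same helper could presumably be applied to the string returned by $extractorr->as_text instead of printing that string directly. Swapping \p{Telugu} for one of the other properties listed above would select a different script.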



Source: https://stackoverflow.com/questions/19157269/perl-print-matched-content-only
