How to detect emoji as unicode in Perl?

偶尔善良 posted on 2020-01-04 21:38:45

Question


I have a text file that contains emoji Unicode characters, for example 😤, ☹️, 😔, 😅, 😃, 😉, 😜, 😍.

For example, the code \N{1F60D} should correspond to 😍. I followed the recommendation in the "Creating Unicode" section of https://perldoc.perl.org/perluniintro.html. My program must detect these characters and process them, but the following does not match:

open(FIC1, ">$fic");

while (<FIC>) {
my $ligne=$_;

if( $ligne=~/\N{1F60D}/  )
{print "heart ";
    }
}

If I do this instead, it works:

open(FIC1, ">$fic");

while (<FIC>) {
my $ligne=$_;

if( $ligne=~/😍/  )
{print "Heart ";
    }
}

What is the problem with the first version? Regards


Answer 1:


If you look at perldoc perlre for \N, you see that it means "named Unicode character or character sequence". So \N{1F60D} is looked up as a character name, not as a code point.

You can use this instead:

if ($ligne =~ m/\N{U+1F60D}/)
# or
if ($ligne =~ m/\x{1F60D}/)

Edit: It's also described in the link you posted, https://perldoc.perl.org/perluniintro.html
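As a minimal sketch of the difference, the three spellings below all refer to the same character once the string holds decoded characters. The sample line in $ligne is hypothetical, and the named form assumes a Perl whose Unicode tables include U+1F60D (5.14 or later).

```perl
use strict;
use warnings;
use charnames ':full';   # explicit load of character names (automatic since Perl 5.16)

# Hypothetical sample line that is already a decoded character string.
my $ligne = "I \x{1F60D} Perl";

# Code point written in \N{U+...} form.
my $by_codepoint = $ligne =~ /\N{U+1F60D}/ ? 1 : 0;

# Same code point written as a hex escape.
my $by_hex = $ligne =~ /\x{1F60D}/ ? 1 : 0;

# \N{...} without "U+" expects the character's Unicode name.
my $by_name = $ligne =~ /\N{SMILING FACE WITH HEART-SHAPED EYES}/ ? 1 : 0;

print "code point: $by_codepoint, hex: $by_hex, name: $by_name\n";

# charnames::viacode maps a code point back to its name.
printf "U+1F60D is named: %s\n", charnames::viacode(0x1F60D);
```

Writing \N{1F60D} makes Perl search for a character literally named "1F60D", which does not exist.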

Edit: The content you read is probably not decoded. You want:

use Encode;
...
my $ligne = decode_utf8 $_;

or simply open the file directly in utf8 mode:

open my $fh, "<:encoding(UTF-8)", $filename or die "Could not open $filename: $!";
while (my $ligne = <$fh>) {
    if ($ligne =~ m/\N{U+1F60D}/) { ... }
}

You never showed how you open the filehandle called FIC, so I had assumed its input was already UTF-8 decoded. Here is another good tutorial about Unicode in Perl: https://perlgeek.de/en/article/encodings-and-unicode
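The effect of decoding can be seen without any file at all. The sketch below simulates an undecoded line by encoding a hypothetical sample string to UTF-8 bytes, then matches the same pattern before and after decoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Simulate a line read from a filehandle that has no encoding layer:
# the emoji arrives as four separate bytes, not as one character.
my $raw = encode('UTF-8', "J'adore \x{1F60D}\n");

# No single element of the byte string has the value 0x1F60D, so this fails.
my $match_raw = $raw =~ /\N{U+1F60D}/ ? 1 : 0;

# After decoding, the four bytes become one character and the match succeeds.
my $ligne = decode('UTF-8', $raw);
my $match_decoded = $ligne =~ /\N{U+1F60D}/ ? 1 : 0;

printf "raw:     length %d, match=%d\n", length($raw),   $match_raw;
printf "decoded: length %d, match=%d\n", length($ligne), $match_decoded;
```

The raw string is three units longer than the decoded one because the emoji occupies four bytes in UTF-8.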




Answer 2:


For detecting emoji, I would use unicode properties in regexes, e.g.:

  • \p{Emoticons} or
  • \p{Block: Emoticons}

For example, to print only the emoji:

perl -CSDA -nlE 'say for( /(\p{Emoticons})/g )' <<< 'abc😦😧😮αβγ'

will print

😦
😧
😮

For more info see perluniprops
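The same idea as a script, as a rough sketch with a hypothetical sample string. One caveat: the Emoticons block only spans U+1F600 to U+1F64F, so a character such as ☹ (U+2639, in the Miscellaneous Symbols block) is not matched by this property.

```perl
use strict;
use warnings;

# Encode wide characters as UTF-8 when printing.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical sample text: three emoticons, three Greek letters, and U+2639.
my $text = "abc\x{1F626}\x{1F627}\x{1F62E}\x{3B1}\x{3B2}\x{3B3} \x{2639}";

# Collect every character that belongs to the Emoticons block.
my @found = $text =~ /(\p{Block: Emoticons})/g;
printf "U+%04X %s\n", ord($_), $_ for @found;

# U+2639 lies outside the block, so the property does not match it.
my $sad_matches = "\x{2639}" =~ /\p{Block: Emoticons}/ ? 1 : 0;
print "U+2639 matched: $sad_matches\n";
```

To cover characters outside that block, the pattern would need additional blocks or properties; perluniprops lists what a given Perl version supports.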




Answer 3:


perl -C can be used to enable Unicode features:

perl -C -E 'say "\N{U+263a}"'|perl -C -ne 'print if /\N{U+263a}/'

From perlrun:

-C [number/list]

The -C flag controls some of the Perl Unicode features. ...

The second version works because Perl matches the raw UTF-8 byte sequence on both sides: neither the emoji in the source code nor the input is decoded, as in perl -ne 'print if /\xf0\x9f\x98\x8d/'.
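Where that byte sequence comes from can be checked with a short sketch that encodes U+1F60D and dumps the result in hex:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode the single character U+1F60D into its UTF-8 byte string.
my $bytes = encode('UTF-8', "\x{1F60D}");

# Show each byte as two hex digits.
my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
print "$hex\n";   # prints "f0 9f 98 8d"

# A pattern written as raw bytes matches the undecoded data.
my $byte_match = $bytes =~ /\xf0\x9f\x98\x8d/ ? 1 : 0;
print "byte pattern matched: $byte_match\n";
```

This only works as long as both the pattern and the data stay undecoded; once the input is decoded, the pattern has to name the character itself.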

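The following should work. Note that, according to perlrun, since Perl 5.10.1 a -C on the #! line must also be given on the command line, which is the case when the script is executed directly: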

#!/usr/bin/perl -C
open(FIC1, ">$fic");

while (<FIC>) {
    my $ligne=$_;

    if( $ligne=~/\N{U+1F60D}/  ) {
        print "heart ";
    }
}


Source: https://stackoverflow.com/questions/47924985/how-to-detect-emoji-as-unicode-in-perl
