Match and replace emoticons in string - what is the most efficient way?

Question

Wikipedia defines a lot of possible emoticons people can use. I want to match this list to words in a string. I now have this:

$string = "Lorem ipsum :-) dolor :-| samet";
$emoticons = array(
  '[HAPPY]' => array(' :-) ', ' :) ', ' :o) '), //etc...
  '[SAD]'   => array(' :-( ', ' :( ', ' :-| ')
);
foreach ($emoticons as $emotion => $icons) {
  $string = str_replace($icons, " $emotion ", $string);
}
echo $string;

Output:

Lorem ipsum [HAPPY] dolor [SAD] samet

so in principle this works. However, I have two questions:

1. As you can see, I'm putting spaces around each emoticon in the array, such as ' :-) ' instead of ':-)'. This makes the array less readable in my opinion. Is there a way to store emoticons without the spaces, but still match against $string with spaces around them (and as efficiently as the code is now)?
2. Or is there perhaps a way to put the emoticons in one variable, and explode on space to check against $string? Something like

$emoticons = array(
  '[HAPPY]' => ">:] :-) :) :o) :] :3 :c) :> =] 8) =) :} :^)",
  '[SAD]'   => ":'-( :'( :'-) :')" //etc...
);

3. Is str_replace the most efficient way of doing this?

I'm asking because I need to check millions of strings, so I'm looking for the most efficient way to save processing time :)

Answer 1:

If the $string in which you want to replace emoticons is provided by a visitor to your site (that is, user input such as a comment), then you should not rely on there being a space before or after the emoticon. There are also at least a couple of emoticons that are very similar but different, like :-) and :-)). So I think you will get better results if you define your emoticon array like this:

$emoticons = array(
    ':-)' => '[HAPPY]',
    ':)' => '[HAPPY]',
    ':o)' => '[HAPPY]',
    ':-(' => '[SAD]',
    ':(' => '[SAD]',
    // ...
);

Once you have filled in all the find/replace definitions, reorder the array so that a shorter emoticon such as :-) can never be replaced inside a longer one such as :-)). Sorting the array keys by length, longest first, should be enough. This applies if you are going to use str_replace(); strtr() with an array tries the longest keys first automatically!
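
The reordering described above can be sketched as follows. This is only an illustration: the helper name replaceLongestFirst and the [VERY_HAPPY] tag are made up for the example and do not come from the original list.

```php
<?php
// Hypothetical helper: replace emoticons with str_replace(), longest keys first.
function replaceLongestFirst(string $text, array $map): string
{
    // Sort keys by length, longest first, so ':-))' is tried before ':-)'.
    uksort($map, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    return str_replace(array_keys($map), array_values($map), $text);
}

$emoticons = array(
    ':-)'  => '[HAPPY]',
    ':-))' => '[VERY_HAPPY]', // illustrative tag, not in the original list
    ':('   => '[SAD]',
);

echo replaceLongestFirst('good :-)) bad :( ok :-)', $emoticons), "\n";
// good [VERY_HAPPY] bad [SAD] ok [HAPPY]
```

Without the sort, str_replace() would process ':-)' first and turn ':-))' into '[HAPPY])'.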

If you are concerned about performance, you can compare strtr with str_replace, but I suggest doing your own testing (your results may differ depending on the length of $string and on your find/replace definitions).
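
A rough way to do that testing is a small timing loop like the one below. It is a sketch under assumptions: the helper timeIt, the sample text, and the iteration count are arbitrary choices for illustration, and the numbers it prints will vary by machine and PHP version.

```php
<?php
// Hypothetical micro-benchmark helper: run $fn $iterations times, return seconds.
function timeIt(callable $fn, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

$map = array(':-)' => '[HAPPY]', ':)' => '[HAPPY]', ':-(' => '[SAD]', ':(' => '[SAD]');
$text = str_repeat('Lorem ipsum :-) dolor :( samet ', 20); // arbitrary sample input
$search = array_keys($map);
$replace = array_values($map);

$viaStrtr = timeIt(function () use ($text, $map) {
    strtr($text, $map);
}, 10000);
$viaStrReplace = timeIt(function () use ($text, $search, $replace) {
    str_replace($search, $replace, $text);
}, 10000);

printf("strtr:       %.4f s\n", $viaStrtr);
printf("str_replace: %.4f s\n", $viaStrReplace);
```

Run it with strings and emoticon lists that resemble your real data before drawing conclusions.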

The easiest approach is when your "find definitions" don't contain surrounding spaces:

$string = strtr( $string, $emoticons );
// Build an alternation of the tag names, e.g. "HAPPY|SAD"
$names = str_replace( '][', '|', trim( join( '', array_unique( $emoticons ) ), '[]' ) );
$string = preg_replace( '/\s*\[(' . $names . ')\]\s*/', '[$1]', $string ); // stripping white space around word-styled emoticons

Answer 2:

Here’s the idea using the Perl 3rd-party Regexp::Assemble module from CPAN. For example, given this program:

#!/usr/bin/env perl
use utf8;
use strict;
use warnings;

use Regexp::Assemble;

my %faces = (
    HAPPY => [qw¡ :-) :) :o) :-} ;-} :-> ;-} ¡],
    SAD   => [qw¡ :-( :( :-| ;-) ;-( ;-< |-{ ¡],
);

for my $name (sort keys %faces) {
    my $ra = Regexp::Assemble->new();
    for my $face (@{ $faces{$name} }) {
        $ra->add(quotemeta($face));
    }
    printf "%-12s => %s\n", "[$name]", $ra->re;
}

It will output this:

[HAPPY]      => (?-xism:(?::(?:-(?:[)>]|\})|o?\))|;-\}))
[SAD]        => (?-xism:(?::(?:-(?:\||\()|\()|;-[()<]|\|-\{))

There's a bit of extra wrapping there (the outer (?-xism: ... ) and the non-capturing group around everything) that you probably don't need, so those would reduce to just:

[HAPPY]      => :(?:-(?:[)>]|\})|o?\))|;-\}
[SAD]        => :(?:-(?:\||\()|\()|;-[()<]|\|-\{

or so. You could build that into your Perl program to trim the extra bits. Then you could place the right-hand sides straight into your preg_replace.
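
As a minimal sketch of that last step, the PHP below uses the two patterns from the Regexp::Assemble output with only the outer (?-xism: ... ) wrapper and the enclosing group removed. The sample string and the choice of / as the delimiter are assumptions for the example.

```php
<?php
// Patterns taken from the Regexp::Assemble output, minus the outer wrapper.
$patterns = array(
    '[HAPPY]' => '/:(?:-(?:[)>]|\})|o?\))|;-\}/',
    '[SAD]'   => '/:(?:-(?:\||\()|\()|;-[()<]|\|-\{/',
);

$string = 'Lorem ipsum :-) dolor :-| samet ;-}'; // sample input
foreach ($patterns as $tag => $pattern) {
    $string = preg_replace($pattern, $tag, $string);
}
echo $string, "\n";
// Lorem ipsum [HAPPY] dolor [SAD] samet [HAPPY]
```

Note that these patterns match the emoticons anywhere, with or without surrounding spaces.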

The reason I did the use utf8 was so I could use ¡ as my qw// delimiter, because I didn’t want to mess with escaping things inside there.

You wouldn’t need to do this if the whole program were in Perl, because modern versions of Perl already know to do this for you automatically. But it’s still useful to know how to use the module so you can generate patterns to use in other languages.

Answer 3:

This sounds like a good application for regular expressions, which are a tool for fuzzy text matching and replacement. str_replace is a tool for exact text search and replace; regexps will let you search for entire classes of "text that looks something like this", where the this is defined in terms of what kinds of characters you will accept, how many of them, in what order, etc.

If you use regular expressions, then...

1. The \s wildcard will match whitespace, so you can match \s$emotion\s.

   (Also consider the case where the emoticon occurs at the end of a string, i.e. "that was funny lol :)". You can't always assume emoticons will have spaces around them. You can write a regexp that handles this.)

2. You can write a regular expression that will match any of the emoticons in the list. You do this using the alternation symbol |, which you can read as an OR symbol. The syntax is (a|b|c) to match pattern a OR b OR c.

   For example (:\)|:-\)|:o\)) will match any of :), :-), :o). Note that I had to escape the )'s because they have a special meaning inside regexps (parentheses are used as a grouping operator).

3. Premature optimisation is the root of all evil.

   Try the most obvious thing first. If that doesn't work, you can optimise it later (after you profile the code to ensure this is really going to give you a tangible performance benefit).
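
The whitespace and alternation points above might look like this in PHP. It is a sketch, not part of the original answer: the helper name buildPattern is hypothetical, and it swaps \s for lookarounds, which assert whitespace or a string boundary on each side without consuming it.

```php
<?php
// Hypothetical helper: build one alternation from a list of emoticons.
function buildPattern(array $icons): string
{
    // preg_quote() escapes ')' and the other regex metacharacters.
    $alternatives = array_map(function ($icon) {
        return preg_quote($icon, '/');
    }, $icons);
    // (?<!\S) and (?!\S) require whitespace or a string boundary on each
    // side, so an emoticon at the very end of the string still matches.
    return '/(?<!\S)(?:' . implode('|', $alternatives) . ')(?!\S)/';
}

$happy = buildPattern(array(':)', ':-)', ':o)'));
echo preg_replace($happy, '[HAPPY]', 'that was funny lol :)'), "\n";
// that was funny lol [HAPPY]
```

Because the surrounding whitespace is not consumed, two emoticons separated by a single space are both replaced.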

If you want to learn regular expressions, try Chapter 8 of the TextWrangler manual. It's a very accessible introduction to the uses and syntax of regular expressions.

Note: my advice is programming-language independent. My PHP-fu is much weaker than my Python-fu, so I can't provide sample code. :(

Answer 4:

Intro comment: Please ask only one question at a time. You'll get better answers then. Besides that, you can't get good performance advice if you don't show us the measurements you've taken so far.

From what I can see in your code, you do one piece of string processing on every loop iteration that you could save, namely wrapping the replacement in spaces. You could move that into your definition up front:

$emoticons = array(
  ' [HAPPY] ' => array(' :-) ', ' :) ', ' :o) '), //etc...
  ' [SAD] '   => array(' :-( ', ' :( ', ' :-| ')
);

foreach ($emoticons as $replace => $search)
{
  $string = str_replace($search, $replace, $string);
}

This will save you a fraction of a microsecond each time you call it, which gives you a performance gain you will probably not notice. Which brings me to the point that you should probably write this in C and compile it.

A bit closer to C would be using a regular expression compiled once and then re-used, which has been suggested in another answer already. The benefit is that this may be the fastest approach available in PHP if you run the same expression many times. You could also generate the regular expression up front, so you can store the emoticons in a format that is easier for you to edit, and then cache the generated expression if you really need to tune performance that hard.
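
One possible shape for "generate the expression up front and re-use it" is sketched below. The helper names compileEmoticons and replaceEmoticons are hypothetical, and the lookaround-based whitespace handling is an assumption about what "spaces around them" should mean at the string boundaries.

```php
<?php
// Hypothetical helper: compile the whole table into one pattern plus a lookup map.
function compileEmoticons(array $table): array
{
    $lookup = array();
    foreach ($table as $tag => $icons) {
        foreach ($icons as $icon) {
            $lookup[$icon] = $tag;
        }
    }
    // Longest alternatives first, so a longer emoticon wins over its prefix.
    $icons = array_keys($lookup);
    usort($icons, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    $quoted = array_map(function ($icon) {
        return preg_quote($icon, '/');
    }, $icons);
    return array('/(?<!\S)(?:' . implode('|', $quoted) . ')(?!\S)/', $lookup);
}

// Hypothetical helper: one pass over the text, mapping each match to its tag.
function replaceEmoticons(string $text, string $pattern, array $lookup): string
{
    return preg_replace_callback($pattern, function ($m) use ($lookup) {
        return $lookup[$m[0]];
    }, $text);
}

$table = array(
    '[HAPPY]' => array(':-)', ':)', ':o)'),
    '[SAD]'   => array(':-(', ':(', ':-|'),
);
list($pattern, $lookup) = compileEmoticons($table); // build once, reuse for every string
echo replaceEmoticons('Lorem ipsum :-) dolor :-| samet', $pattern, $lookup), "\n";
// Lorem ipsum [HAPPY] dolor [SAD] samet
```

The table stays easy to edit, and the generated pattern string could be stored in a file or cache rather than rebuilt on each run.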

1. As you can see, I'm putting spaces around each emoticon in the array, such as ' :-) ' instead of ':-)' This makes the array less readable in my opinion. Is there a way to store emoticons without the spaces, but still match against $string with spaces around them? (and as efficiently as the code is now?)

Yes, this is possible, but it is not more efficient in the sense that you would need to further process the configuration data into the replacement data. I don't know which kind of efficiency you really mean (editing or processing), but I assume the latter, so the answer is: possible, but not suitable for your very special use case. Normally I would prefer something that's easier to edit, so that you are more efficient in dealing with it, rather than caring about processing speed, because processing time can be cut fairly well by distributing the work across multiple computers.

2. Or is there perhaps a way to put the emoticons in one variable, and explode on space to check against $string? Something like

$emoticons = array(
  '[HAPPY]' => ">:] :-) :) :o) :] :3 :c) :> =] 8) =) :} :^)",
  '[SAD]'   => ":'-( :'( :'-) :')" //etc...
);

Sure, that's possible but you run into the same issues as in 1.
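
For illustration, the extra processing step mentioned in points 1 and 2 could look like the sketch below: the readable, space-separated configuration is expanded once into the padded arrays that str_replace needs. The helper name expandConfig and the shortened emoticon lists are assumptions made for the example.

```php
<?php
// Hypothetical helper: turn space-separated lists into padded search arrays.
function expandConfig(array $config): array
{
    $expanded = array();
    foreach ($config as $tag => $list) {
        $icons = preg_split('/\s+/', trim($list));
        // Pad both the replacement and every emoticon with spaces.
        $expanded[" $tag "] = array_map(function ($icon) {
            return " $icon ";
        }, $icons);
    }
    return $expanded;
}

$config = array(
    '[HAPPY]' => ':-) :) :o)',
    '[SAD]'   => ':-( :( :-|',
);
$emoticons = expandConfig($config); // do this once, not once per string

$string = 'Lorem ipsum :-) dolor :-| samet';
foreach ($emoticons as $replace => $search) {
    $string = str_replace($search, $replace, $string);
}
echo $string, "\n";
// Lorem ipsum [HAPPY] dolor [SAD] samet
```

The expansion cost is paid once, so for millions of strings it should be negligible next to the replacements themselves.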

3. Is str_replace the most efficient way of doing this?

Well, with the code you've shown, it's the only way you ask about. Since you don't tell us about any alternative, and it at least works for you, at this point in time it is the most efficient way of doing that for you. So right now, yes.

Answer 5:

I would start trying out the simplest implementation first, using str_replace and those arrays with spaces. If the performance is unacceptable, try a single regular expression per emotion. That compresses things quite a bit:

$emoticons = array(
  '[HAPPY]' => ' [:=]-?[\)\]] ', 
  '[SAD]'   => ' [:=]-?[\(\[\|] '
);
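
As a small sketch of how these patterns might be applied, the loop below wraps each one in / delimiters (an assumption; none of the patterns contain a slash) and runs preg_replace once per emotion. The sample string is made up for the example.

```php
<?php
$emoticons = array(
  '[HAPPY]' => ' [:=]-?[\)\]] ',
  '[SAD]'   => ' [:=]-?[\(\[\|] '
);

$string = 'Lorem ipsum :-) dolor =( samet'; // sample input
foreach ($emoticons as $tag => $pattern) {
    // '/' is used as the delimiter; the replacement keeps the surrounding spaces.
    $string = preg_replace('/' . $pattern . '/', " $tag ", $string);
}
echo $string, "\n";
// Lorem ipsum [HAPPY] dolor [SAD] samet
```

One limitation to be aware of: because each match consumes its trailing space, the second of two same-emotion emoticons separated by a single space is not matched, and neither is an emoticon at the very start or end of the string.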

If performance is still unacceptable, you can use something fancier, like a suffix tree (see: http://en.wikipedia.org/wiki/Suffix_tree ), which allows you to scan the string only once for all emoticons. (Strictly speaking, the structure described here, built from the emoticons rather than from the text, is usually called a trie or prefix tree.) The concept is simple: you have a tree whose root is a space (since you want to match a space before the emoticon), the first children are ':' and '=', then the children of ':' are ']', ')', '-', etc. You have a single loop that scans the string, char by char. When you find a space, you move to the next level in the tree, then see if the next character is one of the nodes at that level (':' or '='); if so, you move to the next level, and so on. If, at any point, the current char is not a node in the current level, you go back to the root.
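
The tree walk described above can be sketched in PHP with nested arrays keyed by character. This is an illustration under assumptions, not a tuned implementation: the helper names buildTrie and scan are hypothetical, each emoticon is stored padded with the required spaces, and the trailing space of a match is re-read so that it can start the next emoticon.

```php
<?php
// Hypothetical helper: build a character-keyed tree from the emoticon table.
function buildTrie(array $table): array
{
    $root = array();
    foreach ($table as $tag => $icons) {
        foreach ($icons as $icon) {
            $node = &$root;
            foreach (str_split(" $icon ") as $ch) { // pad with the required spaces
                if (!isset($node[$ch])) {
                    $node[$ch] = array();
                }
                $node = &$node[$ch];
            }
            $node['tag'] = $tag; // multi-character key, cannot clash with a char
            unset($node);
        }
    }
    return $root;
}

// Hypothetical helper: scan the text, walking the tree from each position.
function scan(string $text, array $trie): string
{
    $out = '';
    $i = 0;
    $n = strlen($text);
    while ($i < $n) {
        $node = $trie;
        $j = $i;
        $matchEnd = -1;
        $matchTag = null;
        // Descend while the next character is a child of the current node.
        while ($j < $n && isset($node[$text[$j]])) {
            $node = $node[$text[$j]];
            $j++;
            if (isset($node['tag'])) { // remember the longest match so far
                $matchEnd = $j;
                $matchTag = $node['tag'];
            }
        }
        if ($matchTag !== null) {
            $out .= " $matchTag";
            $i = $matchEnd - 1; // re-read the trailing space
        } else {
            $out .= $text[$i]; // no emoticon here, copy the character
            $i++;
        }
    }
    return $out;
}

$trie = buildTrie(array(
    '[HAPPY]' => array(':-)', ':)', ':o)'),
    '[SAD]'   => array(':-(', ':(', ':-|'),
));
echo scan('Lorem ipsum :-) dolor :-| samet', $trie), "\n";
// Lorem ipsum [HAPPY] dolor [SAD] samet
```

Whether this beats str_replace or a compiled regular expression in PHP is something to measure rather than assume, since the built-in functions run in C.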

Source: https://stackoverflow.com/questions/9295896/match-and-replace-emoticons-in-string-what-is-the-most-efficient-way

Tags

php

regex

performance

string-matching

suffix-tree