Regular expression to search for Gadaffi

鱼传尺愫 2020-12-07 07:05

I'm trying to search for the word Gadaffi. What's the best regular expression to search for this?

My best attempt so far is:

\b[KG]h?add?af?fi$\b


        
15 Answers
  • 2020-12-07 07:10

    One interesting thing to note from your list of potential spellings is that it yields only three Soundex values (if you ignore the outlier 'Kazzafi'):

    G310, K310, Q310

    Now, there are false positives in there ('Godby' also is G310), but by combining the limited metaphone hits as well, you can eliminate them.

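    <?php
    // Soundex and metaphone codes shared by the known spellings.
    $soundexMatch = array('G310','K310','Q310');
    $metaphoneMatch = array('KTF','KTHF','FTF','KHTF','K0F');

    $text = "This is a big glob of text about Mr. Gaddafi. Even using compound-Khadafy terms in here, then we might find Mr Qudhafi to be matched fairly well. For example even with apostrophes sprinkled randomly like in Kad'afi, you won't find false positives matched like godfrey, or godby, or even kabbadi";

    // Split on whitespace and punctuation; apostrophes stay inside a word (e.g. Kad'afi).
    $wordArray = preg_split('/[\s,.;-]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $matches = array();
    foreach ($wordArray as $item) {
        // One point for a Soundex hit, one for a metaphone hit; require both.
        $rate = in_array(soundex($item), $soundexMatch) + in_array(metaphone($item), $metaphoneMatch);
        if ($rate > 1) {
            // Quote the word so it is treated literally inside the pattern.
            $matches[] = preg_quote($item, '/');
        }
    }
    // Skip highlighting when nothing matched; an empty alternation would match everywhere.
    if ($matches) {
        $pattern = implode('|', $matches);
        $text = preg_replace("/($pattern)/", '<b>$1</b>', $text);
    }
    echo $text;
    ?>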
    

    A few tweaks, and let's say some Cyrillic transliteration, and you'll have a fairly robust solution.
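The Soundex grouping described in this answer can be checked with a small sketch. Python's standard library has no Soundex function, so `soundex` below is a hypothetical, simplified implementation of American Soundex written only for this illustration; it is not PHP's `soundex()` and may differ from it on edge cases. The words are spellings mentioned in this thread.

```python
# Letter groups used by American Soundex; vowels, H, W and Y carry no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Simplified American Soundex (illustrative, not PHP's implementation)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]  # drops apostrophes
    if not letters:
        return ""
    out = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (out + "000")[:4]


for word in ["Gaddafi", "Khadafy", "Qudhafi", "Kad'afi", "Kazzafi", "Godby", "Godfrey"]:
    print(word, soundex(word))
```

Under this sketch 'Godby' shares a code with 'Gaddafi' while 'Godfrey' does not, which is why the answer pairs Soundex with a second phonetic check.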

  • 2020-12-07 07:12

    I think you're overcomplicating things here. The correct regex is as simple as:

    \u0627\u0644\u0642\u0630\u0627\u0641\u064a
    

    It matches the concatenation of the seven Arabic Unicode code points that form the word القذافي (i.e. Gadaffi).
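As a minimal sketch of how this literal could be used, assuming Python's `re` module and a made-up sample sentence, the escapes can be written directly in a string:

```python
import re

# The seven code points from the answer, written as escapes.
ARABIC_NAME = "\u0627\u0644\u0642\u0630\u0627\u0641\u064a"
pattern = re.compile(ARABIC_NAME)

# Hypothetical sample text: a Latin-script sentence with the Arabic word embedded.
sample = "The Arabic spelling is " + ARABIC_NAME + "."
match = pattern.search(sample)

print(len(ARABIC_NAME))   # number of code points in the literal
print(match is not None)  # whether the word was found in the sample
```

This only finds the Arabic-script form; it does nothing for the Latin transliterations the question asks about.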

  • 2020-12-07 07:16

    [GQK][ahu]+[dtez]+'?[adhz]+f{1,2}(i|y)

    In parts:

    • [GQK]
    • [ahu]+
    • [dtez]+
    • '?
    • [adhz]+
    • f{1,2}(i|y)

    Note: Just wanted to give a shot at this.
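A quick way to see what this pattern accepts is to run it over a few words. The sketch below assumes Python's `re` module; the word lists are picked from spellings and non-matches mentioned elsewhere in this thread, and the helper name `matches` is hypothetical.

```python
import re

PATTERN = re.compile(r"[GQK][ahu]+[dtez]+'?[adhz]+f{1,2}(i|y)")


def matches(word):
    # fullmatch requires the whole word to fit the pattern.
    return PATTERN.fullmatch(word) is not None


listed = ["Gaddafi", "Khadafy", "Qudhafi", "Kad'afi", "Kazzafi", "Gheddafi"]
unlisted = ["Gazzafy", "Godfrey", "kabbadi"]

for word in listed + unlisted:
    print(word, matches(word))
```

Because the character classes are broad, the pattern also accepts strings such as "Gazzafy" that are probably not real variants, so it trades precision for coverage.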

  • 2020-12-07 07:17

    Easy... (Qadaffi|Khadafy|Qadafi|...)... it's self-documented, maintainable, and assuming your regexp engine actually compiles regular expressions (rather than interpreting them), it will compile to the same DFA that a more obfuscated solution would.

    Writing compact regular expressions is like using short variable names to speed up a program. It only helps if your compiler is brain-dead.
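The explicit-list approach can be sketched as follows, assuming Python's `re` module. `SPELLINGS` is an illustrative subset built from names that appear in this thread, not the full list the answer elides, and `build_pattern` is a hypothetical helper.

```python
import re

# Illustrative subset only; the real list would contain every accepted spelling.
SPELLINGS = ["Qadaffi", "Khadafy", "Qadafi", "Gaddafi", "Kad'afi"]


def build_pattern(spellings):
    # Escape each spelling so it is matched literally, and try longer ones first.
    ordered = sorted(spellings, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")


NAME_RE = build_pattern(SPELLINGS)

text = "Reports mention Qadafi, Kad'afi and Khadafy, but not Godfrey."
found = NAME_RE.findall(text)
print(found)
```

Adding or removing a spelling is then a one-line change to the list, which is the maintainability argument made above.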

  • 2020-12-07 07:22

    \b[KGQ]h?add?h?af?fi\b

    The Arabic transliteration is (Wikipedia says) "Qaḏḏāfī", so I added a Q, plus an optional H (for "Gadhafi", which the article below mentions).

    Btw, why is there a $ at the end of the regex?
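The effect of the trailing `$` can be shown with a short sketch, assuming Python's `re` module and two made-up sample sentences. `$` anchors the match to the end of the input, so the question's pattern only finds the name when it is the last word.

```python
import re

ORIGINAL = re.compile(r"\b[KG]h?add?af?fi$\b")   # pattern from the question
REVISED = re.compile(r"\b[KGQ]h?add?h?af?fi\b")  # pattern from this answer

end_of_text = "a speech by Gaddafi"    # name is the last word
mid_text = "Gaddafi gave a speech"     # name is followed by more text

for label, regex in (("original", ORIGINAL), ("revised", REVISED)):
    print(label,
          regex.search(end_of_text) is not None,
          regex.search(mid_text) is not None)

both = REVISED.findall("Qaddafi and Gadhafi")
print(both)
```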


    Btw, nice article on the topic:

    Gaddafi, Kadafi, or Qaddafi? Why is the Libyan leader’s name spelled so many different ways?


    EDIT

    To cover all the names in the article you mentioned later, this pattern should match them all. Let's just hope it won't match a lot of other stuff :D

    \b(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aff?[iy]\b
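One way to check a pattern like this is a small harness that reports which spellings it misses and which unrelated words it hits. The sketch below assumes Python's `re` module; `candidates` and `unrelated` are illustrative lists drawn from words mentioned in this thread, not the article's full list.

```python
import re

EDITED = re.compile(r"\b(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aff?[iy]\b")

# Spellings mentioned in this thread (an illustrative subset).
candidates = ["Gaddafi", "Gadaffi", "Kad'afi", "Qadhdhafi", "Khadafy", "Gheddafi",
              "Kazzafi", "Gathafi", "Qudhafi", "Qadthafi", "Quathafi"]
# Words that should not be reported as the name.
unrelated = ["Godfrey", "Godby", "kabbadi"]

missed = [w for w in candidates if EDITED.fullmatch(w) is None]
false_hits = [w for w in unrelated if EDITED.search(w) is not None]

print("missed:", missed)
print("false hits:", false_hits)
```

Extending `unrelated` with words from real text is the practical way to find out how much "other stuff" the pattern picks up.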
    
  • 2020-12-07 07:24

    Why not do a mixed approach? Something between a list of all possibilities and a complicated Regex that matches far too much.

    Regex is about pattern matching, and I can't see a single pattern covering all variants in the list. Trying to force one will also match things like "Gazzafy" or "Quud'haffi", which are most probably not variants in use and are definitely not on the list.

    But I can see patterns for some of the variants, and so I ended up with this:

    \b(?:Gheddafi|Gathafi|Kazzafi|Kad'afi|Qadhdhafi|Qadthafi|Qudhafi|Qu?athafi|[KG]h?add?h?aff?[iy]|Qad[dh]?afi)\b
    

    At the beginning I list the variants where I can't see a pattern, followed by the ones where there is a pattern.

    See it here on www.rubular.com
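The difference from a fully generalized pattern can be checked with a short sketch, assuming Python's `re` module. The `accepted` list is an illustrative subset of spellings the pattern is meant to cover, and `rejected` holds the two over-generalized strings named above.

```python
import re

MIXED = re.compile(
    r"\b(?:Gheddafi|Gathafi|Kazzafi|Kad'afi|Qadhdhafi|Qadthafi|Qudhafi|Qu?athafi"
    r"|[KG]h?add?h?aff?[iy]|Qad[dh]?afi)\b"
)

accepted = ["Gheddafi", "Kad'afi", "Qadhdhafi", "Gaddafi", "Khadafy", "Qaddafi", "Qathafi"]
rejected = ["Gazzafy", "Quud'haffi"]

accepted_ok = [w for w in accepted if MIXED.fullmatch(w) is not None]
rejected_hits = [w for w in rejected if MIXED.search(w) is not None]

print("accepted:", accepted_ok)
print("unexpected matches:", rejected_hits)
```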
