How to check if a string looks randomized, or human-generated and pronounceable?

旧巷少年郎 2020-12-13 03:55

For the purpose of identifying [possible] bot-generated usernames.

Suppose you have a username like "bilbomoothof" .. it may be nonsense, but it still contains pro

10 answers
  •  难免孤独
    2020-12-13 04:28

    Off the top of my head, you could look for syllables, making use of soundex. That's the direction I would explore, based on the assumption that a pronounceable word has at least one syllable.

    EDIT: Here's a function for counting syllables:

    function count_syllables($word) {
        // Based on Greg Fast's Perl module Lingua::EN::Syllables

        // Patterns where counting vowel groups overestimates: subtract one each
        $subsyl = array(
            'cial',
            'tia',
            'cius',
            'cious',
            'giu',
            'ion',
            'iou',
            'sia$',
            '.ely$',
        );

        // Patterns where counting vowel groups underestimates: add one each
        $addsyl = array(
            'ia',
            'riet',
            'dien',
            'iu',
            'io',
            'ii',
            '[aeiouym]bl$',
            '[aeiou]{3}',
            '^mc',
            'ism$',
            '([^aeiouy])\1l$',
            '[^l]lien',
            '^coa[dglx].',
            '[^gq]ua[^auieo]',
            'dnt$',
        );

        // Keep letters only, lowercased
        $word = preg_replace('/[^a-z]/', '', strtolower($word));

        // Split on consonant runs; the non-empty pieces are the vowel groups.
        // PREG_SPLIT_NO_EMPTY also guarantees an array when there are no vowels.
        $valid_word_parts = preg_split('/[^aeiouy]+/', $word, -1, PREG_SPLIT_NO_EMPTY);

        $syllables = 0;
        // Thanks to Joe Kovar for correcting a bug in the following lines
        foreach ($subsyl as $syl) {
            $syllables -= preg_match('~' . $syl . '~', $word);
        }
        foreach ($addsyl as $syl) {
            $syllables += preg_match('~' . $syl . '~', $word);
        }
        if (strlen($word) == 1) {
            $syllables++;
        }
        $syllables += count($valid_word_parts);
        $syllables = ($syllables == 0) ? 1 : $syllables;
        return $syllables;
    }
    

    From this very interesting link:

    http://www.addedbytes.com/php/flesch-kincaid-function/
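
As a rough illustration of how the vowel-group idea might be turned into a yes/no check for usernames, here is a minimal sketch. It is not part of the linked article: the function name `looks_pronounceable`, the limit of four consonants in a row, and the "one vowel group per six letters" ratio are all assumptions picked for the example, and they would need tuning against real usernames.

```php
<?php
// Hypothetical helper (not from the original answer or the linked article).
// It approximates syllables by vowel groups, the same basic idea the
// syllable counter uses. Both thresholds below are illustrative assumptions.
function looks_pronounceable(string $name, int $maxConsonantRun = 4): bool
{
    // Keep letters only, lowercased, so digits and punctuation are ignored.
    $word = preg_replace('/[^a-z]/', '', strtolower($name));
    if ($word === '') {
        return false;
    }

    // Each run of vowels counts as one approximate syllable.
    $vowelGroups = preg_match_all('/[aeiouy]+/', $word);
    if ($vowelGroups === 0) {
        return false; // no vowels at all, so nothing to pronounce
    }

    // Reject consonant clusters longer than the assumed limit.
    $tooLong = '/[^aeiouy]{' . ($maxConsonantRun + 1) . ',}/';
    if (preg_match($tooLong, $word) === 1) {
        return false;
    }

    // Assumed ratio: at least one vowel group per six letters.
    return $vowelGroups * 6 >= strlen($word);
}

var_dump(looks_pronounceable('bilbomoothof')); // bool(true)
var_dump(looks_pronounceable('xkcdqzrtplm'));  // bool(false)
```

A heuristic like this will misjudge some real words with long consonant clusters, so it is probably best treated as one signal among several rather than a verdict on its own.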
