How to check if a string looks randomized, or human-generated and pronounceable?

旧巷少年郎 2020-12-13 03:55

For the purpose of identifying [possible] bot-generated usernames.

Suppose you have a username like "bilbomoothof" .. it may be nonsense, but it still contains pro

10 Answers
  • 2020-12-13 04:21

    Just use CAPTCHA as a part of the registration process.

    You can never distinguish real usernames from bot-created usernames without severely annoying your users.

    You will block users with bizarre or non-English names, which will irritate them, and the bots will just keep trying until they hit an acceptable username (from a dictionary or other sources - this is a very nice one, by the way!).

    EDIT: Looking for prevention rather than after-the-fact analysis?

    The solution is to let somebody else manage users' identities for you. For instance, you can use a small list of OpenID providers (like SO), or Facebook Connect, or both. You'll know for sure that the users are real, and that they have solved at least one CAPTCHA.

    EDIT: Another idea

    Search for the string on Google and check the number of matches found. It shouldn't be your only tool, but it is a good indicator, too. Randomized strings, of course, should have few or no matches.

  • 2020-12-13 04:21

    I don't know of existing algorithms for this problem, but I think it can be attacked in any one of the following ways:

    • Your bot may be rubbish, but you can keep a list of syllables, or more specifically phonemes, and try to find them in the given string. This sounds a bit difficult, because you would need to segment the string in different places, etc.
    • There are 5 vowels in the English alphabet and 21 other letters. If the letters were randomly generated, you would expect approximately 5/26*W of them (where W is the word length) to be vowels, and significant deviations from this could be suspicious. (If other characters are allowed, the ratio changes accordingly, e.g. 5/31 for 31 symbols, and so on.) You can build on this idea by looking at doubletons (pairs of adjacent letters) and checking whether each doubleton occurs with the same probability, etc.
    • Further, you can try to segment your input string around vowels, for example three letters before a vowel and three letters after a vowel, and check whether it makes a recognizable sound by comparing with phonemes.
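
    The vowel-count idea from the second bullet can be sketched roughly as follows (in Python). This is only an illustration under the uniform-random assumption stated above: the function name `vowel_deviation`, the `alphabet_size` parameter, and the use of a binomial standard deviation to measure "significant" are assumptions of this sketch, not part of the answer.

```python
import math
import string

VOWELS = set("aeiou")


def vowel_deviation(name, alphabet_size=26):
    """Return (observed, expected, z) for the number of vowels in `name`.

    Assumes every letter is drawn uniformly from `alphabet_size` symbols,
    5 of which are vowels. `z` is the deviation of the observed count from
    the expected count, measured in binomial standard deviations.
    """
    # Keep only ASCII letters; digits and punctuation are ignored here.
    letters = [c for c in name.lower() if c in string.ascii_lowercase]
    w = len(letters)
    if w == 0:
        raise ValueError("name contains no ASCII letters")

    p = 5 / alphabet_size                 # chance that a random symbol is a vowel
    observed = sum(c in VOWELS for c in letters)
    expected = p * w                      # the 5/26*W from the text
    std = math.sqrt(w * p * (1 - p))      # binomial standard deviation
    return observed, expected, (observed - expected) / std


if __name__ == "__main__":
    for candidate in ("bilbomoothof", "zrgssgbt"):
        print(candidate, vowel_deviation(candidate))
```

    What counts as a "significant" z value is a judgment call; short usernames give noisy estimates, so this is probably best used as one signal among several rather than as a hard filter.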
  • 2020-12-13 04:28

    Off the top of my head, you could look for syllables, making use of soundex. That's the direction I would explore, based on the assumption that a pronounceable word has at least one syllable.

    EDIT: Here's a function for counting syllables:

    function count_syllables($word) {
        // Patterns the vowel-group pass over-counts: subtract one for each match.
        $subsyl = array(
            'cial', 'tia', 'cius', 'cious', 'giu', 'ion', 'iou', 'sia$', '.ely$',
        );

        // Patterns the vowel-group pass under-counts: add one for each match.
        $addsyl = array(
            'ia', 'riet', 'dien', 'iu', 'io', 'ii', '[aeiouym]bl$', '[aeiou]{3}',
            '^mc', 'ism$', '([^aeiouy])\1l$', '[^l]lien', '^coa[dglx].',
            '[^gq]ua[^auieo]', 'dnt$',
        );

        // Based on Greg Fast's Perl module Lingua::EN::Syllables
        // Lowercase the word and strip everything except a-z.
        $word = preg_replace('/[^a-z]/is', '', strtolower($word));

        // Split on consonant runs; every non-empty piece is a vowel group.
        $word_parts = preg_split('/[^aeiouy]+/', $word);
        // Initialise the array so count() below is safe for words without vowels.
        $valid_word_parts = array();
        foreach ($word_parts as $value) {
            if ($value !== '') {
                $valid_word_parts[] = $value;
            }
        }

        $syllables = 0;
        // Thanks to Joe Kovar for correcting a bug in the following lines
        foreach ($subsyl as $syl) {
            $syllables -= preg_match('~' . $syl . '~', $word);
        }
        foreach ($addsyl as $syl) {
            $syllables += preg_match('~' . $syl . '~', $word);
        }
        if (strlen($word) == 1) {
            $syllables++;
        }
        $syllables += count($valid_word_parts);
        // Never report zero syllables.
        $syllables = ($syllables == 0) ? 1 : $syllables;
        return $syllables;
    }
    

    From this very interesting link:

    http://www.addedbytes.com/php/flesch-kincaid-function/
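
    For the username check, note that the function above never returns 0, so a string without any vowels still reports one syllable. A rough way to test the "at least one syllable" assumption directly is to count only the vowel groups, which is the core step of that function. The Python sketch below does just that; the name `vowel_groups` is made up for illustration, and the add/subtract pattern lists are deliberately left out.

```python
import re


def vowel_groups(word):
    """Count runs of consecutive vowels (y included) in `word`.

    Mirrors only the vowel-group step of the syllable counter above:
    non-letters are stripped, then each maximal run of [aeiouy] counts once.
    """
    cleaned = re.sub(r"[^a-z]", "", word.lower())
    return len(re.findall(r"[aeiouy]+", cleaned))


if __name__ == "__main__":
    for candidate in ("bilbomoothof", "zargbyt", "zrgssgbt"):
        print(candidate, vowel_groups(candidate))
```

    A count of 0 means the string has no vowel at all, which fits the "not pronounceable" case; any threshold beyond that would have to be tuned on real usernames.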

  • 2020-12-13 04:30

    You could use a neural network to evaluate whether the nickname looks like a natural-language nickname.

    Assemble two data sets: one of valid nicknames, and one of bogus generated ones. Train a simple back-propagating, single-hidden-layer neural network with the character values as inputs. The neural network can learn to discriminate between strings like "zrgssgbt" and "zargbyt", since the latter has consonants and vowels intermingled.

    It is important to use real-world examples to get a good discriminator.
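
    The answer does not say how the characters should be fed to the network. One common option, assumed here, is a fixed-length one-hot encoding: each position in the nickname gets 26 inputs, exactly one of which is set. The sketch below (Python, standard library only) covers just this input step; `encode`, `MAX_LEN`, and the choice to truncate long names and drop non-letters are hypothetical, and the training itself is left to whatever neural network library you use.

```python
import string

ALPHABET = string.ascii_lowercase
MAX_LEN = 16  # assumed maximum nickname length; longer names are truncated


def encode(name, max_len=MAX_LEN):
    """One-hot encode `name` into a flat list of max_len * 26 floats.

    Position i of the nickname occupies slots [i*26, (i+1)*26); the slot
    matching the letter is 1.0 and all others are 0.0. Characters outside
    a-z are dropped, and short names are padded with all-zero positions.
    """
    vec = [0.0] * (max_len * len(ALPHABET))
    letters = [c for c in name.lower() if c in ALPHABET][:max_len]
    for pos, ch in enumerate(letters):
        vec[pos * len(ALPHABET) + ALPHABET.index(ch)] = 1.0
    return vec


if __name__ == "__main__":
    x = encode("zargbyt")
    print(len(x), sum(x))  # input width and number of active inputs
```

    Each training example would then be such a vector paired with a label (real or generated), drawn from the two data sets described above.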
