For the purpose of identifying [possible] bot-generated usernames.
Suppose you have a username like "bilbomoothof" ... it may be nonsense, but it still contains pro
Just use CAPTCHA as a part of the registration process.
You cannot reliably distinguish real usernames from bot-created usernames without severely annoying your users.
You will block users with bizarre or non-English names, which will irritate them, and the bots will just keep trying until they hit an acceptable username (from a dictionary or other sources; this is a very nice one, by the way!).
EDIT: Looking for prevention rather than after-the-fact analysis?
The solution is to let somebody else manage users' identities for you. For instance, you can use a small list of OpenID providers (like SO), Facebook Connect, or both. You can then be reasonably confident that the users are real, and that they have solved at least one CAPTCHA.
EDIT: Another idea
Search for the string on Google and check the number of matches found. It shouldn't be your only tool, but it is a good indicator. Randomized strings should, of course, have few or no matches.
I don't know of any existing algorithms for this problem, but I think it can be attacked in any one of the following ways:
Off the top of my head, you could look for syllables, making use of Soundex. That's the direction I would explore, based on the assumption that a pronounceable word has at least one syllable.
EDIT: Here's a function for counting syllables:
function count_syllables($word) {
    // Patterns that each subtract one syllable from the count
    $subsyl = array(
        'cial'
        ,'tia'
        ,'cius'
        ,'cious'
        ,'giu'
        ,'ion'
        ,'iou'
        ,'sia$'
        ,'.ely$'
    );
    // Patterns that each add one syllable to the count
    $addsyl = array(
        'ia'
        ,'riet'
        ,'dien'
        ,'iu'
        ,'io'
        ,'ii'
        ,'[aeiouym]bl$'
        ,'[aeiou]{3}'
        ,'^mc'
        ,'ism$'
        ,'([^aeiouy])\1l$'
        ,'[^l]lien'
        ,'^coa[dglx].'
        ,'[^gq]ua[^auieo]'
        ,'dnt$'
    );
    // Based on Greg Fast's Perl module Lingua::EN::Syllables
    // Lowercase the word and strip everything that is not a letter
    $word = preg_replace('/[^a-z]/', '', strtolower($word));
    // Split on consonant runs; the non-empty pieces are the vowel groups
    $word_parts = preg_split('/[^aeiouy]+/', $word);
    // Initialize so count() below is safe when the word has no vowels
    $valid_word_parts = array();
    foreach ($word_parts as $key => $value) {
        if ($value !== '') {
            $valid_word_parts[] = $value;
        }
    }
    $syllables = 0;
    // Thanks to Joe Kovar for correcting a bug in the following lines
    foreach ($subsyl as $syl) {
        $syllables -= preg_match('~'.$syl.'~', $word);
    }
    foreach ($addsyl as $syl) {
        $syllables += preg_match('~'.$syl.'~', $word);
    }
    if (strlen($word) == 1) {
        $syllables++;
    }
    $syllables += count($valid_word_parts);
    // Never report fewer than one syllable
    $syllables = ($syllables == 0) ? 1 : $syllables;
    return $syllables;
}
From this very interesting link:
http://www.addedbytes.com/php/flesch-kincaid-function/
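Note that the function above never returns less than 1, so the syllable count alone cannot flag a vowel-less string. As a rough illustration of how the "at least one syllable" idea might be turned into a username check, here is a minimal Python sketch. It is not part of the linked code: the helper names (`vowel_groups`, `longest_consonant_run`, `looks_pronounceable`) and the `max_run` threshold are assumptions made for this example, and vowel groups are only a crude stand-in for real syllables.

```python
import re

def _letters(name: str) -> str:
    """Lowercase the name and keep only a-z, like the PHP function does."""
    return re.sub(r"[^a-z]", "", name.lower())

def vowel_groups(name: str) -> int:
    """Count runs of consecutive vowels (y counted as a vowel)."""
    return len(re.findall(r"[aeiouy]+", _letters(name)))

def longest_consonant_run(name: str) -> int:
    """Length of the longest run of consecutive consonants."""
    runs = re.findall(r"[^aeiouy]+", _letters(name))
    return max((len(run) for run in runs), default=0)

def looks_pronounceable(name: str, max_run: int = 4) -> bool:
    # Hypothetical rule: at least one vowel group and no consonant
    # cluster longer than max_run. The threshold is an assumption and
    # would need tuning against real usernames.
    return vowel_groups(name) >= 1 and longest_consonant_run(name) <= max_run

for candidate in ("bilbomoothof", "zargbyt", "zrgssgbt"):
    print(candidate, looks_pronounceable(candidate))
```

A check like this will still misjudge legitimate names (long consonant clusters are normal in some languages), so it is better treated as one signal among several than as a filter.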
You could use a neural network to evaluate whether the nickname looks like a natural-language nickname.
Assemble two data sets: one of valid nicknames, and one of bogus generated ones. Train a simple back-propagating neural network with a single hidden layer, using the character values as inputs. The neural network can learn to discriminate between strings like "zrgssgbt" and "zargbyt", since the latter has consonants and vowels intermingled.
It is important to use real-world examples to get a good discriminator.
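To make the idea concrete, here is a minimal sketch of such a network in plain Python (standard library only). Everything specific in it is an assumption for illustration: the fixed input width `MAX_LEN`, the hidden-layer size `HIDDEN`, the learning rate, the way letters are scaled to numbers, and above all the tiny hand-made sample list, which is far too small to yield a useful classifier.

```python
import math
import random

MAX_LEN = 8   # assumed fixed input width; longer names are truncated
HIDDEN = 6    # assumed hidden-layer size

def encode(name):
    """Map each letter to a value in (0, 1]; pad short names with 0.0."""
    letters = [c for c in name.lower() if "a" <= c <= "z"][:MAX_LEN]
    values = [(ord(c) - ord("a") + 1) / 26.0 for c in letters]
    return values + [0.0] * (MAX_LEN - len(values))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyNet:
    """Single hidden layer, one sigmoid output, trained by backpropagation."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # The extra weight in each row acts as the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(MAX_LEN + 1)]
                         for _ in range(HIDDEN)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN + 1)]

    def forward(self, inputs):
        x = inputs + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)))
                  for row in self.w_hidden]
        out = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, out

    def predict(self, name):
        return self.forward(encode(name))[1]

    def train_step(self, inputs, target, rate=0.5):
        hidden, out = self.forward(inputs)
        # Gradient of 0.5 * squared error through the output sigmoid.
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden deltas are computed before the output weights change.
        delta_hidden = [delta_out * self.w_out[j] * hidden[j] * (1.0 - hidden[j])
                        for j in range(HIDDEN)]
        for j, h in enumerate(hidden + [1.0]):
            self.w_out[j] -= rate * delta_out * h
        for j in range(HIDDEN):
            for i, v in enumerate(inputs + [1.0]):
                self.w_hidden[j][i] -= rate * delta_hidden[j] * v

def mean_squared_error(net, samples):
    return sum((net.predict(n) - t) ** 2 for n, t in samples) / len(samples)

# Hypothetical toy data: 1.0 = plausible nickname, 0.0 = random-looking string.
SAMPLES = [
    ("samwise", 1.0), ("gandalf", 1.0), ("zargbyt", 1.0), ("meriadoc", 1.0),
    ("zrgssgbt", 0.0), ("xkqjvbn", 0.0), ("qwrtpsd", 0.0), ("hjklmnbv", 0.0),
]

net = TinyNet(seed=0)
loss_before = mean_squared_error(net, SAMPLES)
for _ in range(2000):
    for name, target in SAMPLES:
        net.train_step(encode(name), target)
loss_after = mean_squared_error(net, SAMPLES)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Feeding raw character values keeps the sketch close to the description above, but a one-hot or vowel/consonant encoding would probably be easier for the network to learn from. With real data, hold out part of both sets to check that the network generalizes rather than memorizes.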