For the purpose of identifying [possible] bot-generated usernames.
Suppose you have a username like "bilbomoothof" ... it may be nonsense, but it still contains pro
I guess you could do something like that if you restricted yourself to sounds that are pronounceable in English. For me (I am French), words like Szczepan or Wawrzyniec are unpronounceable and look fairly random.
But they are actually Polish first names (meaning Steven and Lawrence)...
I agree with Mac. But more than that, people sometimes have usernames that aren't pronounceable, like qwerty or rtfmorleave.
Why bother with that?
< obsolete and false, but i don't delete because of comments >
But more than that, no bots use 'zetztzgsd' as a username; they have dictionaries of real names, possible nicknames, etc., so I think this would be a waste of time for you.
< / obsolete and false, but i don't delete because of comments>
Look up n-gram analysis. It is successfully used to automatically detect text language and works surprisingly well even on very short texts.
The online demo (no longer online) recognized 'bilbomoothof' as English and 'sdfgbhm342r3f' as Nepali. It probably always returns the best match, even if it's a very poor one. I think you could train it to discern between 'pronounceable' and 'random'.
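As a rough illustration of what "training it" could look like, here is a minimal sketch of a character trigram model with add-one smoothing. The function names, the tiny word list, and the `alphabet_size` of 37 (26 letters, 10 digits, one end marker) are all assumptions made for this example; a real setup would train on a large list of words or known-good usernames and pick a score threshold empirically.

```python
import math
from collections import Counter

def train_trigram_model(words):
    """Count character trigrams and their two-character contexts.

    '^' pads the start of each word and '$' marks its end.
    """
    trigrams, contexts = Counter(), Counter()
    for word in words:
        padded = "^^" + word.lower() + "$"
        for i in range(len(padded) - 2):
            trigrams[padded[i:i + 3]] += 1
            contexts[padded[i:i + 2]] += 1
    return trigrams, contexts

def average_log_prob(name, model, alphabet_size=37):
    """Mean add-one-smoothed log-probability per character.

    Higher (closer to zero) means the name looks more like the training words.
    alphabet_size is an assumed symbol count used only for smoothing.
    """
    trigrams, contexts = model
    padded = "^^" + name.lower() + "$"
    steps = len(padded) - 2
    total = 0.0
    for i in range(steps):
        tri = padded[i:i + 3]
        total += math.log((trigrams[tri] + 1) / (contexts[tri[:2]] + alphabet_size))
    return total / steps

# Toy training data (hypothetical); far too small for real use.
training_words = ["bilbo", "moth", "tooth", "roof", "hobbit", "smooth", "of", "both"]
model = train_trigram_model(training_words)

for candidate in ["bilbomoothof", "sdfgbhm342r3f"]:
    print(candidate, round(average_log_prob(candidate, model), 3))
```

Averaging per character keeps long and short names comparable. Where to draw the line between "pronounceable" and "random" is not something this sketch decides; that would need labeled examples.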
Unfortunately, this cannot be done in general: the Kolmogorov complexity function is not computable, so no algorithm can decide this for arbitrary strings. If you apply some rules that restrict the domain of possible usernames, you can perform heuristic analysis and make a decision, but even then it is really hard to do.
PS: After posting this answer, I came across a service that suggested an example of such a domain restriction: let users use their mailbox address at a well-known public domain as their username.
In Russian, we have forbidden syllables, like ГЙ, a Ъ or Ь after a vowel, and so on.
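A rule check of this kind is easy to sketch. The example below only encodes the two patterns mentioned above (the pair ГЙ, and Ъ or Ь directly after a vowel); the function name and the idea of using a single regular expression are assumptions for illustration, and a real rule set would be longer.

```python
import re

# The ten Russian vowel letters.
VOWELS = "аеёиоуыэюя"

# Only the patterns named in the text: "гй", or a hard/soft sign right after a vowel.
FORBIDDEN = re.compile(rf"гй|[{VOWELS}][ъь]")

def has_forbidden_sequence(name: str) -> bool:
    """Return True if the lowercased name contains one of the forbidden sequences."""
    return FORBIDDEN.search(name.lower()) is not None

print(has_forbidden_sequence("Игорь"))  # soft sign follows a consonant
print(has_forbidden_sequence("коъ"))    # hard sign follows a vowel
```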
However, spam bots just use a database of names, which is why my spam inbox is full of strange names you would only meet in history books.
I expect English to have syllable distribution histograms too (like ETAOIN SHRDLU, but for two-letter or even three-letter syllables), and having a critical density of low-frequency syllables in one name is certainly a sign.
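To make the "density of low-frequency syllables" idea concrete, here is a small sketch that treats two-letter sequences (bigrams) as the unit and reports what fraction of a name's bigrams are rare in a reference corpus. The function names, the `min_count` cutoff, and the toy corpus are assumptions for this example; what counts as a "critical" density would have to be tuned on real data.

```python
from collections import Counter

def bigram_counts(words):
    """Build a histogram of two-letter sequences from reference words."""
    counts = Counter()
    for word in words:
        w = word.lower()
        counts.update(w[i:i + 2] for i in range(len(w) - 1))
    return counts

def rare_bigram_density(name, counts, min_count=1):
    """Fraction of the name's bigrams seen fewer than min_count times in the corpus."""
    n = name.lower()
    bigrams = [n[i:i + 2] for i in range(len(n) - 1)]
    if not bigrams:
        return 0.0  # names shorter than two characters have no bigrams
    rare = sum(1 for b in bigrams if counts[b] < min_count)
    return rare / len(bigrams)

# Toy reference corpus (hypothetical); a real histogram needs a large word list.
corpus = ["bilbo", "moth", "tooth", "roof", "smooth"]
counts = bigram_counts(corpus)

for candidate in ["bilbomoothof", "zxqj"]:
    print(candidate, rare_bigram_density(candidate, counts))
```

The same structure works for three-letter sequences by changing the slice width.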
Note that many large sites suggest usernames like [first init][middle init][last name][number]. The users then carry these usernames over to other sites, and the first three letters are definitely not pronounceable.