How to check if a string looks randomized, or human-generated and pronounceable?

旧巷少年郎 2020-12-13 03:55

For the purpose of identifying [possible] bot-generated usernames.

Suppose you have a username like "bilbomoothof" .. it may be nonsense, but it still contains pro

10 answers
  •  无人及你
    2020-12-13 04:21

    I don't know of any existing algorithm for this problem, but I think it can be attacked in any of the following ways:

    • Your bot may be rubbish, but you can keep a list of syllables, or more specifically phonemes, and try to find them in the given string. This sounds a bit difficult, though, because you would need to try segmenting the string at different places.
    • There are 5 vowels in the English alphabet and 21 other letters. If a string were generated uniformly at random, you would expect approximately 5/26*W of its letters to be vowels (where W is the word length), and significant deviations from this could be suspicious. (If the generator also draws from other characters, adjust the denominator: with the ten digits, for example, it becomes 5/36.) You can build on this idea by counting letter pairs (bigrams) and checking whether each pair occurs with about the same probability.
    • Further, you can try to segment the input string around its vowels, for example three letters before a vowel and three letters after it, and check whether each segment makes a recognizable sound by comparing it with known phonemes.
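A rough sketch of the vowel-count idea in Python, assuming a uniform draw over the 26 lowercase ASCII letters as the "random" baseline. The function names `vowel_z_score` and `longest_consonant_run` are made up for illustration, and the consonant-run check is only a crude stand-in for the vowel-based segmentation, not a real phoneme comparison.

```python
from string import ascii_lowercase

VOWELS = set("aeiou")


def vowel_z_score(name: str, alphabet_size: int = 26) -> float:
    """How far the vowel count is from the uniform-random expectation.

    Assumes each letter is drawn independently and uniformly from an
    alphabet of `alphabet_size` symbols, 5 of which are vowels, so the
    vowel count is binomial with mean p*W and variance W*p*(1-p).
    """
    letters = [c for c in name.lower() if c in ascii_lowercase]
    w = len(letters)
    if w == 0:
        return 0.0
    p = len(VOWELS) / alphabet_size
    expected = p * w
    std = (w * p * (1 - p)) ** 0.5
    observed = sum(c in VOWELS for c in letters)
    return (observed - expected) / std


def longest_consonant_run(name: str) -> int:
    """Length of the longest stretch of consecutive consonants."""
    longest = current = 0
    for c in name.lower():
        if c in ascii_lowercase and c not in VOWELS:
            current += 1
            longest = max(longest, current)
        else:
            # A vowel (or any non-letter) ends the current run.
            current = 0
    return longest


if __name__ == "__main__":
    for candidate in ["bilbomoothof", "xkqzvtrw"]:
        print(candidate, vowel_z_score(candidate), longest_consonant_run(candidate))
```

A score near zero means the vowel count matches what uniform random letters would give, while a long consonant run suggests the string is hard to pronounce. Any threshold for either value would have to be tuned on real usernames.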
