Neural networks for email spam detection

半阙折子戏 2020-12-23 18:07

Let's say you have access to an email account with a history of received emails from the last few years (~10k emails), classified into 2 groups:

  • genuine email
4 answers
  •  抹茶落季
    2020-12-23 18:19

    If you insist on NNs... I would calculate some features for every email

    Character-based, word-based, and vocabulary features (about 97 by my count):

    1. Total no. of characters (C)
    2. Total no. of alpha chars / C (ratio of alpha chars)
    3. Total no. of digit chars / C
    4. Total no. of whitespace chars / C
    5. Frequency of each letter / C (36 letters of the keyboard: A-Z, 0-9)
    6. Frequency of special chars (10 chars: * _ + = % $ @ ـ \ /)
    7. Total no. of words (M)
    8. Total no. of short words / M (two letters or less)
    9. Total no. of chars in words / C
    10. Average word length
    11. Avg. sentence length in chars
    12. Avg. sentence length in words
    13. Word length freq. distribution / M (ratio of words of length n, n between 1 and 15)
    14. Type-token ratio (no. of unique words / M)
    15. Hapax legomena (freq. of once-occurring words)
    16. Hapax dislegomena (freq. of twice-occurring words)
    17. Yule’s K measure
    18. Simpson’s D measure
    19. Sichel’s S measure
    20. Brunet’s W measure
    21. Honoré’s R measure
    22. Frequency of punctuation (18 punctuation chars: . ، ; ? ! : ( ) – “ « » < > [ ] { })
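As a rough illustration of the list above, here is a minimal sketch that computes a handful of the character-based and word-based features for one email body. The function name `extract_features`, the dictionary keys, and the ASCII-only tokenizer regex are assumptions made for this example, not part of the original answer; a real pipeline would need a tokenizer that suits the language of the emails.

```python
import re
import string
from collections import Counter


def extract_features(text):
    """Compute a few of the listed features for one email body.

    Hypothetical helper: names and tokenization are illustrative only.
    """
    C = len(text)                               # feature 1: total characters
    words = re.findall(r"[A-Za-z0-9']+", text)  # crude ASCII tokenizer (assumption)
    M = len(words)                              # feature 7: total words
    safe_c = C or 1                             # avoid division by zero on empty emails
    safe_m = M or 1
    type_counts = Counter(w.lower() for w in words)

    features = {
        "total_chars": C,
        "alpha_ratio": sum(ch.isalpha() for ch in text) / safe_c,
        "digit_ratio": sum(ch.isdigit() for ch in text) / safe_c,
        "whitespace_ratio": sum(ch.isspace() for ch in text) / safe_c,
        "total_words": M,
        "short_word_ratio": sum(len(w) <= 2 for w in words) / safe_m,
        "avg_word_length": sum(len(w) for w in words) / safe_m,
        "type_token_ratio": len(type_counts) / safe_m,
        "hapax_legomena": sum(1 for c in type_counts.values() if c == 1),
    }

    # Feature 5: frequency of each of the 36 keyboard "letters" (a-z, 0-9) / C.
    lowered = text.lower()
    for ch in string.ascii_lowercase + string.digits:
        features[f"freq_{ch}"] = lowered.count(ch) / safe_c

    return features


if __name__ == "__main__":
    feats = extract_features("Buy now! Buy 2 get 1 free")
    print(feats["total_chars"], feats["total_words"], feats["hapax_legomena"])
```

The remaining features (sentence lengths, word-length distribution, punctuation counts) follow the same pattern: count something, then divide by C or M.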

    You could also add some more features based on the formatting: colors, fonts, sizes, ... used.

    Most of these measures can be found online, in papers, or even Wikipedia (they're all simple calculations, probably based on the other features).
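To show how the vocabulary-richness measures build on the simpler counts, here is a sketch using commonly cited definitions, which you should check against whichever paper you follow: Yule's K = 10^4 (Σ i²·V_i − N) / N², Simpson's D = Σ V_i · (i/N) · ((i−1)/(N−1)), Sichel's S = V_2 / V, and Honoré's R = 100 · ln(N) / (1 − V_1/V), where N is the token count, V the number of distinct words, and V_i the number of words occurring exactly i times. The function name and the fallback value of 0.0 for undefined cases are assumptions for this example.

```python
import math
from collections import Counter


def vocabulary_richness(words):
    """Vocabulary-richness measures from a list of word tokens.

    Hypothetical helper; formulas follow commonly cited definitions.
    """
    tokens = [w.lower() for w in words]
    N = len(tokens)                            # total tokens
    type_counts = Counter(tokens)              # word -> occurrences
    V = len(type_counts)                       # distinct words
    spectrum = Counter(type_counts.values())   # i -> V_i (words seen exactly i times)
    V1 = spectrum.get(1, 0)                    # hapax legomena
    V2 = spectrum.get(2, 0)                    # hapax dislegomena

    yule_k = 0.0
    if N:
        yule_k = 1e4 * (sum(i * i * v for i, v in spectrum.items()) - N) / (N * N)

    simpson_d = 0.0
    if N > 1:
        simpson_d = sum(v * (i / N) * ((i - 1) / (N - 1)) for i, v in spectrum.items())

    sichel_s = V2 / V if V else 0.0

    # Honore's R is undefined when every word occurs exactly once (V1 == V);
    # returning 0.0 there is an arbitrary placeholder chosen for this sketch.
    honore_r = 100 * math.log(N) / (1 - V1 / V) if V and V1 < V else 0.0

    return {
        "yule_k": yule_k,
        "simpson_d": simpson_d,
        "sichel_s": sichel_s,
        "honore_r": honore_r,
    }


if __name__ == "__main__":
    print(vocabulary_richness(["a", "a", "b", "b", "c"]))
```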

    So with about 100 features, you need 100 inputs, some number of nodes in a hidden layer, and one output node.
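The layout described here (about 100 inputs, one hidden layer, one output node) could look like the following pure-Python sketch. The class name `TinyMLP`, the hidden size of 10, the tanh/sigmoid activations, and the plain per-sample gradient step are all assumptions for illustration; in practice you would more likely use an existing library.

```python
import math
import random


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class TinyMLP:
    """One hidden tanh layer and a single sigmoid output (spam probability)."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        s1 = 1.0 / math.sqrt(n_inputs)
        s2 = 1.0 / math.sqrt(n_hidden)
        self.w1 = [[rng.uniform(-s1, s1) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-s2, s2) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        z = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return hidden, _sigmoid(z)

    def predict_proba(self, x):
        return self.forward(x)[1]

    def train_step(self, x, y, lr=0.1):
        """One gradient step on binary cross-entropy for a single example."""
        hidden, p = self.forward(x)
        delta_out = p - y  # gradient of the loss w.r.t. the output pre-activation
        for j, h in enumerate(hidden):
            # Backpropagate through tanh using the weight from before its update.
            delta_h = delta_out * self.w2[j] * (1.0 - h * h)
            self.w2[j] -= lr * delta_out * h
            self.b1[j] -= lr * delta_h
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * delta_h * xi
        self.b2 -= lr * delta_out


if __name__ == "__main__":
    net = TinyMLP(n_inputs=100, n_hidden=10)
    example = [0.1] * 100          # one already-normalized feature vector
    for _ in range(100):
        net.train_step(example, 1)  # label 1 = spam in this sketch
    print(net.predict_proba(example))
```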

    The inputs would need to be normalized according to your current pre-classified corpus.
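One way to read "normalized according to your corpus" is per-feature standardization: compute each feature's mean and standard deviation on the training emails, then apply those same statistics to every email, including unseen ones. The helper names below are hypothetical, and z-scoring is only one reasonable choice (min-max scaling would also fit the description).

```python
import statistics


def fit_scaler(rows):
    """Compute the per-feature mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # A constant feature has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


if __name__ == "__main__":
    train_rows = [[1.0, 10.0], [3.0, 10.0]]
    means, stds = fit_scaler(train_rows)
    print(transform(train_rows, means, stds))
```

Fitting the statistics on the training group only keeps information about the test emails out of the model.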

    I'd split the corpus into two groups, one for training and one for testing, and never mix them. Maybe use a 50/50 train/test ratio, with similar spam/non-spam proportions in each group.
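A split that keeps similar spam/non-spam ratios in both groups is usually called a stratified split. Here is a small sketch of one; the function name, the fixed seed, and the 0/1 label encoding are assumptions for this example.

```python
import random


def stratified_split(samples, labels, test_fraction=0.5, seed=0):
    """Split into disjoint train/test groups, preserving the label ratio."""
    rng = random.Random(seed)

    # Group sample indices by label so each class is split separately.
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = round(len(indices) * test_fraction)
        test_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])

    train = ([samples[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([samples[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


if __name__ == "__main__":
    emails = list(range(100))        # stand-ins for feature vectors
    is_spam = [1] * 20 + [0] * 80    # 20% spam in this made-up corpus
    (train_x, train_y), (test_x, test_y) = stratified_split(emails, is_spam)
    print(len(train_x), sum(train_y), len(test_x), sum(test_y))
```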
