Easiest Way to Check if a Java String Instance Might Hold Spam Data

Posted by 本小妞迷上赌 on 2020-01-05 02:34:38

Question


I have a process that iterates over String instances. Each iteration performs a few operations on the String instance, and at the end the instance is persisted.

Now I want to add a check to each iteration for whether the String instance might be spam. I only need to verify that the String instance is not "adult material" spam.

Any recommendations?


Answer 1:


This is a very hard problem that the industry is constantly trying to solve. Your best option is to use an existing solution such as Classifier4J, together with a blacklist data source, to identify spam.




Answer 2:


You need to apply some Bayesian logic, which is, among other things, what Classifier4J (which Andrew mentioned) does under the covers.

Paul Graham wrote a good article about this a few years back - http://www.paulgraham.com/spam.html.
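
To make the Bayesian idea concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. It is not Classifier4J's implementation, nor the exact formula from Paul Graham's article. The class name `NaiveBayesSpamSketch`, its methods, and the tokenization rule are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a naive Bayes spam scorer; not a production classifier. */
final class NaiveBayesSpamSketch {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int spamTokens = 0;
    private int hamTokens = 0;
    private int spamDocs = 0;
    private int hamDocs = 0;

    // Assumption: lower-case the text and split on anything that is not a letter or digit.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    /** Adds one labelled example (spam or legitimate) to the training counts. */
    void train(String text, boolean isSpam) {
        if (isSpam) {
            spamDocs++;
        } else {
            hamDocs++;
        }
        for (String token : tokenize(text)) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when text starts with punctuation
            }
            vocabulary.add(token);
            if (isSpam) {
                spamCounts.merge(token, 1, Integer::sum);
                spamTokens++;
            } else {
                hamCounts.merge(token, 1, Integer::sum);
                hamTokens++;
            }
        }
    }

    /** Returns the estimated probability, between 0 and 1, that the text is spam. */
    double spamProbability(String text) {
        if (spamDocs == 0 || hamDocs == 0) {
            throw new IllegalStateException("Train with at least one spam and one non-spam example");
        }
        int totalDocs = spamDocs + hamDocs;
        // Work in log space so that long texts do not underflow to zero.
        double logSpam = Math.log((double) spamDocs / totalDocs);
        double logHam = Math.log((double) hamDocs / totalDocs);
        int vocabSize = vocabulary.size();
        for (String token : tokenize(text)) {
            if (token.isEmpty()) {
                continue;
            }
            // Add-one (Laplace) smoothing keeps unseen words from zeroing out a class.
            logSpam += Math.log((spamCounts.getOrDefault(token, 0) + 1.0) / (spamTokens + vocabSize));
            logHam += Math.log((hamCounts.getOrDefault(token, 0) + 1.0) / (hamTokens + vocabSize));
        }
        // Convert the two log scores into a normalized probability for the spam class.
        return 1.0 / (1.0 + Math.exp(logHam - logSpam));
    }
}
```

In the process described in the question, you would call `train` once per labelled sample up front, then call `spamProbability` on each String before persisting it and compare the result against a threshold of your choosing. The quality of the result depends almost entirely on how much labelled training data you have.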




Answer 3:


You could try writing your own classifier, but if you have guaranteed network access, how about just using Akismet and the Java bindings? It's pretty good at finding spam.

You'll need to take the network connectivity and licensing into consideration.




Answer 4:


The easiest way is simply to check against known spam words. The problem is that it's easy to get false positives, because the same word can mean different things in different contexts. You either need to hand-pick the word list and include only words that have no legitimate use, or opt for a more heavyweight solution.
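
A word-list check like the one described above might look like the following sketch. The class name `SpamWordFilter` and its method are hypothetical, and the word list itself is something you would have to supply. It matches whole words only, which avoids one common kind of false positive: a blocked word appearing inside a longer, innocent word.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical helper: flags text that contains any word from a hand-picked blocklist. */
final class SpamWordFilter {
    // Splits on anything that is not a letter or digit, so "pills!" yields the token "pills".
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{N}]+");

    private final Set<String> blockedWords = new HashSet<>();

    SpamWordFilter(Set<String> words) {
        // Store lower-cased copies so that matching is case-insensitive.
        for (String word : words) {
            blockedWords.add(word.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns true if any whole word in the text is on the blocklist. */
    boolean mightBeSpam(String text) {
        if (text == null || text.isEmpty()) {
            return false;
        }
        for (String token : NON_WORD.split(text.toLowerCase(Locale.ROOT))) {
            if (blockedWords.contains(token)) {
                return true;
            }
        }
        return false;
    }
}
```

With a blocklist containing "sex", this flags "hot sex now" but not "Middlesex University", because only complete tokens are compared. It does nothing about deliberate misspellings or words that are legitimate in some contexts, which is where the heavier solutions come in.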



Source: https://stackoverflow.com/questions/1158877/easiest-way-to-check-if-a-java-string-instance-might-hold-spam-data
