How to determine the (natural) language of a document?


情话喂你 2020-12-24 07:16

I have a set of documents in two languages: English and German. There is no usable meta information about these documents; a program can look only at the content. Based on t

11 Answers

鱼传尺愫 (original poster)

2020-12-24 08:09

I believe the standard procedure is to measure the quality of a proposed algorithm against test data (i.e. a corpus). Define the percentage of correct classifications that you would like the algorithm to achieve, and then run it over a number of documents that you have classified manually.
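To make that procedure concrete, here is a minimal sketch of an accuracy check against a hand-labeled corpus. Everything in it is an assumption for illustration: the `accuracy` helper, the `toy_classify` placeholder (swap in whatever algorithm you actually want to test), the four sample documents, and the 95% target.

```python
from typing import Callable, Iterable, Tuple


def accuracy(classify: Callable[[str], str],
             labeled_docs: Iterable[Tuple[str, str]]) -> float:
    """Return the fraction of documents whose predicted label matches the manual label."""
    docs = list(labeled_docs)
    if not docs:
        raise ValueError("need at least one labeled document")
    correct = sum(1 for text, label in docs if classify(text) == label)
    return correct / len(docs)


def toy_classify(text: str) -> str:
    """Placeholder classifier (hypothetical): guess German if a common German word appears."""
    words = set(text.lower().split())
    return "de" if words & {"und", "der", "das", "ist"} else "en"


# Hypothetical manually classified corpus: (text, label) pairs.
CORPUS = [
    ("The cat sat on the mat and slept", "en"),
    ("Der Hund und die Katze schlafen", "de"),
    ("This is a short report", "en"),
    ("Das ist ein kurzer Bericht", "de"),
]

TARGET = 0.95  # assumed target; pick the percentage you decided on up front
score = accuracy(toy_classify, CORPUS)
print(f"accuracy={score:.2%}, target met: {score >= TARGET}")
```

A real corpus would need far more than four documents for the percentage to mean anything.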

As for the specific algorithm: using a list of stop words sounds fine. Another approach that has been reported to work is to use a Bayesian filter, e.g. SpamBayes. Rather than training it on ham and spam, train it on English and German. Use a portion of your corpus, run that through SpamBayes, and then test it on the complete data.
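Both ideas can be sketched in a few lines. The code below is only an illustration and does not use the SpamBayes API: the Bayesian part is a hand-rolled multinomial naive Bayes with add-one smoothing standing in for it, and the stop-word lists, function names, and training sentences are all made up for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word lists; real lists would be much longer.
STOP_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it", "with", "for"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "mit", "für", "zu"},
}


def tokenize(text):
    """Lowercase the text and return runs of letters (umlauts included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def classify_by_stop_words(text):
    """Pick the language whose stop words occur most often; ties go to 'en'."""
    tokens = tokenize(text)
    counts = {lang: sum(1 for t in tokens if t in words)
              for lang, words in STOP_WORDS.items()}
    return max(counts, key=counts.get)


class NaiveBayes:
    """Minimal multinomial naive Bayes over word counts (a stand-in, not SpamBayes)."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents

    def train(self, text, label):
        self.word_counts.setdefault(label, Counter()).update(tokenize(text))
        self.doc_counts[label] += 1

    def classify(self, text):
        vocab = set().union(*self.word_counts.values())
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (total + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = NaiveBayes()
nb.train("The weather is nice and the sun is shining", "en")
nb.train("Das Wetter ist schön und die Sonne scheint", "de")

print(classify_by_stop_words("Der Bericht ist nicht in dem Ordner"))
print(nb.classify("Die Sonne ist schön"))
```

With only two languages and documents of reasonable length, the stop-word count alone may already be enough; the Bayesian variant mainly helps when the text is short or contains few stop words.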
