How to determine the (natural) language of a document?

情话喂你 2020-12-24 07:16

I have a set of documents in two languages: English and German. There is no usable meta information about these documents; a program can look at the content only. Based on t

11 answers
  •  鱼传尺愫
    2020-12-24 08:09

    I believe the standard procedure is to measure the quality of a proposed algorithm against test data (i.e. a corpus). Decide what percentage of documents you want the algorithm to classify correctly, then run it over a set of documents that you have classified manually and compare the results.
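    The evaluation step described above might look like the following minimal sketch. The names `accuracy`, `toy_classifier`, `sample`, and `TARGET` are hypothetical, and the classifier is a deliberately crude placeholder; substitute whatever algorithm you are actually testing.

```python
from typing import Callable, Iterable, Tuple


def accuracy(classify: Callable[[str], str],
             labeled_docs: Iterable[Tuple[str, str]]) -> float:
    """Return the fraction of documents whose predicted label
    matches the manually assigned label."""
    total = 0
    correct = 0
    for text, expected in labeled_docs:
        total += 1
        if classify(text) == expected:
            correct += 1
    if total == 0:
        raise ValueError("no labeled documents supplied")
    return correct / total


def toy_classifier(text: str) -> str:
    # Placeholder rule for illustration only: "de" if the word
    # "der" occurs, otherwise "en". Not a serious detector.
    return "de" if " der " in f" {text.lower()} " else "en"


# Hypothetical hand-labeled sample; a real corpus would be much larger.
sample = [
    ("The cat sat on the mat", "en"),
    ("Der Hund liegt auf dem Sofa", "de"),
    ("Das ist ein Haus", "de"),
    ("This is a house", "en"),
]

TARGET = 0.95  # assumed quality goal, chosen by you up front
score = accuracy(toy_classifier, sample)
print(f"accuracy = {score:.2f}, target met: {score >= TARGET}")
```

    Here the placeholder misses the third document, so it falls short of the assumed target; that is exactly the kind of signal this harness is meant to give.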

    As for the specific algorithm: using a list of stop words sounds fine. Another approach that has been reported to work is a Bayesian filter such as SpamBayes. Rather than training it on ham and spam, train it on English and German. Train SpamBayes on a portion of your corpus, and then test it on the complete data.
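    Both ideas can be sketched in plain Python. This is not SpamBayes and does not use its API; it is a small word-level naive Bayes classifier written from scratch, next to a stop-word counter. The stop-word lists, the tokenizer, and all names (`classify_by_stop_words`, `NaiveBayesLanguage`) are illustrative assumptions.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-zäöüß]+")

# Tiny illustrative stop-word lists; real lists would be longer.
EN_STOP = {"the", "and", "of", "to", "is", "in", "that", "it", "with", "for"}
DE_STOP = {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "für"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return TOKEN_RE.findall(text.lower())


def classify_by_stop_words(text):
    """Pick the language whose stop words occur more often."""
    tokens = tokenize(text)
    en_hits = sum(t in EN_STOP for t in tokens)
    de_hits = sum(t in DE_STOP for t in tokens)
    if en_hits == de_hits:
        return "unknown"
    return "en" if en_hits > de_hits else "de"


class NaiveBayesLanguage:
    """Minimal multinomial naive Bayes over words (a sketch)."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of words
        self.doc_counts = Counter()  # label -> number of training docs
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        if not self.doc_counts:
            raise ValueError("classifier has not been trained")
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Add-one smoothing; the extra +1 leaves room for unseen words.
            denom = sum(counts.values()) + len(self.vocab) + 1
            score = math.log(self.doc_counts[label] / total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = NaiveBayesLanguage()
nb.train("The quick brown fox jumps over the lazy dog", "en")
nb.train("This is a simple sentence in the English language", "en")
nb.train("Der schnelle braune Fuchs springt über den faulen Hund", "de")
nb.train("Dies ist ein einfacher Satz in der deutschen Sprache", "de")

print(classify_by_stop_words("Das Haus ist nicht in der Stadt"))
print(nb.classify("the dog is lazy"))
```

    The stop-word counter needs no training but depends on the quality of the lists; the Bayesian variant learns its vocabulary from the labeled portion of the corpus instead.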
