Detecting language using Stanford NLP

我寻月下人不归 2020-12-31 13:25

I'm wondering whether it is possible to use Stanford CoreNLP to detect which language a sentence is written in. If so, how precise can those algorithms be?

2 Answers
  •  旧时难觅i
    2020-12-31 14:02

    Almost certainly there is no language identification in Stanford CoreNLP at the moment. I say 'almost' because nonexistence is much harder to prove.

    EDIT: Nevertheless, here is some circumstantial evidence:

    1. There is no mention of language identification on the main page, on the CoreNLP page, or in the FAQ (although there is a question 'How do I run CoreNLP on other languages?'), nor in the 2014 paper by CoreNLP's authors;
    2. Tools that combine several NLP libraries, including Stanford CoreNLP, use a different library for language identification, for example DKPro Core ASL; other users discussing language identification and CoreNLP don't mention this capability either;
    3. The CoreNLP source contains Language classes, but nothing related to language identification. You can manually check all 84 occurrences of the word 'language' in it.

    Try Apache Tika, TextCat, or the Language Detection Library for Java (its authors report "over 99% precision for 53 languages").

    In general, quality depends on the length of the input text: if it is long enough (say, at least several words, and not specially chosen), precision can be pretty good, about 95%.
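
To make the point about input length concrete, here is a minimal toy sketch of character n-gram language identification. It is not how CoreNLP, Tika, or the Language Detection Library are implemented; the training samples and the names `SAMPLES`, `trigrams`, `score`, and `detect` are made up for illustration, and a real tool trains on far larger corpora.

```python
from collections import Counter

# Tiny, made-up training samples (an assumption for this sketch only).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is a "
          "sentence written in the english language",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "dies ist ein satz in der deutschen sprache",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et "
          "ceci est une phrase écrite en langue française",
}


def trigrams(text):
    """Count character trigrams after lowercasing and normalizing spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


# One profile per language: the set of trigrams seen in its sample.
PROFILES = {lang: set(trigrams(sample)) for lang, sample in SAMPLES.items()}


def score(text):
    """Return, per language, the fraction of input trigrams found in its profile."""
    grams = trigrams(text)
    total = sum(grams.values())
    if total == 0:
        return {lang: 0.0 for lang in PROFILES}
    return {
        lang: sum(n for g, n in grams.items() if g in profile) / total
        for lang, profile in PROFILES.items()
    }


def detect(text):
    """Return the best-scoring language code, or None if nothing matches."""
    scores = score(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(detect("this is the english language"))  # en
print(score("in"))  # "en" and "de" both score 1.0: too short to decide
```

With a full sentence the profiles separate clearly, but the single word "in" matches the English and German profiles equally well, which is why very short inputs are unreliable.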
