How to detect language of user entered text? [closed]

Submitted by て烟熏妆下的殇ゞ on 2019-12-27 11:39:13

Question


I am working on an application that accepts user input in different languages (currently a fixed set of three). The requirement is that users can simply enter text without having to select the language via a checkbox in the UI.

Is there an existing Java library to detect the language of a text?

I want something like this:

text = "To be or not to be thats the question."

// returns ISO 639 Alpha-2 code
language = detect(text);

print(language);

result:

EN

I don't want to know how to build a language detector myself (I have seen plenty of blogs trying to do that). The library should provide a simple API and work completely offline. It doesn't matter whether it is open source or commercial closed source.

I also found these questions on SO (and a few more):

How to detect language
How to detect language of text?


Answer 1:


Here are two options:

  • LanguageIdentifier
  • Rosette Language Identifier



Answer 2:


This Language Detection Library for Java should give more than 99% accuracy for 53 languages.

Alternatively, there is Apache Tika, a library for content analysis that offers much more than just language detection.




Answer 3:


Google offers an API that can do this for you. I just stumbled across this yesterday and didn't keep a link, but if you, umm, Google for it you should manage to find it.

This was somewhere near the description of their translation API, which will translate text for you into any language you like. There's another call just for guessing the input language.

Google is among the world's leaders in machine translation; they base their stuff on extremely large corpora of text (most of the Internet, kinda) and a statistical approach that usually "gets" it right simply by virtue of having a huge sample space.

EDIT: Here's the link: http://code.google.com/apis/ajaxlanguage/

EDIT 2: If you insist on "offline": a well-upvoted answer suggested Guess-Language. It's a C++ library and handles about 60 languages.
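
To give a rough feel for why a purely statistical approach can work offline at all, here is a deliberately crude sketch that just counts common function words per language. It is not how Google, Guess-Language, or any library mentioned in this thread works, and it is no substitute for them. The class name `ToyStopwordDetector`, the three languages (en, de, fr), and the tiny word lists are all assumptions made up for this illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

class ToyStopwordDetector {

    // Tiny hand-picked stopword lists (assumed for this sketch, far from complete).
    private static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();
    static {
        STOPWORDS.put("en", words("the", "to", "be", "or", "not", "and", "of", "is", "that"));
        STOPWORDS.put("de", words("der", "die", "das", "und", "ist", "nicht", "zu", "ein"));
        STOPWORDS.put("fr", words("le", "la", "les", "et", "est", "de", "un", "une", "pas"));
    }

    private static Set<String> words(String... w) {
        return new HashSet<>(Arrays.asList(w));
    }

    // Returns the lower-case code of the language whose stopwords occur most often,
    // or an empty Optional when no stopword matches at all.
    static Optional<String> detect(String text) {
        // Lower-case the text and split it on anything that is not a letter.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = null;
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            // Strictly greater: on a tie the language listed first wins.
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        System.out.println(detect("To be or not to be, that's the question.").orElse("unknown"));
    }
}
```

Running `main` prints `en`. With short input or closely related languages a count like this breaks down quickly, which is one reason to prefer one of the existing libraries.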




Answer 4:


The Detect Language API also provides a Java client.

Example:

List<Result> results = DetectLanguage.detect("Hello world");

Result result = results.get(0);

System.out.println("Language: " + result.language);
System.out.println("Is reliable: " + result.reliable);
System.out.println("Confidence: " + result.confidence);



Answer 5:


An alternative is JLangDetect, but it's not very robust and has a limited language base. The good thing is that it is under an Apache license, so if it satisfies your requirements, you can use it. Version 0.2 has been released here.

As of version 0.4 it is very robust. I have been using it in many projects of my own and never had any major problems. Also, when it comes to speed, it is comparable to very specialized language detectors (e.g., ones that handle only a few languages).




Answer 6:


Here is another option: Language Detection Library for Java.

This library is written in Java.




Answer 7:


Here is working code based on the existing solution from Cybozu Labs:

package com.et.generate;

import java.util.ArrayList;
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;
import com.cybozu.labs.langdetect.Language;

public class LanguageCodeDetection {

    // Load the language profiles once, before any detector is created.
    public void init(String profileDirectory) throws LangDetectException {
        DetectorFactory.loadProfile(profileDirectory);
    }

    // Return the code of the single most probable language, e.g. "ru".
    public String detect(String text) throws LangDetectException {
        Detector detector = DetectorFactory.create();
        detector.append(text);
        return detector.detect();
    }

    // Return all candidate languages together with their probabilities.
    public ArrayList<Language> detectLangs(String text) throws LangDetectException {
        Detector detector = DetectorFactory.create();
        detector.append(text);
        return detector.getProbabilities();
    }

    public static void main(String[] args) {
        try {
            LanguageCodeDetection ld = new LanguageCodeDetection();

            // Directory containing the unpacked language profiles.
            String profileDirectory = "C:/profiles/";
            ld.init(profileDirectory);

            String text = "Кремль россий";
            System.out.println(ld.detectLangs(text));
            System.out.println(ld.detect(text));
        } catch (LangDetectException e) {
            e.printStackTrace();
        }
    }

}

Output:
[ru:0.9999983255911719]
ru

Profiles can be downloaded from: https://language-detection.googlecode.com/files/langdetect-09-13-2011.zip
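
The question asks for an upper-case code such as EN and mentions only three fixed languages, so the candidate list from `detectLangs` may need a little post-processing. Below is a minimal sketch of that step, assuming each candidate is just a language code plus a probability. `Candidate` is a hypothetical stand-in for the library's `Language` objects so that the example runs without the library, and the supported set (en, de, ru) and the 0.8 threshold are arbitrary assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

class SupportedLanguagePicker {

    // Hypothetical stand-in for one (language code, probability) pair from a detector.
    static final class Candidate {
        final String lang;
        final double prob;

        Candidate(String lang, double prob) {
            this.lang = lang;
            this.prob = prob;
        }
    }

    // Picks the most probable candidate that is both supported by the application
    // and at least minProb likely; returns its code in upper case, or empty if none.
    static Optional<String> pick(List<Candidate> candidates, Set<String> supported, double minProb) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (!supported.contains(c.lang) || c.prob < minProb) {
                continue; // not one of our languages, or too uncertain
            }
            if (best == null || c.prob > best.prob) {
                best = c;
            }
        }
        if (best == null) {
            return Optional.empty();
        }
        return Optional.of(best.lang.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Assumed set of languages the application accepts.
        Set<String> supported = new HashSet<>(Arrays.asList("en", "de", "ru"));
        // Mirrors the output shown above: a single candidate, "ru".
        List<Candidate> candidates = Arrays.asList(new Candidate("ru", 0.9999983255911719));
        System.out.println(pick(candidates, supported, 0.8).orElse("unknown"));
    }
}
```

When the result is empty, the application could fall back to asking the user to pick the language explicitly instead of guessing.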



Source: https://stackoverflow.com/questions/3227524/how-to-detect-language-of-user-entered-text
