Question
I'm working on spell checking of mixed language webpages, and haven't been able to find any existing research on the subject.
The aim is to detect the language at sentence level within mixed-language webpages and spell check each sentence against its appropriate language automatically. Assume that we can ignore sentences which mix multiple languages together (e.g. "He has a certain je ne sais quoi"), and assume webpages can't contain more than 2 or 3 languages.
Trivial example (Welsh + English): http://wales.gov.uk/
I'm currently using a mix of:
- Character distribution (e.g. U+0600-U+06FF = Arabic, etc.)
- n-grams to discern languages with similar characters
- Dictionary lookup to discern locale, e.g. en-US vs. en-GB
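The character-distribution heuristic in the list above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the `SCRIPT_RANGES` table and the functions `script_of` and `dominant_script` are hypothetical names, and the table covers only a handful of Unicode blocks.

```python
from collections import Counter

# Illustrative subset of Unicode blocks (assumption: a real system covers many more).
SCRIPT_RANGES = {
    "Arabic": (0x0600, 0x06FF),
    "Cyrillic": (0x0400, 0x04FF),
    "Greek": (0x0370, 0x03FF),
    "Hebrew": (0x0590, 0x05FF),
}


def script_of(ch):
    """Return the script name for one character, or None if it is not a known letter."""
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    # Basic Latin through Latin Extended-B ends at U+024F.
    if ch.isalpha() and cp < 0x0250:
        return "Latin"
    return None


def dominant_script(sentence):
    """Return the most frequent script among the letters of a sentence."""
    counts = Counter(s for s in map(script_of, sentence) if s)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    for s in ["Hello world", "مرحبا بالعالم", "Привет, мир"]:
        print(s, "->", dominant_script(s))
```

Script detection only narrows the candidates: languages that share a script, such as Welsh and English, still need n-grams or dictionaries to tell them apart.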
I have working code but am concerned it may be naive or needlessly re-inventing a wheel. Has anyone else done this before?
Answer 1:
You can use an API (Google or Yandex) for spell checking and language detection, but I don't think this option scales well.
Another option is to use the free Lucene tools for spell checking (http://wiki.apache.org/lucene-java/SpellChecker), but you have to index some corpora first; Wikipedia is a good choice. Language detection can be achieved with http://textcat.sourceforge.net/
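TextCat-style detectors compare ranked character n-gram profiles using an "out-of-place" distance. Below is a minimal sketch of that idea, assuming tiny hand-written training samples. The `SAMPLES` texts and all function names are illustrative and are not TextCat's API; real profiles would be built from much larger corpora.

```python
from collections import Counter

# Hypothetical toy training texts; real profiles need far more data.
SAMPLES = {
    "en": "the charter of the french language is a law in the province of quebec",
    "fr": "la charte de la langue française est une loi définissant les droits du québec",
}


def ngram_profile(text, n_max=3, top=300):
    """Return character n-grams (length 1..n_max) ranked by frequency, most common first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get a maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def detect(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

if __name__ == "__main__":
    print(detect("the language of the province", PROFILES))
```

With samples this small the result on unseen sentences is unreliable. The point is only the shape of the computation.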
Answer 2:
With the LanguageTool library (http://www.languagetool.org) you can select the languages you need and have the content checked against that set of languages. E.g. for a French/English website you'd check the text against English and French. Obviously there will be more errors when you check against the wrong language.
Example:
If you check, for example, the French text from http://fr.wikipedia.org/wiki/Charte_de_la_langue_fran%C3%A7aise:
La Charte de la langue française (communément appelée la loi 101) est
une loi définissant les droits linguistiques de tous les citoyens du
Québec et faisant du français la langue officielle du Québec.
on http://www.languagetool.org it will show no errors for French and more than 20 errors for English/GB.
The corresponding English text:
The Charter of the French Language (French: La charte de la langue française), also
known as Bill 101 (Law 101 or French: Loi 101), is a law in the province of Quebec
in Canada defining French, the language of the majority of the population, as the
official language of Quebec and framing fundamental language rights. It is the central
legislative piece in Quebec's language policy.
will show 4 errors for English/GB (due to the French citation) and more than 20 errors when you check it against French.
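The "fewest errors wins" idea can be sketched without any particular checker. The snippet below is a rough illustration that stands in toy word lists for a real spell checker: `WORDLISTS`, `count_errors`, and `best_language` are hypothetical names, and in practice the error count per language would come from a tool such as LanguageTool.

```python
import re

# Hypothetical toy word lists standing in for real per-language checkers.
WORDLISTS = {
    "en": {"the", "charter", "of", "french", "language", "is", "a", "law", "in", "quebec"},
    "fr": {"la", "charte", "de", "langue", "française", "est", "une", "loi", "du", "québec"},
}


def count_errors(sentence, words):
    """Count tokens (runs of letters) that are not in the given word list."""
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    return sum(1 for token in tokens if token not in words)


def best_language(sentence, wordlists=WORDLISTS):
    """Check the sentence against every language and pick the one with the fewest errors."""
    errors = {lang: count_errors(sentence, words) for lang, words in wordlists.items()}
    return min(errors, key=errors.get), errors


if __name__ == "__main__":
    for s in [
        "La charte de la langue française est une loi du Québec",
        "The charter of the French language is a law in Quebec",
    ]:
        print(best_language(s))
```

Once the best language is chosen per sentence, the errors reported for that language are the ones worth showing to the user. Checking every sentence against every language costs more, which is acceptable when a page has only 2 or 3 candidate languages.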
Source: https://stackoverflow.com/questions/5873601/multilingual-spell-checking-with-language-detection