ഇതുഒരുസ്ടലംമാണ്
itu oru stalam anu
This is a Unicode string meaning this is a place
import nltk
nltk.w
I tried the following:
# encoding=utf-8
import nltk
cheese = nltk.wordpunct_tokenize('ഇതുഒരുസ്ഥാലമാണ്'.decode('utf8'))
for var in cheese:
print var.encode('utf8'),
And as output, I got the following:
ഇത ു ഒര ു സ ് ഥ ാ ലമ ാ ണ ്
Is this anywhere close to the output that you want, I'm a little in the dark here, since its difficult to get this right without understanding the language.
A tokenizer is indeed the right tool; certainly this is what the NLTK calls them. A morphological analyzer (as in the article you link to) is for breaking words into smaller parts (morphemes). But in your example code, you tried to use a tokenizer that is appropriate for English: It recognizes space-delimited words and punctuation tokens. Since Malayalam evidently doesn't indicate word boundaries with spaces, or with anything else, you need a different approach.
So the NLTK doesn't provide anything that detects word boundaries for Malayalam. It might provide the tools to build a decent one fairly easily, though.
The obvious approach would be to try dictionary lookup: Try to break up your input into strings that are in the dictionary. But it would be harder than it sounds: You'd need a very large dictionary, you'd still have to deal with unknown words somehow, and since Malayalam has non-trivial morphology, you may need a morphological analyzer to match inflected words to the dictionary. Assuming you can store or generate every word form with your dictionary, you can use an algorithm like the one described here (and already mentioned by @amp) to divide your input into a sequence of words.
A better alternative would be to use a statistical algorithm that can guess where the word boundaries are. I don't know of such a module in the NLTK, but there has been quite a bit of work on this for Chinese. If it's worth your trouble, you can find a suitable algorithm and train it to work on Malayalam.
In short: The NLTK tokenizers only work for the typographical style of English. You can train a suitable tool to work on Malayalam, but the NLTK does not include such a tool as far as I know.
PS. The NLTK does come with several statistical tokenization tools; the PunctSentenceTokenizer can be trained to recognize sentence boundaries using an unsupervised learning algorithm (meaning you don't need to mark the boundaries in the training data). Unfortunately, the algorithm specifically targets the issue of abbreviations, and so it cannot be adapted to word boundary detection.
It seems like your space is the unicode character u'\u0d41'
. So you should split normally with str.split()
.
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
x = 'ഇതുഒരുസ്ഥാലമാണ്'.decode('utf8')
y = x.split(u'\u0d41')
print " ".join(y)
[out]:
ഇത ഒര സ്ഥാലമാണ്`
After a crash course of the language from wikipedia (http://en.wikipedia.org/wiki/Malayalam), there are some issues in your question and the tools you've requested for your desired output.
Conflated Task
Firstly, the OP conflated the task of morphological analysis, segmentation and tokenization. Often there is a fine distinction especially for aggluntinative languages such as Turkish/Malayalam (see http://en.wikipedia.org/wiki/Agglutinative_language).
Agglutinative NLP and best practices
Next, I don't think tokenizer
is appropriate for Malayalam, an agglutinative language. One of the most studied aggluntinative language in NLP, Turkish have adopted a different strategy when it comes to "tokenization", they found that a full blown morphological analyzer is necessary (see http://www.denizyuret.com/2006/11/turkish-resources.html, www.andrew.cmu.edu/user/ko/downloads/lrec.pdf).
Word Boundaries
Tokenization is defined as the identification of linguistically meaningful units (LMU) from the surface text (see Why do I need a tokenizer for each language?) And different language would require a different tokenizer to identify the word boundary of different languages. Different people have approach the problem for finding word boundary different but in summary in NLP people have subscribed to the following:
Agglutinative Languages requires a full blown morphological analyzer trained with some sort of language models. There is often only a single tier when identifying what is token
and that is at the morphemic level hence the NLP community had developed different language models for their respective morphological analysis tools.
Polysynthetic Languages with specified word boundary has the choice of a two tier tokenization
where the system can first identify an isolated word and then if necessary morphological analysis should be done to obtain a finer grain tokens. A coarse grain tokenizer can split a string using certain delimiter (e.g. NLTK's word_tokenize
or punct_tokenize
which uses whitespaces/punctuation for English). Then for finer grain analysis at morphemic level, people would usually use some finite state machines to split words up into morpheme (e.g. in German http://canoo.net/services/WordformationRules/Derivation/To-N/N-To-N/Pre+Suffig.html)
Polysynthetic Langauges without specified word boundary often requires a segmenter first to add whitespaces between the tokens because the orthography doesn't differentiate word boundaries (e.g. in Chinese https://code.google.com/p/mini-segmenter/). Then from the delimited tokens, if necessary, morphemic analysis can be done to produce finer grain tokens (e.g. http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html). Often this finer grain tokens are tied with POS tags.
The answer in brief to OP's request/question, the OP had used the wrong tools for the task:
tokens
for Malayalam, a morphological analyzer is necessary, simple coarse grain tokenizer in NLTK would not work.re.split()
to achieve a baseline tokenizer.Morphological analysis example
from mlmorph import Analyser
analyser = Analyser()
analyser.analyse("കേരളത്തിന്റെ")
Gives
[('കേരളം<np><genitive>', 179)]
url: mlmorph
if you using anaconda then: install git in anaconda prompt
conda install -c anaconda git
then clone the file using following command:
git clone https://gitlab.com/smc/mlmorph.git
maybe the Viterbi algorithm could help?
This answer to another SO question (and the other high-vote answer) could help: https://stackoverflow.com/a/481773/583834