A Viable Solution for Word Splitting Khmer?

问题

I am working on a solution to split long lines of Khmer (the Cambodian language) into individual words (in UTF-8). Khmer does not use spaces between words. There are a few solutions out there, but they are far from adequate (here and here), and those projects have fallen by the wayside.

Here is a sample line of Khmer that needs to be split (they can be longer than this):

ចូរសរសើរដល់ទ្រង់ដែលទ្រង់បានប្រទានការទាំងអស់នោះមកដល់រូបអ្នកដោយព្រោះអង្គព្រះយេស៊ូវ ហើយដែលអ្នកមិនអាចរកការទាំងអស់នោះដោយសារការប្រព្រឹត្តរបស់អ្នកឡើយ។

The goal of creating a viable solution that splits Khmer words is twofold: it will encourage those who used Khmer legacy (non-Unicode) fonts to convert over to Unicode (which has many benefits), and it will enable legacy Khmer fonts to be imported into Unicode to be used with a spelling checker quickly (rather than manually going through and splitting words which, with a large document, can take a very long time).

I don't need 100% accuracy, but speed is important (especially since the line that needs to be split into Khmer words can be quite long). I am open to suggestions, but currently I have a large corpus of Khmer words that are correctly split (with a non-breaking space), and I have created a word probability dictionary file (frequency.csv) to use as a dictionary for the word splitter.

I found this python code here that uses the Viterbi algorithm and it supposedly runs fast.

import re
from itertools import groupby

def viterbi_segment(text):
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary.get(word, 0) / total
def words(text): return re.findall('[a-z]+', text.lower()) 
dictionary = dict((w, len(list(ws)))
                  for w, ws in groupby(sorted(words(open('big.txt').read()))))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))

I also tried using the source java code from the author of this page: Text segmentation: dictionary-based word splitting but it ran too slow to be of any use (because my word probability dictionary has over 100k terms...).

And here is another option in python from Detect most likely words from text without spaces / combined words:

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555
}

def split_text(text, word_frequencies, cache):
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best_freq, best_split = 0, []
    for i in xrange(1, len(text) + 1):
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word, None)
        if freq:
            remainder_freq, remainder = split_text(
                    remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder
    cache[text] = (best_freq, best_split)
    return cache[text]

print split_text('filesaveas', WORD_FREQUENCIES, {})

--> (1.3653e-08, ['file', 'save', 'as'])

I am a newbee when it comes to python and I am really new to all real programming (outside of websites), so please bear with me. Does anyone have any options that they feel would work well?

回答1:

The ICU library (that has Python and Java bindings) has a DictionaryBasedBreakIterator class that can be used for this.

回答2:

The python with example filesaveas appears to recurse through the entire input string (for i in xrange(1, len(text) + 1)), stuffing the best results into the cache along the way; at each potential word, it then starts looking at the next word (which will in turn look at the word after that, and so on), and if that second word doesn't look very good, it won't save that particular one. It feels like O(N!) runtime, where N is the length of the input string.

Super clever, but probably horrible for anything but simple tasks. What's the longest Khmer word you've got? I'm hoping < 20 characters.

Maybe if you feed input into that example 20 characters at a time you can keep the runtime down to something approaching reasonable. Feed in the first 20 chars, suck off the first word, and then feed in the remaining input. If you re-use the cache it might do something silly like store partial words along the way.

On a completely different tack, how many Khmer words are formed by concatenating two or more legal Khmer words? (similar to 'penknife' or 'basketball') If not too many, it might make sense to create a set of dictionaries, segregated by length of word, mapping from word to probability of use.

Say, the longest Khmer word is 14 chars long; feed in 14 characters of input into the len14 dictionary, store the probability. Feed in 13 characters into len13, store the probability. Feed in 12 characters ... all the way down to 1 into len1. Then pick the interpretation with the highest probability, save the word, strip off that many characters, and try again.

So it won't fail badly for inputs like "I" vs "Image", maybe longer inputs should have automatically inflated probabilities?

Thanks for the fun question ;) I didn't know of any languages like this, pretty cool.

回答3:

I think this is a good idea, as it is.

I suggest you, when you have some experience with it, you add some rules, that can be very specific, for example, depending on word before, depending on word after, depending on surrounding words, depending on a sequence of words before the current word, just to enumerate the most frequent ones. You can find a set of rules in gposttl.sf.net project, which is a pos tagging project, in the file data/contextualrulefile.

Rules should be used AFTER the statistics evaluation is finished, they make some fine tuning, and can improve accuracy remarkably.

来源：https://stackoverflow.com/questions/4861619/a-viable-solution-for-word-splitting-khmer

标签

python

nlp

word-boundary

text-segmentation

southeast-asian-languages