I am working on a solution to split long lines of Khmer (the Cambodian language) into individual words (in UTF-8). Khmer does not use spaces between words. There are a few
The ICU library (which has Python and Java bindings) provides a DictionaryBasedBreakIterator class that can be used for this.
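
For reference, here is a rough sketch of how this might look from Python. It assumes the PyICU package (imported as `icu`) is installed, and it goes through `BreakIterator.createWordInstance` with a Khmer locale rather than constructing `DictionaryBasedBreakIterator` directly; as far as I understand, ICU applies its dictionary-based breaking to Khmer behind that factory, but I have not confirmed this for every ICU version. The names `slice_at_boundaries` and `khmer_words` are my own, not part of ICU.

```python
def slice_at_boundaries(text, boundaries):
    """Cut text into pieces at the given ascending boundary offsets."""
    pieces = []
    start = 0
    for end in boundaries:
        if end > start:  # skip empty or repeated boundaries
            pieces.append(text[start:end])
            start = end
    if start < len(text):  # keep any tail after the last boundary
        pieces.append(text[start:])
    return pieces


def khmer_words(text):
    """Split Khmer text with ICU's word break iterator (requires PyICU)."""
    # Imported lazily so the helper above works without PyICU installed.
    from icu import BreakIterator, Locale

    bi = BreakIterator.createWordInstance(Locale("km"))
    bi.setText(text)
    # Iterating a PyICU BreakIterator yields successive boundary offsets.
    # ICU counts UTF-16 code units; Khmer letters are in the BMP, so the
    # offsets line up with Python string indices for plain Khmer text.
    return slice_at_boundaries(text, bi)
```

Calling `khmer_words(line)` on a decoded line (for example `raw_bytes.decode("utf-8")`) should return a list of word strings; the quality of the split depends on the dictionary shipped with the installed ICU.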