Is there an easy way to generate a probable list of words from an unspaced sentence in Python?

本小妞迷上赌 posted on 2019-12-12 10:33:24

Question


I have some text:

 s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"

I'd like to parse this into its individual words. I quickly looked into enchant and nltk, but didn't see anything that looked immediately useful. If I had time to invest in this, I'd look into writing a dynamic program using enchant's ability to check whether a word is English or not. I would have thought there'd be something to do this online; am I wrong?


Answer 1:


Greedy approach using trie

Try this using Biopython (`pip install biopython`). Note that the `Bio.trie` module was removed in Biopython 1.77, so you will need an older release:

from Bio import trie
import string


def get_trie(dictfile='/usr/share/dict/american-english'):
    """Build a trie mapping each dictionary word to its length."""
    tr = trie.trie()
    with open(dictfile) as f:
        for line in f:
            word = line.rstrip()
            try:
                # Skip entries with accented/non-ASCII characters.
                word.encode('ascii')
            except UnicodeEncodeError:
                continue
            tr[word] = len(word)
    return tr


def get_trie_word(tr, s):
    """Return the longest dictionary prefix of s and the remainder."""
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    return None, s

def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        if word is None:  # no dictionary prefix matched; stop rather than loop forever
            break
        print(word)

if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)

Results

image
classification
methods
can
be
roughly
divided
into
two
broad
families
of
approaches

Caveats

There are degenerate cases in English that this will not work for. You need to use backtracking to deal with those, but this should get you started.
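The backtracking mentioned above can be sketched without a trie, using memoized recursion over suffixes. This is a toy illustration, not part of the answer's code: the tiny word set below is deliberately artificial so that the greedy longest-prefix choice dead-ends and backtracking has to recover.

```python
def segment(s, words, memo=None):
    """Segment s into dictionary words, longest-prefix first,
    backtracking to shorter prefixes when a choice dead-ends."""
    if memo is None:
        memo = {}
    if s in memo:
        return memo[s]
    if not s:
        return []
    result = None
    for end in range(len(s), 0, -1):      # try longer prefixes first
        prefix = s[:end]
        if prefix in words:
            rest = segment(s[end:], words, memo)
            if rest is not None:          # the remainder segmented too
                result = [prefix] + rest
                break
    memo[s] = result
    return result

# Greedy takes "abc" and is stuck with "d"; backtracking finds "ab" + "cd".
toy = {"ab", "abc", "cd"}
print(segment("abcd", toy))               # -> ['ab', 'cd']
```

With a real word list (e.g. `/usr/share/dict/american-english` loaded into a set), the same function handles the degenerate cases the greedy trie walk cannot.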

Obligatory test

>>> main("expertsexchange")
experts
exchange



Answer 2:


This sort of problem occurs often in Asian NLP. If you have a dictionary, then you can use http://code.google.com/p/mini-segmenter/ (disclaimer: I wrote it, hope you don't mind).

Note that the search space may be extremely large, because words in alphabetic English are far longer, in characters, than syllabic Chinese/Japanese units.
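To make the search space concrete, dictionary segmentation can be written as a short dynamic program that considers every split point. This is a minimal sketch with a toy word set standing in for a real dictionary, not mini-segmenter's actual code; `best[i]` holds a fewest-words segmentation of the first `i` characters:

```python
def segment_dp(s, words):
    """best[i] is a fewest-words segmentation of s[:i], or None."""
    best = [None] * (len(s) + 1)
    best[0] = []                              # empty prefix: empty segmentation
    for i in range(1, len(s) + 1):
        for j in range(i):                    # every candidate last word s[j:i]
            if best[j] is not None and s[j:i] in words:
                cand = best[j] + [s[j:i]]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[len(s)]

toy = {"experts", "expert", "sex", "exchange", "change"}
print(segment_dp("expertsexchange", toy))     # -> ['experts', 'exchange']
```

The nested loops examine O(n²) substrings of the input, which is why long alphabetic strings blow up faster than short syllabic ones.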



Source: https://stackoverflow.com/questions/15364975/is-there-an-easy-way-generate-a-probable-list-of-words-from-an-unspaced-sentence
