Is there an easy way to generate a probable list of words from an unspaced sentence in Python?

本小妞迷上赌 posted on 2019-12-12 10:33:24

Question


I have some text:

 s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"

I'd like to parse this into its individual words. I quickly looked into enchant and nltk, but didn't see anything that looked immediately useful. If I had time to invest in this, I'd look into writing a dynamic program using enchant's ability to check whether a word is English or not. I would have thought there'd be something to do this online; am I wrong?


Answer 1:


Greedy approach using trie

Try this using Biopython (`pip install biopython`). Note that the `Bio.trie` module was removed in Biopython 1.77, so you will need an older release:

from Bio import trie
import string


def get_trie(dictfile='/usr/share/dict/american-english'):
    """Build a trie mapping each dictionary word to its length."""
    tr = trie.trie()
    with open(dictfile) as f:
        for line in f:
            word = line.rstrip()
            try:
                # Skip entries with accented/non-ASCII characters.
                word.encode('ascii')
            except UnicodeEncodeError:
                continue
            tr[word] = len(word)
    return tr


def get_trie_word(tr, s):
    """Return the longest dictionary prefix of s and the remainder."""
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    return None, s

def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        if word is None:  # no dictionary prefix matched; stop rather than loop forever
            break
        print(word)

if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)

Results

image
classification
methods
can
be
roughly
divided
into
two
broad
families
of
approaches

Caveats

There are degenerate cases in English that this will not work for. You need to use backtracking to deal with those, but this should get you started.
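The backtracking mentioned above can be sketched without a trie, using memoized recursion over suffixes. This is a toy illustration, not part of the answer's code: the tiny word set below is deliberately artificial so that the greedy longest-prefix choice dead-ends and backtracking has to recover.

```python
def segment(s, words, memo=None):
    """Segment s into dictionary words, longest-prefix first,
    backtracking to shorter prefixes when a choice dead-ends."""
    if memo is None:
        memo = {}
    if s in memo:
        return memo[s]
    if not s:
        return []
    result = None
    for end in range(len(s), 0, -1):      # try longer prefixes first
        prefix = s[:end]
        if prefix in words:
            rest = segment(s[end:], words, memo)
            if rest is not None:          # the remainder segmented too
                result = [prefix] + rest
                break
    memo[s] = result
    return result

# Greedy takes "abc" and is stuck with "d"; backtracking finds "ab" + "cd".
toy = {"ab", "abc", "cd"}
print(segment("abcd", toy))               # -> ['ab', 'cd']
```

With a real word list (e.g. `/usr/share/dict/american-english` loaded into a set), the same function handles the degenerate cases the greedy trie walk cannot.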

Obligatory test

>>> main("expertsexchange")
experts
exchange



Answer 2:


This sort of problem occurs often in Asian NLP. If you have a dictionary, then you can use http://code.google.com/p/mini-segmenter/ (disclaimer: I wrote it, hope you don't mind).

Note that the search space may be extremely large, because words in alphabetic English are far longer, in characters, than syllabic Chinese/Japanese units.
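To make the search space concrete, dictionary segmentation can be written as a short dynamic program that considers every split point. This is a minimal sketch with a toy word set standing in for a real dictionary, not mini-segmenter's actual code; `best[i]` holds a fewest-words segmentation of the first `i` characters:

```python
def segment_dp(s, words):
    """best[i] is a fewest-words segmentation of s[:i], or None."""
    best = [None] * (len(s) + 1)
    best[0] = []                              # empty prefix: empty segmentation
    for i in range(1, len(s) + 1):
        for j in range(i):                    # every candidate last word s[j:i]
            if best[j] is not None and s[j:i] in words:
                cand = best[j] + [s[j:i]]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[len(s)]

toy = {"experts", "expert", "sex", "exchange", "change"}
print(segment_dp("expertsexchange", toy))     # -> ['experts', 'exchange']
```

The nested loops examine O(n²) substrings of the input, which is why long alphabetic strings blow up faster than short syllabic ones.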



Source: https://stackoverflow.com/questions/15364975/is-there-an-easy-way-generate-a-probable-list-of-words-from-an-unspaced-sentence
