Justadistraction: tokenizing English without whitespaces. Murakami SheepMan

妖精的绣舞 · Submitted 2019-12-05 04:46:54

I actually did something like this for work about eight months ago. I just used a dictionary of English words in a hashtable (for O(1) lookup times). I'd go letter by letter matching whole words. It works well, but there are numerous ambiguities. (asshit can be ass hit or as shit). To resolve those ambiguities would require much more sophisticated grammar analysis.
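The ambiguity described above is easy to reproduce. Here is a minimal sketch (not the original work code) that enumerates every way a string can be split into dictionary words, using a small hypothetical word set:

```python
def segmentations(text, dictionary):
    """Recursively enumerate every way to split `text` into dictionary words."""
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in dictionary:  # set membership gives the O(1) lookup
            for rest in segmentations(text[i:], dictionary):
                results.append([prefix] + rest)
    return results

# Tiny stand-in dictionary; a real one would be loaded from a word-list file.
words = {"as", "ass", "shit", "hit"}
print(segmentations("asshit", words))  # [['as', 'shit'], ['ass', 'hit']]
```

Both segmentations are returned, which is exactly why picking the right one needs more than a dictionary.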

First of all, I think you need a dictionary of English words -- you could try some methods that rely solely on some statistical analysis, but I think a dictionary has better chances of good results.

Once you have the words, you have two possible approaches:

You could categorize the words into grammar categories and use a formal grammar to parse the sentences -- obviously, you would sometimes get no match or multiple matches -- I'm not familiar with techniques that would allow you to loosen the grammar rules in case of no match, but I'm sure there must be some.

On the other hand, you could just take some large corpus of English text and compute relative probabilities of certain words being next to each other -- getting a list of pair and triples of words. Since that data structure would be rather big, you could use word categories (grammatical and/or based on meaning) to simplify it. Then you just build an automaton and choose the most probable transitions between the words.
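Choosing the most probable path through such an automaton can be done with dynamic programming. The sketch below uses unigram log-probabilities for brevity; the pair/triple counts described above would slot in as transition weights instead. The probability values are made up for illustration:

```python
import math

def best_segmentation(text, word_logprob):
    """Most probable split of `text` under a unigram model, via DP.
    best[i] holds (score, words) for the best segmentation of text[:i]."""
    best = [(-math.inf, [])] * (len(text) + 1)
    best[0] = (0.0, [])
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - 20), i):  # cap candidate word length at 20
            word = text[j:i]
            if word in word_logprob and best[j][0] > -math.inf:
                score = best[j][0] + word_logprob[word]
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[len(text)][1]

# Hypothetical log-probabilities, e.g. derived from corpus counts.
logp = {"as": math.log(0.3), "ass": math.log(0.05),
        "shit": math.log(0.02), "hit": math.log(0.1)}
print(best_segmentation("asshit", logp))  # ['as', 'shit']
```

With these made-up numbers, "as shit" (0.3 × 0.02) beats "ass hit" (0.05 × 0.1), so the DP resolves the ambiguity that plain dictionary lookup cannot.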

I am sure there are many more possible approaches. You can even combine the two I mentioned, building some kind of grammar with weight attached to its rules. It's a rich field for experimenting.

I don't know if this is of much help to you, but you might be able to make use of this spelling corrector in some way.

This is just some quick code I wrote that I think would work fairly well to extract words from a snippet like the one you gave. It's not fully thought out, but I think something along these lines would work if you can't find a pre-packaged solution.

textstring = ("\"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant,\" "
              "said the Sheep Man. \"Butwecan'tdoit-alone. Yougottaworktoo.\"")

indiv_characters = list(textstring)  # splits the string into individual characters

teststring = ''
sequential_indiv_word_list = []

for cur_char in indiv_characters:
    teststring = teststring + cur_char
    # Test teststring against an English dictionary here; english_dict stands
    # in for whatever word list or dictionary API you have available.
    if teststring in english_dict:
        sequential_indiv_word_list.append(teststring)
        teststring = ''

# At the end, just assemble a sentence from the pieces of
# sequential_indiv_word_list by putting a space between each word:
result = ' '.join(sequential_indiv_word_list)

There are some more issues to be worked out. For example, if the loop never finds a match, this approach obviously fails: it would just keep adding characters forever. However, since your demo string contains some spaces, you could have it recognize those too and automatically start over at each one.

Also, you need to account for punctuation; write conditionals like

if cur_char == ',' or cur_char == '.':
    # start a new "word" automatically