Detect most likely words from text without spaces / combined words

安稳与你 提交于 2019-11-26 14:04:49

问题


Is there a good library can detect and split words from a combined string?

Example:

"cdimage" -> ["cd", "image"]
"filesaveas" -> ["file", "save", "as"]

回答1:


Here's a dynamic programming solution (implemented as a memoized function). Given a dictionary of words with their frequencies, it splits the input text at the positions that give the overall most likely phrase. You'll have to find a real wordlist, but I included some made-up frequencies for a simple test.

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555
}

def split_text(text, word_frequencies, cache):
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best_freq, best_split = 0, []
    for i in xrange(1, len(text) + 1):
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word, None)
        if freq:
            remainder_freq, remainder = split_text(
                    remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder
    cache[text] = (best_freq, best_split)
    return cache[text]

print split_text('filesaveas', WORD_FREQUENCIES, {})

--> (1.3653e-08, ['file', 'save', 'as'])



回答2:


I don't know of any library for it, but it shouldn't be hard to implement basic functionality.

  1. Get a words list, like UNIX's words.
  2. Stuff the contents of your word list into a trie.
  3. Take the string you want to split and follow its path in the trie. Each time you reach a valid word, create a new branch that searches for a word starting from the point of the string that you got to. Once you finish your current branch, backtrack to the one you created, like in a depth first search.
  4. Disambiguate the resulting lists manually, using heuristics or through a natural language parser.

Example:

  1. Word: "filesaveasstring"
  2. First valid word is "file". Try matching "saveas". First valid word is "save". Try matching "asstring". First valid word is "as". Try matching "string". First valid word is "string". Matched until end; put the [file save as string] into your results list.
  3. Backtrack to matching "string" - no other possibilities. Backtrack to "asstring". First unvisited valid word is "ass". Try matching "tring". No possible matches. Backtrack to "asstring". No possible matches. Backtrack to "filesaveasstring".
  4. First unvisited match is "files". Try to match "aveasstring". First match is "ave". Try matching "asstring" (same results as steps 2/3), adding [files ave as string] to your results list and backtrack to the start.
  5. Try matching "filesaveasstring". No unvisited matches. Done.
  6. Select the most likely from [[file save as string] [files ave as string]] using a heuristic or a natural language parser.



回答3:


Get people to solve them as a captcha on your website :)




回答4:


I don't know a library that does this, but it's not too hard to write if you have a list of words:

wordList = file('words.txt','r').read().split()
words = set( s.lower() for s in wordList )

def splitString(s):
    found = []

    def rec(stringLeft, wordsSoFar):
        if not stringLeft:
            found.append(wordsSoFar)
        for pos in xrange(1, len(stringLeft)+1):
            if stringLeft[:pos] in words:
                rec(stringLeft[pos:], wordsSoFar + [stringLeft[:pos]])

    rec(s.lower(), [])
    return found

This will return all possible ways to split the string into the given words.

Example:

>>> splitString('filesaveas')
[['file', 'save', 'as'], ['files', 'ave', 'as']]



回答5:


if you are not doing this for fun, but is actually doing something for work etc, my advice is to tackle this at the source. Why do you have these strings combined like that? Where did you get those strings? If its possible, insert spaces at the source of where those strings come from.




回答6:


I know this question is marked for Python but I needed a JavaScript implementation. Going off of the previous answers I figured I'd share my code. Seems to work decently.

function findWords(input){
    input = input.toLowerCase().replace(/\s/g, ""); //Strip whitespace

    var index = 0;
    var validWords = [];
    for (var len = input.length; len > 0; len--){ //Go backwards as to favor longer words
        var testWord = input.substr(index, len);
        var dictIndex = _dictionary.indexOf(testWord.replace(/[^a-z\']/g, "")); //Remove non-letters
        if (dictIndex != -1){
            validWords.push(testWord);
            if (len == input.length){
                break; //We are complete
            }
            var nextWords = findWords(input.substr(len, input.length - len)); //Recurse
            if (!nextWords.words.length){ //No further valid words
                validWords.pop();
            }
            validWords = validWords.concat(nextWords.words);
            if (nextWords.complete === true){
                break; //Cascade complete
            }
        }
    }
    return {
        complete:len > 0, //We broke which indicates completion
        words:validWords
    };
}

Note: "_dictionary" is expected to be an array of words sorted by frequency. I am using a wordlist from Project Gutenberg.



来源:https://stackoverflow.com/questions/2174093/detect-most-likely-words-from-text-without-spaces-combined-words

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!