Split a string to a string of valid words using Dynamic Programming

I need to find a dynamic programming algorithm to solve this problem. I tried but couldn't figure it out. Here is the problem:

You are given a string of n characters s[1...n], which you believe to be a corrupted text document in which all punctuation has vanished (so that it looks something like "itwasthebestoftimes..."). You wish to reconstruct the document using a dictionary, which is available in the form of a Boolean function dict(*) such that, for any string w, dict(w) has value 1 if w is a valid word, and has value 0 otherwise.

Give a dynamic programming algorithm that determines whether the string s[*] can be reconstituted as a sequence of valid words. The running time should be at most O(n^2), assuming that each call to dict takes unit time.
In the event that the string is valid, make your algorithm output the corresponding sequence of words.

Let the length of your compacted document be N.

Let b(n) be a boolean: true if the document can be split into words starting from position n in the document.

b(N) is true (since the empty string can be split into 0 words). Given b(N), b(N - 1), ... b(N - k), you can construct b(N - k - 1) by considering all words that start at character N - k - 1. If there's any such word, w, with b(N - k - 1 + len(w)) set, then set b(N - k - 1) to true. If there's no such word, then set b(N - k - 1) to false.

Eventually, you compute b(0) which tells you if the entire document can be split into words.

In pseudo-code:

def try_to_split(doc):
  N = len(doc)
  b = [False] * (N + 1)
  b[N] = True
  for i in range(N - 1, -1, -1):
    for word starting at position i:
      if b[i + len(word)]:
        b[i] = True
        break
  return b

There's some tricks you can do to get 'word starting at position i' efficient, but you're asked for an O(N^2) algorithm, so you can just look up every string starting at i in the dictionary.

To generate the words, you can either modify the above algorithm to store the good words, or just generate it like this:

def generate_words(doc, b, idx=0):
  length = 1
  while true:
    assert b(idx)
    if idx == len(doc): return
    word = doc[idx: idx + length]
    if word in dictionary and b(idx + length):
       output(word)
       idx += length
       length = 1

Here b is the boolean array generated from the first part of the algorithm.

To formalize what @MinhPham suggested.

This is a dynammic programming solution.

Given a string str, let

b[i] = true if the substring str[0...i] (inclusive) can be split into valid words.

Prepend some starting character to str, say !, to represent the empty word. str = "!" + str

The base case is the empty string, so

b[0] = true.

For the iterative case:

b[j] = true if b[i] == true and str[i..j] is a word for all i < j

The O(N^2) Dp is clear but if you know the words of the dictionary, i think you can use some precomputations to get it even faster in O(N). Aho-Corasick

A dp solution in c++:

int main()
{
    set<string> dict;
    dict.insert("12");
    dict.insert("123");
    dict.insert("234");
    dict.insert("12345");
    dict.insert("456");
    dict.insert("1234");
    dict.insert("567");
    dict.insert("123342");
    dict.insert("42");
    dict.insert("245436564");
    dict.insert("12334");

    string str = "123456712334245436564";

    int size = str.size();
    vector<int> dp(size+1, -1);
    dp[0] = 0;
    vector<string > res(size+1);
    for(int i = 0; i < size; ++i)
    {
        if(dp[i] != -1)
        {
            for(int j = i+1; j <= size; ++j)
            {
                const int len = j-i;
                string substr = str.substr(i, len);
                if(dict.find(substr) != dict.end())
                {
                    string space = i?" ":"";
                    res[i+len] = res[i] + space + substr;
                    dp[i+len] = dp[i]+1;
                }
            }
        }
    }
    cout << *dp.rbegin() << endl;
    cout << *res.rbegin() << endl;

    return 0;
}

The string s[] can potentially be split into more than one ways. The method below finds the maximum number of words in which we can split s[]. Below is the sketch/pseudocode of the algorithm

bestScore[i] -> Stores the maximum number of words in which the first i characters can be split (it would be MINUS_INFINITY otherwise)

for (i = 1 to n){
     bestScore[i] = MINUS_INFINITY
     for (k = 1 to i-1){
        bestScore[i] = Max(bestSCore[i], bestScore[i-k]+ f(i,k))
     }
 }

Where f(i,k) is defined as:

f(i,k) = 1 : if s[i-k+1 to i] is in dictionary
       = MINUS_INFINITY : otherwise

bestScore[n] would store the maximum number of words in which s[] can be split (if the value is MINUS_INFINIY, s[] cannot be split)

Clearly the running time is O(n^2)

As this looks like a textbook exercise, I will not write the code to reconstruct the actual split positions.

Below is an O(n^2) solution for this problem.

void findstringvalid() {
string s = "itwasthebestoftimes";
set<string> dict;
dict.insert("it");
dict.insert("was");
dict.insert("the");
dict.insert("best");
dict.insert("of");
dict.insert("times");

vector<bool> b(s.size() + 1, false);
vector<int> spacepos(s.size(), -1);
//Initialization phase
b[0] = true; //String of size 0 is always a valid string
for (int i = 1; i <= s.size(); i++) {
    for (int j = 0; j <i; j++) {
       //string of size s[ j... i]
       if (!b[i]) {
           if (b[j]) {
              //check if string "j to i" is in dictionary
              string temp = s.substr(j, i - j);
              set<string>::iterator it = dict.find(temp);
              if (it != dict.end()) {
                  b[i] = true;
                  spacepos[i-1] = j;
              }
           }
        }
    }
}
if(b[s.size()])
    for (int i = 1; i < spacepos.size(); i++) {
        if (spacepos[i] != -1) {
            string temp = s.substr(spacepos[i], i - spacepos[i] + 1);
            cout << temp << " ";
    }
    }
}

来源：https://stackoverflow.com/questions/5310756/split-a-string-to-a-string-of-valid-words-using-dynamic-programming

标签

algorithm

dynamic

big-o

text-segmentation