Stuck finding deepest path in general tree traversal trying to find largest common substring

Question

I am trying to find the longest common substring of two strings. I will reduce my problem to the following: I built a generalized suffix tree, and as I understand it, the longest common substring is the deepest path consisting of nodes that belong to both strings.

My test input is:

String1 = xabc
String2 = abc

The tree I build seems to be correct; my problem is the following method (I initially pass it the root of the tree):

private void getCommonSubstring(SuffixNode node) {
    if (node == null)
        return;
    if (node.from == ComesFrom.Both) {
        current.add(node);
    } else {
        if (max == null || current.size() > max.size()) {
            max = current;
        }
        current = new ArrayList<SuffixNode>();
    }
    for (SuffixNode n : node.children) {
        getCommonSubstring(n);
    }
}

What I was aiming to do, in order to find the deepest path with nodes that belong to both strings, was to traverse the tree (pre-order) and add the nodes that belong to both strings to a list (current). Once I reach a node that is not part of both strings, I update the max list if current is bigger.

But the code is wrong, and I am confused about how to implement this, since I haven't written code for general (non-binary) trees in ages.

Could you help me figure this out?

Update:
Modified as per @templatetypedef. Could not make this work either.

private void getCommonSubstring(SuffixNode node, List<SuffixNode> nodes) {
    if (node == null)
        return;
    if (node.from == ComesFrom.Both) {
        nodes.add(node);
    } else {
        if (max == null || current.size() > max.size()) {
            max = nodes;
        }
        nodes = new ArrayList<SuffixNode>();
    }
    for (SuffixNode n : node.children) {
        List<SuffixNode> tmp = new ArrayList<SuffixNode>(nodes);
        getCommonSubstring(n, tmp);
    }
}


public class SuffixNode {
    Character character;  
    Collection<SuffixNode> children;  
    ComesFrom from;  
    Character endMarker;  
}

Answer 1:

One issue I see here is that the depth of a node in the suffix tree is not the same as the length of the string along that path. Each edge in a suffix tree is annotated with a range of characters, so a string encoded by a series of nodes of depth five might have shorter length than a string encoded at depth two. You will probably need to adjust your algorithm to handle this by tracking the effective length of the string that you've built up so far, rather than the number of nodes in the path that you've traced out up to this point.
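To make the distinction concrete, here is a minimal sketch. It assumes each edge label is available as a String; the class name StringDepth, the method stringDepth, and the edge labels themselves are made up for illustration and do not come from the question's code.

```java
import java.util.Arrays;
import java.util.List;

class StringDepth {
    // Length of the string spelled out by a root-to-node path:
    // the sum of the edge-label lengths, not the number of edges.
    static int stringDepth(List<String> edgeLabels) {
        int length = 0;
        for (String label : edgeLabels) {
            length += label.length();
        }
        return length;
    }

    public static void main(String[] args) {
        // Hypothetical paths: five single-character edges vs. two longer edges.
        List<String> deepPath = Arrays.asList("a", "b", "c", "d", "e");
        List<String> shallowPath = Arrays.asList("abcd", "efg");

        System.out.println(deepPath.size() + " nodes deep, length " + stringDepth(deepPath));
        System.out.println(shallowPath.size() + " nodes deep, length " + stringDepth(shallowPath));
    }
}
```

Here the path that is five nodes deep spells a string of length 5, while the path that is only two nodes deep spells a string of length 7, so comparing node counts would pick the wrong one.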

A second issue I just noticed is that you seem to have only one current variable, which is shared across all the different branches of the recursion. This is probably corrupting your state across recursive calls. For example, suppose that you're at a node and have a path of length three, and that there are two children: the first lies only on a suffix of the first string, and the second lies on suffixes of both strings. In that case, when you make the recursive call on the first child, you will end up executing the line

current = new ArrayList<SuffixNode>();

in the recursive call. This will then clear all your history, so when you return from this call back up to the original node and descend into the second child node, you will act as if there is no list of nodes built up so far, instead of continuing on from the three nodes you found so far.

To fix this, I'd suggest making current a parameter to the function and then creating a new ArrayList when needed, rather than wiping out the existing ArrayList. Some of the other logic might have to change as well to account for this.
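As a rough sketch of that idea (not a drop-in fix for your class), the version below passes the path as a parameter and lets every call extend its own copy. SuffixNode and ComesFrom here are simplified stand-ins modeled on the question, with one character per node; PathTracking, longestSharedPath, walk, label and demoTree are names I made up, and the tree for "xabc" and "abc" is built by hand with end markers left out.

```java
import java.util.ArrayList;
import java.util.List;

class PathTracking {
    enum ComesFrom { First, Second, Both }

    // Simplified stand-in for the question's SuffixNode (one character per node).
    static final class SuffixNode {
        final char character;
        final ComesFrom from;
        final List<SuffixNode> children = new ArrayList<>();

        SuffixNode(char character, ComesFrom from, SuffixNode... kids) {
            this.character = character;
            this.from = from;
            for (SuffixNode kid : kids) children.add(kid);
        }
    }

    // Longest path that starts directly below the root and contains only Both nodes.
    static List<SuffixNode> longestSharedPath(SuffixNode root) {
        List<SuffixNode> best = new ArrayList<>();
        for (SuffixNode child : root.children) {
            walk(child, new ArrayList<>(), best);
        }
        return best;
    }

    private static void walk(SuffixNode node, List<SuffixNode> pathSoFar, List<SuffixNode> best) {
        if (node == null || node.from != ComesFrom.Both) {
            return; // ends this branch only; the caller's list is untouched
        }
        // Each call extends its own copy, so sibling branches never see each other's nodes.
        List<SuffixNode> path = new ArrayList<>(pathSoFar);
        path.add(node);
        if (path.size() > best.size()) {
            best.clear();
            best.addAll(path);
        }
        for (SuffixNode child : node.children) {
            walk(child, path, best);
        }
    }

    // Concatenates the characters along a path.
    static String label(List<SuffixNode> path) {
        StringBuilder sb = new StringBuilder();
        for (SuffixNode n : path) sb.append(n.character);
        return sb.toString();
    }

    // Hand-built tree for "xabc" and "abc" (end markers omitted).
    static SuffixNode demoTree() {
        ComesFrom f = ComesFrom.First, b = ComesFrom.Both;
        return new SuffixNode('\0', b,
            new SuffixNode('x', f, new SuffixNode('a', f, new SuffixNode('b', f, new SuffixNode('c', f)))),
            new SuffixNode('a', b, new SuffixNode('b', b, new SuffixNode('c', b))),
            new SuffixNode('b', b, new SuffixNode('c', b)),
            new SuffixNode('c', b));
    }

    public static void main(String[] args) {
        System.out.println(label(longestSharedPath(demoTree()))); // prints abc
    }
}
```

Copying the list on every call is the simplest way to keep branches independent. Adding the node and removing it again after the loop (backtracking) would avoid the copies, at the cost of being easier to get wrong.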

Given this, I would suggest writing the function along the lines of the following pseudocode (since I don't know the particulars of your suffix tree implementation):

If the current node is null, return 0.
If the current node doesn't come from both strings, return 0.
Otherwise:
- Set maxLen = 0.
- For each child node that comes from both strings:
  - Compute the length of the longest common substring rooted at that child.
  - Add to that length the number of characters along the edge to that child.
  - Update maxLen if this exceeds the old value.
- Return maxLen.
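One way to turn the pseudocode above into Java is sketched below. Since I don't know your tree class, everything here is an assumption: the Node type with its edgeLabel, inFirst and inSecond fields is hypothetical, and the tree is built naively with one character per edge (quadratic time and space) just to have something to run. The sketch only counts an edge when the child it leads to is shared by both strings.

```java
import java.util.HashMap;
import java.util.Map;

class LongestCommonDepth {
    // Hypothetical node type: the label on the edge leading into the node,
    // plus flags recording which input strings have a suffix passing through it.
    static final class Node {
        final String edgeLabel;
        boolean inFirst, inSecond;
        final Map<Character, Node> children = new HashMap<>();

        Node(String edgeLabel) { this.edgeLabel = edgeLabel; }

        boolean inBoth() { return inFirst && inSecond; }
    }

    // Naive construction: inserts every suffix of s, one character per edge.
    // A real suffix tree would compress chains of nodes into longer edge labels.
    static void insertSuffixes(Node root, String s, boolean first) {
        for (int i = 0; i < s.length(); i++) {
            Node cur = root;
            for (int j = i; j < s.length(); j++) {
                cur = cur.children.computeIfAbsent(s.charAt(j), k -> new Node(String.valueOf(k)));
                if (first) cur.inFirst = true; else cur.inSecond = true;
            }
        }
    }

    // Length of the longest string that starts at node and only passes through shared nodes.
    static int longestCommon(Node node) {
        if (node == null || !node.inBoth()) return 0;
        int maxLen = 0;
        for (Node child : node.children.values()) {
            if (!child.inBoth()) continue; // don't count the edge into a non-shared child
            maxLen = Math.max(maxLen, child.edgeLabel.length() + longestCommon(child));
        }
        return maxLen;
    }

    static int longestCommonSubstringLength(String a, String b) {
        Node root = new Node("");
        root.inFirst = true;
        root.inSecond = true;
        insertSuffixes(root, a, true);
        insertSuffixes(root, b, false);
        return longestCommon(root);
    }

    public static void main(String[] args) {
        System.out.println(longestCommonSubstringLength("xabc", "abc")); // prints 3
    }
}
```

Because longestCommon adds edgeLabel.length() rather than 1, the same recursion should carry over unchanged to a compressed tree whose edges hold several characters.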

Hope this helps!

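Answer 2:

While not an answer to the suffix tree question, this is how I would solve it using standard collections. It does n lookups in a sorted set, each taking O(log n) string comparisons; note that a single string comparison can itself take time proportional to the string length.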

import java.util.NavigableSet;
import java.util.TreeSet;

static String findLongestCommonSubstring(String s1, String s2) {
    if (s1.length() > s2.length()) return findLongestCommonSubstring(s2, s1);

    NavigableSet<String> substrings = new TreeSet<>();
    for (int i = 0; i < s1.length(); i++)
        substrings.add(s1.substring(i));
    String longest = "";
    for (int i = 0; i < s2.length(); i++) {
        String sub2 = s2.substring(i);
        String floor = match(substrings.floor(sub2), sub2);
        String ceiling = match(substrings.ceiling(sub2), sub2);
        if (floor.length() > longest.length())
            longest = floor;
        if (ceiling.length() > longest.length())
            longest = ceiling;
    }
    return longest;
}

private static String match(String s1, String s2) {
    if (s1 == null || s2 == null) return "";
    for (int i = 0; i < s1.length() && i < s2.length(); i++)
        if (s1.charAt(i) != s2.charAt(i))
            return s1.substring(0, i);
    return s1.substring(0, Math.min(s1.length(), s2.length()));
}

public static void main(String... args) {
    System.out.println(findLongestCommonSubstring("sdlkjfsdkljfkljsdlfkjaeakjf", "kjashdkasjdlkjasdlfkjaesdlk"));
}

prints

sdlfkjae

Answer 3:

Do you HAVE to go the route of a suffix tree? If not, why couldn't you:

public static String findCommonSubString(String str1, String str2) {
   String mainStr;   // the shorter string; candidates are taken from it
   String otherStr;  // the longer string; candidates are searched for in it
   String commonStr = "";
   String foundCommonStr;
   boolean strGrowing;

   if (str1.length() <= str2.length()) {
      mainStr = str1;
      otherStr = str2;
   } else {
      mainStr = str2;
      otherStr = str1;
   }

   for (int x = 0; x < mainStr.length(); x++) {
      // strCount is the length of the candidate starting at index x.
      int strCount = 1;
      strGrowing = true;

      while (strGrowing && x + strCount <= mainStr.length()) {
         // Java's substring takes an exclusive end index, not a length.
         foundCommonStr = mainStr.substring(x, x + strCount);

         if (otherStr.indexOf(foundCommonStr) >= 0) {
            // Found a match: keep it if it is the longest so far, then add a character.
            if (foundCommonStr.length() > commonStr.length()) {
               commonStr = foundCommonStr;
            }
            strCount++;
         } else {
            strGrowing = false;
         }
      }
   }

   return commonStr;
}

I have not run this, but I will. Basically, it starts with the smaller of the two strings and tries to find a substring common to both strings using indexOf and substring. Whenever it finds a match, it checks again with one more character from the smaller string added to the candidate. It only stores the candidate in the commonStr variable if the string found (foundCommonStr) is longer than the current commonStr. Once a candidate fails to match, the longest match seen so far is already stored in commonStr, ready to be returned.

I believe the idea is sound but I haven't run this in the compiler.

Source: https://stackoverflow.com/questions/14146843/stuck-finding-deepest-path-in-general-tree-traversal-trying-to-find-largest-comm

Tags

java

algorithm

data-structures

tree

suffix-tree