Find the longest word given a collection

Submitted by 落花浮王杯 on 2019-12-20 08:29:20

Question


It is a Google interview question, and most answers I found online use a HashMap or a similar data structure. I am trying to find a solution using a Trie, if possible. Could anybody give me some hints?

Here is the question: You are given a dictionary, in the form of a file that contains one word per line. E.g.,

abacus 
deltoid 
gaff 
giraffe 
microphone 
reef 
qar 

You are also given a collection of letters. E.g.,

{a, e, f, f, g, i, r, q}. 

The task is to find the longest word in the dictionary that can be spelled with the collection of letters. For example, the correct answer for the example values above is “giraffe”. (Note that “reef” is not a possible answer, because the set of letters contains only one “e”.)

Java implementation would be preferred.


Answer 1:


No Java code. You can figure that out for yourself.

Assuming that we need to do this lots of times, here's what I'd do:

  • I'd start by creating "signatures" for each word in the dictionary consisting of 26 bits, where bit[letter] is set iff the word contains one (or more) instances of letter. These signatures can be encoded as a Java int.

  • Then create a mapping that maps signatures to lists of words with that signature.

To do a search using the precomputed map:

  • Create the signature for the set of letters you want to find the words for.

  • Then iterate over the keys of the mapping, looking for keys where (key & ~signature) == 0. That gives you a short list of "possibles" that don't contain any letter that is not in the required letter set.

  • Iterate over the short list looking for words with the right number of each of the required letters, recording the longest hit.
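The steps above can be sketched in Java like this (the class and helper names are my own, not from the answer):

```java
import java.util.*;

public class SignatureSearch {
    // Build a 26-bit signature: bit i is set iff the word contains letter 'a'+i.
    static int signature(String word) {
        int sig = 0;
        for (char c : word.toCharArray()) sig |= 1 << (c - 'a');
        return sig;
    }

    // Check exact letter counts, since the signature ignores multiplicity.
    static boolean spellable(String word, int[] counts) {
        int[] need = new int[26];
        for (char c : word.toCharArray()) need[c - 'a']++;
        for (int i = 0; i < 26; i++) if (need[i] > counts[i]) return false;
        return true;
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList("abacus", "deltoid", "gaff",
                "giraffe", "microphone", "reef", "qar");
        char[] letters = {'a', 'e', 'f', 'f', 'g', 'i', 'r', 'q'};

        // Precompute: map signature -> words with that signature.
        Map<Integer, List<String>> bySig = new HashMap<>();
        for (String w : dict)
            bySig.computeIfAbsent(signature(w), k -> new ArrayList<>()).add(w);

        // Signature and letter counts for the query collection.
        int letterSig = 0;
        int[] counts = new int[26];
        for (char c : letters) { letterSig |= 1 << (c - 'a'); counts[c - 'a']++; }

        String best = "";
        for (Map.Entry<Integer, List<String>> e : bySig.entrySet()) {
            if ((e.getKey() & ~letterSig) != 0) continue; // uses a letter we lack
            for (String w : e.getValue())
                if (spellable(w, counts) && w.length() > best.length()) best = w;
        }
        System.out.println(best); // giraffe
    }
}
```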


Notes:

  1. While the primary search is roughly O(N) on the number of words in the dictionary, the test is extremely cheap.

  2. This approach has the advantage of requiring a relatively small in-memory data structure, that (most likely) has good locality. That is likely to be conducive to faster searches.


Here's an idea for speeding up the O(N) search step above.

Starting with the signature map above, create (precompute) derivative maps for all words that contain specific pairs of letters; i.e. one for words containing AB, one for AC, one for BC, ... and so on up to YZ. Then if you are looking for words containing (say) P and Q, you can just scan the PQ derivative map. That will reduce the O(N) step by a factor of roughly 26^2 ... at the cost of more memory for the extra maps.

That can be extended to 3 or more letters, but the downside is the explosion in memory usage.

Another potential tweak is to (somehow) bias the selection of the initial letter pair towards letters or pairs that occur infrequently. But that adds up-front overhead which could be greater than the (average) saving you get from searching a shorter list.
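The pair-derivative maps could be precomputed along these lines (a sketch with made-up names; the pair key packs two letter indices):

```java
import java.util.*;

public class PairMaps {
    // Same 26-bit signature as before: bit i set iff the word contains 'a'+i.
    static int sig(String w) {
        int s = 0;
        for (char c : w.toCharArray()) s |= 1 << (c - 'a');
        return s;
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList("abacus", "deltoid", "gaff",
                "giraffe", "microphone", "reef", "qar");

        // For each unordered letter pair (a,b), keep only the words whose
        // signature contains both letters: 26*25/2 = 325 small lists.
        Map<Integer, List<String>> byPair = new HashMap<>();
        for (String w : dict) {
            int s = sig(w);
            for (int a = 0; a < 26; a++) {
                if ((s & (1 << a)) == 0) continue;
                for (int b = a + 1; b < 26; b++) {
                    if ((s & (1 << b)) == 0) continue;
                    byPair.computeIfAbsent(a * 26 + b, k -> new ArrayList<>()).add(w);
                }
            }
        }

        // A query whose letter set contains 'g' and 'i' only scans this short list.
        List<String> gi = byPair.get(('g' - 'a') * 26 + ('i' - 'a'));
        System.out.println(gi); // [giraffe]
    }
}
```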




Answer 2:


I suspect a Trie-based implementation wouldn't be very space-efficient, but it would parallelize very nicely, because you could descend into all branches of the tree in parallel and collect the deepest node reachable from each top-level branch with the given set of letters. In the end, you just collect all the deepest nodes and select the longest one.

I'd start with this algorithm (sorry, just pseudo-code), which doesn't attempt to parallelize but just uses plain old recursion (and backtracking) to find the longest match:

TrieNode visitNode( TrieNode n, LetterCollection c )
{
    TrieNode deepestNode = n;
    for each Letter l in c:
        TrieNode childNode = n.getChildFor( l );

        if childNode:
            TrieNode deepestSubNode = visitNode( childNode, c.without( l ) );
            if deepestSubNode.stringLength > deepestNode.stringLength:
                deepestNode = deepestSubNode;
    return deepestNode;
}

I.e., this function starts at the root node of the trie with the entire given letter collection. For each letter in the collection, you try to find a child node. If there is one, you recurse with that letter removed from the collection. At some point your letter collection will be empty (best case: all letters consumed, so you could actually bail out right away without continuing to traverse the trie), or there will be no child for any of the remaining letters; in that case you return the node itself, because that's your "longest match".

This could parallelize quite nicely if you changed the recursion step so that you visit all children in parallel, collect the results, select the longest result, and return that.
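A runnable Java version of the pseudo-code above might look like this (the names and the letter-count bookkeeping are my additions):

```java
import java.util.*;

public class TrieLongest {
    static class TrieNode {
        Map<Character, TrieNode> children = new HashMap<>();
        String word; // non-null only where a dictionary word ends
    }

    static TrieNode root = new TrieNode();

    static void add(String w) {
        TrieNode n = root;
        for (char c : w.toCharArray())
            n = n.children.computeIfAbsent(c, k -> new TrieNode());
        n.word = w;
    }

    // Recursive descent with backtracking: try each available letter,
    // recurse with that letter consumed, and keep the longest word found.
    static String visit(TrieNode n, Map<Character, Integer> avail) {
        String best = n.word;
        for (Map.Entry<Character, TrieNode> e : n.children.entrySet()) {
            Integer cnt = avail.get(e.getKey());
            if (cnt == null || cnt == 0) continue;
            avail.put(e.getKey(), cnt - 1);          // consume the letter
            String sub = visit(e.getValue(), avail);
            avail.put(e.getKey(), cnt);              // backtrack
            if (sub != null && (best == null || sub.length() > best.length()))
                best = sub;
        }
        return best;
    }

    public static void main(String[] args) {
        for (String w : Arrays.asList("abacus", "deltoid", "gaff", "giraffe",
                "microphone", "reef", "qar")) add(w);
        Map<Character, Integer> avail = new HashMap<>();
        for (char c : "aeffgirq".toCharArray()) avail.merge(c, 1, Integer::sum);
        System.out.println(visit(root, avail)); // giraffe
    }
}
```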




Answer 3:


First off, nice question. The interviewer wants to see how you tackle the problem. In those kinds of problems you are required to analyse the problem and carefully choose a data structure.

In this case, two data structures come to mind: HashMaps and Tries. HashMaps are not a good fit, because you don't have a complete key to look up (you can use an inverted index based on maps, but you said you had already found those solutions). You only have parts of the key; that is where the Trie is the best fit.

So the idea with tries is that, while traversing the tree, you can ignore branches whose characters are not in your collection of letters.

In your case, the tree looks like this (I collapsed the non-branching paths):

*
   a
     bacus
   d 
     deltoid
   g
     a
       gaff
     i
       giraffe
   m 
     microphone
   r 
     reef
   q 
     qar

So at each level of this trie, we look at the children of the current node and check whether each child's character is in our collection of letters.

If yes, we go deeper into that subtree and remove the child's character from our collection.

This goes on until you hit a leaf (no children anymore); there you know that this word uses only characters from the collection, so it is a possible candidate. Then we backtrack in the tree until we find another match to compare. If the newly found match is shorter, discard it; if it is longer, it becomes our best candidate so far.

Eventually the recursion will finish and you'll end up with the desired output.

Note that this works if there is a single longest word; otherwise you would have to return a list of candidates (this is the open-ended part of the interview where you are expected to ask what the interviewer wants as a solution).

Since you asked for Java code, here it is, with a simplistic Trie and the single-longest-word version:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LongestWord {

  class TrieNode {
    char value;
    List<TrieNode> children = new ArrayList<>();
    String word;

    public TrieNode() {
    }

    public TrieNode(char val) {
      this.value = val;
    }

    public void add(char[] array) {
      add(array, 0);
    }

    public void add(char[] array, int offset) {
      for (TrieNode child : children) {
        if (child.value == array[offset]) {
          child.add(array, offset + 1);
          return;
        }
      }
      TrieNode trieNode = new TrieNode(array[offset]);
      children.add(trieNode);
      if (offset < array.length - 1) {
        trieNode.add(array, offset + 1);
      } else {
        trieNode.word = new String(array);
      }
    }    
  }

  private TrieNode root = new TrieNode();

  public LongestWord() {
    List<String> asList = Arrays.asList("abacus", "deltoid", "gaff", "giraffe",
        "microphone", "reef", "qar");
    for (String word : asList) {
      root.add(word.toCharArray());
    }
  }

  public String search(char[] cs) {
    return visit(root, cs);
  }

  public String visit(TrieNode n, char[] allowedCharacters) {
    String bestMatch = null;
    if (n.children.isEmpty()) {
      // base case, leaf of the trie, use as a candidate
      bestMatch = n.word;
    }

    for (TrieNode child : n.children) {
      if (contains(allowedCharacters, child.value)) {
        // remove this child's value and descent into the trie
        String result = visit(child, remove(allowedCharacters, child.value));
        // if the result wasn't null, check length and set
        if (bestMatch == null || result != null
            && bestMatch.length() < result.length()) {
          bestMatch = result;
        }
      }
    }
    // always return the best known match thus far
    return bestMatch;
  }

  private char[] remove(char[] allowedCharacters, char value) {
    char[] newDict = new char[allowedCharacters.length - 1];
    int index = 0;
    for (char x : allowedCharacters) {
      if (x != value) {
        newDict[index++] = x;
      } else {
        // we removed the first hit, now copy the rest
        break;
      }
    }
    System.arraycopy(allowedCharacters, index + 1, newDict, index,
        allowedCharacters.length - (index + 1));

    return newDict;
  }

  private boolean contains(char[] allowedCharacters, char value) {
    for (char x : allowedCharacters) {
      if (value == x) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    LongestWord lw = new LongestWord();
    String longestWord = lw.search(new char[] { 'a', 'e', 'f', 'f', 'g', 'i',
        'r', 'q' });
    // yields giraffe
    System.out.println(longestWord);
  }

}

I can also only suggest reading the book Cracking the Coding Interview: 150 Programming Questions and Solutions; it guides you through the decision-making and construction of such algorithms, specialized for interview questions.




Answer 4:


Disclaimer: this is not a trie solution, but I still think it's an idea worth exploring.

Create some sort of hash function that accounts only for the letters in a word and not for their order (collisions should only be possible between permutations). For example, ABCD and DCBA both generate the same hash (but ABCDD does not). Build such a hash table containing every word in the dictionary, using chaining to link collisions (on the other hand, unless you have a strict requirement to find all of the longest words rather than just one, you can simply drop collisions, which are just permutations, and forgo the chaining entirely).

Now, if your search set is 4 characters long, for example A, B, C, D, then as a naïve search you check the following hashes to see if they are already contained in the dictionary:

hash(A), hash(B), hash(C), hash(D) // 1-combinations
hash(AB), hash(AC), hash(AD), hash(BC), hash(BD), hash(CD) // 2-combinations
hash(ABC), hash(ABD), hash(ACD), hash(BCD) // 3-combinations
hash(ABCD) // 4-combinations

If you search the hashes in that order, the last match you find will be the longest one.

This ends up having a run time which depends on the length of the search set rather than the length of the dictionary. If M is the number of characters in the search set, then the number of hash lookups is (M choose 1) + (M choose 2) + ... + (M choose M), which is the size of the power set of the search set, so it's O(2^M). At first glance this sounds really bad since it's exponential, but to put things in perspective: if your search set has size 10, there will only be around 1000 lookups, which is probably a lot smaller than your dictionary size in a practical real-world scenario. At M = 15 we get about 32,000 lookups, and really, how many English words are longer than 15 characters?

There are two (alternate) ways I can think of to optimize it though:

1) Search for longer matches first e.g. M-combinations then (M-1)-combinations, etc. As soon as you find a match, you can stop! Chances are you will only cover a small fraction of your search space, probably at worst half.

2) Search for shorter matches first (1-combos, 2-combos, etc). Say you come up with a miss at level 2 (for example, no string in your dictionary is composed only of A and B). Use an auxiliary data structure (a bitmap perhaps) that allows you to check if any word in the dictionary is even partially composed of A and B (in contrast to your primary hash table which checks for complete composition). If you get a miss on the secondary bitmap also, then you know that you can skip all higher level combinations including A and B (i.e. you can skip hash(ABC), hash(ABD), and hash(ABCD) because no words contain both A and B). This leverages the Apriori principle and would drastically reduce the search space as M grows and misses become more frequent. EDIT: I realize that the details I abstract away relating to the "auxiliary data structure" are significant. As I think more about this idea, I realize it is leaning toward a complete dictionary scan as a subprocedure, which defeats the point of this entire approach. Still, it seems there should be a way to use the Apriori principle here.
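Optimization 1 (longest-first search over the combinations) can be sketched like this, using a sorted-letter string as the order-insensitive hash key (all names are illustrative):

```java
import java.util.*;

public class CombinationSearch {
    public static void main(String[] args) {
        List<String> dict = Arrays.asList("abacus", "deltoid", "gaff",
                "giraffe", "microphone", "reef", "qar");
        char[] letters = {'a', 'e', 'f', 'f', 'g', 'i', 'r', 'q'};
        int n = letters.length;

        // Order-insensitive "hash": the word's letters, sorted.
        Map<String, String> byKey = new HashMap<>();
        for (String w : dict) {
            char[] cs = w.toCharArray();
            Arrays.sort(cs);
            byKey.put(new String(cs), w);
        }

        // Try subsets largest-first; the first hit is a longest word.
        String best = null;
        for (int size = n; size > 0 && best == null; size--) {
            for (int mask = 0; mask < (1 << n); mask++) {
                if (Integer.bitCount(mask) != size) continue;
                char[] cs = new char[size];
                for (int i = 0, j = 0; i < n; i++)
                    if ((mask & (1 << i)) != 0) cs[j++] = letters[i];
                Arrays.sort(cs);
                String hit = byKey.get(new String(cs));
                if (hit != null) { best = hit; break; }
            }
        }
        System.out.println(best); // giraffe
    }
}
```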




Answer 5:


I think the above answers missed the key point. We have a space with 27 dimensions: the first is the word's length, and the other 26 are the counts of each letter. In that space, words are points. The first coordinate of a word is its length; each remaining coordinate is the number of occurrences of the corresponding letter in that word. For example, the words abacus, deltoid, gaff, giraffe, microphone, reef, qar, abcdefghijklmnopqrstuvwxyz have coordinates

[6, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
[7, 0, 0, 0, 2, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
[4, 1, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[7, 1, 0, 0, 0, 1, 2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[10, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[4, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[26, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

A good structure for a set of points with such coordinates is an R-tree or an R*-tree. Given your collection [x0, x1, ..., x26], you ask for all the words that contain at most x_i occurrences of letter i, for each letter. Such a search is O(log N), where N is the number of words in your dictionary. However, you don't want to scan all the words that match your query just to find the biggest one. This is why the first dimension is important.

You know that the length of the biggest word is between 0 and X, where X = sum(x_i, i = 1..26). You can search iteratively from X down to 1, but you can also binary-search on the length, using the first dimension of your array in the query. You start with a = X and b = X/2. If there is at least one match, you search between (a+b)/2 and a; otherwise you search between (3b-a)/2 and b. You repeat until b - a = 1. You then have the biggest length, and all the matches with this length.

This algorithm is asymptotically much more efficient than the linear scans above: the search is O(log N × log X). The implementation depends on the R-tree library you use.
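As a small illustration of the coordinates themselves (the R-tree indexing would come from a library, which this sketch omits):

```java
import java.util.Arrays;

public class Coordinates {
    // 27-dimensional point: [length, count('a'), ..., count('z')]
    static int[] point(String w) {
        int[] p = new int[27];
        p[0] = w.length();
        for (char c : w.toCharArray()) p[1 + c - 'a']++;
        return p;
    }

    public static void main(String[] args) {
        // Matches the "gaff" row above: length 4, one a, two f, one g.
        System.out.println(Arrays.toString(point("gaff")));
    }
}
```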




Answer 6:


Groovy (almost Java):

def letters = ['a', 'e', 'f', 'f', 'g', 'i', 'r', 'q']
def dictionary = ['abacus', 'deltoid', 'gaff', 'giraffe', 'microphone', 'reef', 'qar']
println dictionary
    .findAll{ it.toList().intersect(letters).size() == it.size() }
    .sort{ -it.size() }.head()

One caveat: the `intersect` filter may not fully respect letter multiplicities (e.g. "reef" with its two e's), so this one-liner happens to work here because "giraffe" is the longest valid word anyway. The choice of collection type to hold the dictionary is irrelevant to the algorithm. If you're supposed to implement a trie, that's one thing. Otherwise, just create one from an appropriate library to hold the data. Neither Java nor Groovy has one in its standard library that I'm aware of.




Answer 7:


I tried to solve this problem in C++, where I created my own hash key and went through all the combinations of the given input characters, from the largest length down to 1.

Here is my solution

#include <iostream>
#include <cstdlib>   // malloc/free
#include <string>

using namespace std;

// Note: summing character codes makes different words with the same code sum
// collide (later insertions overwrite earlier ones); fine for this tiny demo.
int hash_f(string s){
        int key=0;
        for(unsigned int i=0;i<s.size();i++){
           key += s[i];
        }
        return key;
}

class collection{

string str[10000];

public: 
collection(){
    str[hash_f( "abacus")] = "abacus"; 
    str[hash_f( "deltoid")] = "deltoid"; 
    str[hash_f( "gaff")] = "gaff"; 
    str[hash_f( "giraffe")] = "giraffe"; 
    str[hash_f( "microphone")] = "microphone"; 
    str[hash_f( "reef")] = "reef"; 
    str[hash_f( "qar")] = "qar"; 
}

string  find(int _key){
    return str[_key];
}
};

string sub_str(string s,int* indexes,int n ){
    char c[20];
    int i=0;
    for(;i<n;i++){
        c[i] = s[indexes[i]];
    }
    c[i] = 0;
    return string(c);
}

string* combination_m_n(string str , int m,int n , int& num){

    string* result = new string[100];
    int index = 0;

    int * indexes = (int*)malloc(sizeof(int)*n);

    for(int i=0;i<n;i++){
        indexes[i] = i; 
    }

    while(1){
            result[index++] = sub_str(str , indexes,n);
            bool reset = true;
            for(int i=n-1;i>0;i--)
            {
                if( ((i==n-1)&&indexes[i]<m-1) ||  (indexes[i]<indexes[i+1]-1))
                {
                    indexes[i]++;
                    for(int j=i+1;j<n;j++) 
                        indexes[j] = indexes[j-1] + 1;
                    reset = false;
                    break;
                }
            }
            if(reset){
                indexes[0]++;
                if(indexes[0] + n > m) 
                    break;
                for(int i=1;i<n;i++)
                    indexes[i] = indexes[0]+i;
            }
    }
    free(indexes);
    num = index;
    return result;
}


int main(int argc, char* argv[])
{
    string str = "aeffgirq";
    string* r;
    int num;

    collection c;
    for(int i=8;i>0;i--){
        r = combination_m_n(str, str.size(), i, num);
        for(int j=0;j<num;j++){
            int key = hash_f(r[j]);
            string temp = c.find(key);
            if( temp != "" ){
                cout << temp << endl;  // first hit is the longest, stop here
                delete[] r;
                return 0;
            }
        }
        delete[] r;
    }
    return 0;
}



Answer 8:


Assuming a large dictionary and a letter set with fewer than 10 or 11 members (such as the example given), the fastest method is to build a tree containing the possible words the letters can make, then match the word list against the tree. In other words, your letter tree's root has seven subnodes: { a, e, f, g, i, r, q }. The branch of "a" has six subnodes { e, f, g, i, r, q }, etc. The tree thus contains every possible word which can be made with these letters.

Go through each word in the list and match it against the tree. If the match is maximum length (uses all the letters), you are done. If the word is shorter than max but longer than any previously matched word, remember it; this is the "longest word so far" (LWSF). Ignore any words with length equal to or less than that of the LWSF. Also, ignore any words which are longer than the letter list.

This is a linear-time algorithm once the letter tree is constructed, so as long as the word list is significantly larger than the letter tree, it is the fastest method.
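One compact way to realize the letter tree is to materialize it as the set of sorted subset strings, then match each word's sorted letters against that set (a sketch; the names are my own):

```java
import java.util.*;

public class LetterTreeMatch {
    public static void main(String[] args) {
        char[] letters = {'a', 'e', 'f', 'f', 'g', 'i', 'r', 'q'};
        List<String> dict = Arrays.asList("abacus", "deltoid", "gaff",
                "giraffe", "microphone", "reef", "qar");

        // Materialize the "letter tree": every multiset-subset of the
        // letters, in sorted form so order doesn't matter.
        Set<String> subsets = new HashSet<>();
        int n = letters.length;
        for (int mask = 0; mask < (1 << n); mask++) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n; i++)
                if ((mask & (1 << i)) != 0) sb.append(letters[i]);
            char[] cs = sb.toString().toCharArray();
            Arrays.sort(cs);
            subsets.add(new String(cs));
        }

        // Linear pass over the word list, keeping the longest word so far
        // and skipping words too long or no longer than the current best.
        String lwsf = "";
        for (String w : dict) {
            if (w.length() > n || w.length() <= lwsf.length()) continue;
            char[] cs = w.toCharArray();
            Arrays.sort(cs);
            if (subsets.contains(new String(cs))) lwsf = w;
        }
        System.out.println(lwsf); // giraffe
    }
}
```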




Answer 9:


The first thing to note is that you can completely ignore the letter order.

Have a trie (well, sort of a trie) as follows:

  • From the root, have 26 children (maximum), one for each letter.
  • From each non-root node, have children for the letters greater than or equal to the node's letter.
  • Have each node store all words that can be made using (exactly) the letters in the path from the root.

Build the trie like this:

For each word, sort the letters of this word and insert the sorted letters into the trie (by creating a path of these letters from the root), creating all required nodes as you go. And store the word at the final node.

How to do a look-up:

For a given set of letters, look up all subsets of the letters (most of which hopefully won't exist as paths) and output the words at each node encountered.

Complexity:

O(2^k), where k is the number of supplied letters, since each sorted subset of the letters is explored at most once. Eek! But luckily, the fewer words there are in the trie, the fewer of these paths will exist and the less time this will take. And k is the number of supplied letters (which should be relatively small), not the number of words in the trie.

Actually, it may be more along the lines of O(min(2^k, n)), which looks a lot better. Note that if you're given enough letters, you'll have to look up all words, thus you have to do O(n) work in the worst case; so, in terms of worst-case complexity, you can't do much better.

Example:

Input:

aba
b
ad
da
la
ma

Sorted:

aab
b
ad
ad
al
am

Trie: (just showing non-null children)

     root
     /  \
    a    b
 /-/|\-\
a b d l m
|
b

Lookup of adb:

  • From the root...
  • Go to child a
    • Go to child b
      • No children, return
    • Go to child d
      • Output words at node - ad and da
      • No children, return
    • All letters processed, return
  • Go to child b
    • Output words at node - b
    • Not looking for an a child, as only children >= b exist
    • No d child, return
  • No d child, stop
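The sorted-letter trie and the subset lookup above can be sketched like this (the names are my own; each word is stored at the node reached by its sorted letters):

```java
import java.util.*;

public class SortedTrie {
    // Node of the sorted-letter trie: children keyed by letter; words are
    // stored at the node whose root path equals the word's sorted letters.
    static class Node {
        TreeMap<Character, Node> children = new TreeMap<>();
        List<String> words = new ArrayList<>();
    }

    static Node root = new Node();

    static void insert(String word) {
        char[] cs = word.toCharArray();
        Arrays.sort(cs); // insert the sorted letters as a path
        Node n = root;
        for (char c : cs) n = n.children.computeIfAbsent(c, k -> new Node());
        n.words.add(word);
    }

    // DFS over the sorted query letters: at each position either skip the
    // letter or follow the matching child, collecting words along the way.
    static void lookup(Node n, char[] letters, int i, List<String> out) {
        out.addAll(n.words);
        char prev = 0;
        for (; i < letters.length; i++) {
            if (letters[i] == prev) continue; // skip duplicate branches
            prev = letters[i];
            Node child = n.children.get(letters[i]);
            if (child != null) lookup(child, letters, i + 1, out);
        }
    }

    public static void main(String[] args) {
        for (String w : Arrays.asList("aba", "b", "ad", "da", "la", "ma"))
            insert(w);
        char[] letters = "adb".toCharArray();
        Arrays.sort(letters); // a, b, d
        List<String> out = new ArrayList<>();
        lookup(root, letters, 0, out);
        System.out.println(out); // [ad, da, b]
    }
}
```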


Source: https://stackoverflow.com/questions/16868941/find-the-longest-word-given-a-collection
