How to find all subsets of a multiset that are in a given set?

问题

Say I have a set D of multisets:

D = {
  {d, g, o},
  {a, e, t, t},
  {a, m, t},
}

Given a multiset M, like

M = {a, a, m, t}

I would like an algorithm f to give me all elements of D that are subsets (or more precisely, "submultisets") of M:

f = {{a, m, t}}

If we do only one such query, scanning over all elements of D (in O(#D) time) is clearly optimal. But if we want to answer many such queries for the same D and different M, we might be able to make it faster by preprocessing D into some smarter data structure.

We could toss all of D into a hashtable and iterate over all possible subsets of M, looking each up in the hashtable, but that's O(2^#M). Fine for small M, not so fine for medium to large M.

Is it possible to do this in polynomial time in #M? Or maybe to reduce a known NP-complete problem to this one, to prove that it's impossible to be fast?

Edit: I just realized that in the worst case, we need to output all of D, so #D is still going to appear in the time complexity. So let's assume that the size of the output is bounded by some constant.

回答1:

Here is a quick implementation of TernarySearchTree (TST) which can help in your problem. A number of years ago, I was inspired by an article in DrDobbs. You can read more about it at http://www.drdobbs.com/database/ternary-search-trees/184410528. It provides some background about TST and performance analysis.

In your problem description example, D would be your dictionary containing "dgo","aett" and "amt" keys. The values are identical to the keys.

M is your search string which basically says "Give me all the values in the dictionary with keys containing a subset or all of these alphabets". The order of the characters are not important. The character '.' is used as wildcard in the search.

For any given M, this algorithm and data structure does not require you to look at all elements of D. So in that respect it will be fast. I have also done some tests on the number of nodes visited and most of the time, the number of nodes visited is just a small fraction of the total nodes in the dictionary even for keys that are not found.

This algorithm also optionally allows you to enter the minimum and maximum length which limits the keys returned by the dictionary.

Sorry for the lengthy code but it is complete for you to be able to test.

import java.util.ArrayList;
import java.io.*;

public class TSTTree<T>
{
    private TSTNode<T> root;
    private int size = 0;
    private int totalNodes = 0;

    public int getSize() { return size; }

    public int getTotalNodes() { return totalNodes; }

    public TSTTree()
    {
    }

    public TSTNode<T> getRoot() { return root; }

    public void insert(String key, T value)
    {
        if(key==null || key.length()==0) return;

        char[] keyArray = key.toCharArray();

        if(root==null) root = new TSTNode<T>(keyArray[0]);
        TSTNode<T> currentNode = root;
        TSTNode<T> parentNode = null;

        int d = 0;
        int i = 0;

        while(currentNode!=null)
        {
            parentNode = currentNode;
            d = keyArray[i] - currentNode.getSplitChar();
            if(d==0)
            {
                if((++i) == keyArray.length) // Found the key
                {
                    if(currentNode.getValue()!=null)
                        System.out.println(currentNode.getValue() + " replaced with " + value);
                    else
                        size++;
                    currentNode.setValue(value);        // Replace value at Node
                    return;
                }
                else
                    currentNode = currentNode.getEqChild();
            }
            else if(d<0)
                currentNode = currentNode.getLoChild();
            else
                currentNode = currentNode.getHiChild();
        }

        currentNode = new TSTNode<T>(keyArray[i++]);
        totalNodes++;
        if(d==0)
            parentNode.setEqChild(currentNode);
        else if(d<0)
            parentNode.setLoChild(currentNode);
        else
            parentNode.setHiChild(currentNode);

        for(;i<keyArray.length;i++)
        {
            TSTNode<T> tempNode = new TSTNode<T>(keyArray[i]);
            totalNodes++;
            currentNode.setEqChild(tempNode);
            currentNode = tempNode;
        }

        currentNode.setValue(value);        // Insert value at Node
        size++;
    }

    public ArrayList<T> find(String charsToSearch) {
        return find(charsToSearch,1,charsToSearch.length());
    }

    // Return all values with keys between minLen and maxLen containing "charsToSearch".
    public ArrayList<T> find(String charsToSearch, int minLen, int maxLen) {
        ArrayList<T> list = new ArrayList<T>();
        char[] charArray = charsToSearch.toCharArray();
        int[] charFreq = new int[256];
        for(int i=0;i<charArray.length;i++) charFreq[charArray[i]]++;
        maxLen = charArray.length>maxLen ? maxLen : charArray.length;
        pmsearch(root,charFreq,minLen,maxLen,1, list);
        return list;
    }

    public void pmsearch(TSTNode<T> node, int[] charFreq, int minLen, int maxLen, int depth, ArrayList<T> list) {
        if(node==null) return;

        char c = node.getSplitChar();
        if(isSmaller(charFreq,c))
            pmsearch(node.getLoChild(),charFreq,minLen,maxLen,depth,list);
        if(charFreq[c]>0) {
            if(depth>=minLen && node.getValue()!=null) list.add(node.getValue());
            if(depth<maxLen) {
                charFreq[c]--;
                pmsearch(node.getEqChild(),charFreq,minLen,maxLen,depth+1,list);
                charFreq[c]++;
            }
        }
        else if(charFreq['.']>0) { // Wildcard
            if(depth>=minLen && node.getValue()!=null) list.add(node.getValue());
            if(depth<maxLen) {
                charFreq['.']--;
                pmsearch(node.getEqChild(),charFreq,minLen,maxLen,depth+1,list);
                charFreq['.']++;
            }
        }            
        if(isGreater(charFreq,c))
            pmsearch(node.getHiChild(),charFreq,minLen,maxLen,depth,list);
    }

    private boolean isGreater(int[] charFreq, char c) {
        if(charFreq['.']>0) return true;

        boolean retval = false;
        for(int i=c+1;i<charFreq.length;i++) {
            if(charFreq[i]>0) {
                retval = true;
                break;
            }
        }
        return retval;
    }

    private boolean isSmaller(int[] charFreq, char c) {
        if(charFreq['.']>0) return true;

        boolean retval = false;
        for(int i=c-1;i>-1;i--) {
            if(charFreq[i]>0) {
                retval = true;
                break;
            }
        }
        return retval;
    }
}

Below is a small test program. The test program just inserts the 4 key/value pairs in your example in the exact order. If you have a D with a lot of elements, then it would be best to sort it first and build the dictionary in a tournament fashion (ie. insert middle element, then recursively populate left half and right half). This will ensure the tree is balanced.

import org.junit.*;
import org.junit.runner.*;
import java.io.*;
import java.util.*;
import org.junit.runner.notification.Failure;

public class MyTest
{
    static TSTTree<String> dictionary = new TSTTree<String>();

    @BeforeClass
    public static void initialize() {
        dictionary.insert("dgo","dgo");
        dictionary.insert("aett","aett");
        dictionary.insert("amt","amt");
    }

    @Test
    public void testMethod() {
        System.out.println("testMethod");
        ArrayList<String> result = dictionary.find("aamt");
        System.out.println("Results: ");
        for(Iterator it=result.iterator();it.hasNext();) {
            System.out.println(it.next());
        }
    }

    @Test
    // Test with a wildcard which finds "dgo" value
    public void testMethod1() {
        System.out.println("testMethod1");
        ArrayList<String> result = dictionary.find("aamtdo.");
        System.out.println("Results: ");
        for(Iterator it=result.iterator();it.hasNext();) {
            System.out.println(it.next());
        }
    }

    public static void main(String[] args) {
        Result result = JUnitCore.runClasses(MyTest.class);
        for (Failure failure : result.getFailures()) {
        System.out.println(failure.toString());
        }
    }
}

来源：https://stackoverflow.com/questions/22223651/how-to-find-all-subsets-of-a-multiset-that-are-in-a-given-set

标签

algorithm

language-agnostic

set

subset

multiset