Hashing a Tree Structure

渐次进展 2020-11-28 04:00

I've just come across a scenario in my project where I need to compare different tree objects for equality with already known instances, and have considered that some so

11 Answers
  • 2020-11-28 04:11
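    using System.Collections.Generic;
    using System.Linq;

    class TreeNode
    {
      public static int QualityAgainstPerformance = 3; // tune this for your needs
      public static int PositionMarkConstant = 23498735; // just anything
      public object TargetObject; // the subject of this TreeNode, which contributes its hash code
      public TreeNode Parent; // null for the root
      public List<TreeNode> Children = new List<TreeNode>();

      // This node followed by all of its descendants, depth first.
      IEnumerable<TreeNode> GetChildParticipants()
      {
       yield return this;

       foreach (var child in Children)
       {
        // The recursive call yields the child itself first, then its descendants.
        foreach (var descendant in child.GetChildParticipants())
         yield return descendant;
       }
      }

      // The ancestors of this node, nearest first; empty for the root.
      IEnumerable<TreeNode> GetParentParticipants()
      {
       for (TreeNode parent = Parent; parent != null; parent = parent.Parent)
        yield return parent;
      }

      // Index of this node among its siblings; 0 for the root.
      int GetPositionInParent()
      {
       return Parent == null ? 0 : Parent.Children.IndexOf(this);
      }

      // Placeholder order-sensitive combiner; the real one is left unspecified here.
      static int AddToMix(int mix, int value)
      {
       return unchecked(mix * 31 + value);
      }

      public override int GetHashCode()
      {
       int computed = 0;
       var nodesToCombine =
        (Parent ?? this).GetChildParticipants()
         .Take(QualityAgainstPerformance / 2)
        .Concat(GetParentParticipants().Take(QualityAgainstPerformance / 2));

       foreach (var node in nodesToCombine)
       {
        if (ReferenceEquals(node, this))
          computed = AddToMix(computed, PositionMarkConstant);
        computed = AddToMix(computed, node.GetPositionInParent());
        computed = AddToMix(computed, node.TargetObject?.GetHashCode() ?? 0);
       }
       return computed;
      }
    }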
    

    AddToMix is a function that combines two hash codes in such a way that the order of the inputs matters. I don't know what the best choice is, but you can figure it out: some bit shifting, rounding, and so on.

    The idea is that you hash some neighbourhood of the node (its ancestors and nearby descendants), and how much of it you include depends on the quality you want to achieve.
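
    One possible shape for such a combiner, as a minimal Python sketch. The name `add_to_mix`, the helper `combine`, and the multiplier 31 are illustrative assumptions, not something the answer specifies. Multiplying the running value before adding the next one is what makes the order of the inputs matter.

```python
MASK32 = 0xFFFFFFFF  # keep results in 32 bits, roughly like a wrapping C# int


def add_to_mix(mix: int, value: int) -> int:
    """Combine a running hash with one more value, order-sensitively."""
    # Multiply-then-add: values mixed in earlier are multiplied more often
    # than values mixed in later, so swapping two inputs changes the result.
    return (mix * 31 + value) & MASK32


def combine(values) -> int:
    """Fold a sequence of hash codes into one (hypothetical helper)."""
    mix = 0
    for value in values:
        mix = add_to_mix(mix, value)
    return mix


forward = combine([1, 2, 3])
backward = combine([3, 2, 1])
```

    Here `forward` and `backward` differ even though they mix the same three values, which is the property the answer asks for.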

  • 2020-11-28 04:12

    Any time you are working with trees, recursion should come to mind:

    public override int GetHashCode() {
        int hash = 5381;
        // node.GetHashCode() is assumed to hash only that node's own data.
        foreach(var node in this.BreadthFirstTraversal()) {
            hash = unchecked(33 * hash + node.GetHashCode());
        }
        return hash;
    }
    

    The hash function should depend on the hash code of every node within the tree as well as its position.

    Check. We are explicitly using node.GetHashCode() in the computation of the tree's hash code. Further, because of the nature of the algorithm, a node's position plays a role in the tree's ultimate hash code.

    Reordering the children of a node should distinctly change the resulting hash code.

    Check. They will be visited in a different order in the breadth-first traversal, leading to a different hash code. (Note that if there are two children with the same hash code, you will end up with the same hash code upon swapping the order of those children.)

    Reflecting any part of the tree should distinctly change the resulting hash code

    Check. Again the nodes would be visited in a different order leading to a different hash code. (Note that there are circumstances where the reflection could lead to the same hash code if every node is reflected into a node with the same hash code.)
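
    The behaviour described above can be tried out with a small Python sketch. The `Node` class, its `value_hash` field (a stand-in for a node's own hash code), and the 32-bit mask are assumptions made for illustration; only the constants 5381 and 33 and the breadth-first order come from the answer.

```python
from collections import deque

MASK32 = 0xFFFFFFFF  # emulate 32-bit wraparound


class Node:
    def __init__(self, value_hash, children=()):
        self.value_hash = value_hash  # stand-in for the node's own hash code
        self.children = list(children)


def tree_hash(root):
    """Fold every node's hash into one value in breadth-first order."""
    h = 5381
    queue = deque([root])
    while queue:
        node = queue.popleft()
        h = (33 * h + node.value_hash) & MASK32
        queue.extend(node.children)
    return h


original = tree_hash(Node(1, [Node(2), Node(3)]))
swapped = tree_hash(Node(1, [Node(3), Node(2)]))   # children reordered
twins = tree_hash(Node(1, [Node(2), Node(2)]))     # equal-hash children
chain = tree_hash(Node(1, [Node(2, [Node(3)])]))   # same visit order, other shape
```

    `original` and `swapped` differ, while swapping the two equal-hash children in `twins` cannot change anything. Note that `chain` equals `original`: a fold over the visit order alone does not distinguish two shapes that are visited in the same order, so mixing in a depth or child count may be worth considering.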

  • 2020-11-28 04:16

    If I were to do this, I'd probably do something like the following:

    For each leaf node, compute the concatenation of 0 and the hash of the node data.

    For each internal node, compute the concatenation of 1 and the hash of any local data (NB: may not be applicable) and the hash of the children from left to right.

    This will lead to a cascade up the tree every time you change anything, but that MAY be low enough of an overhead to be worthwhile. If changes are relatively infrequent compared to the number of times the hash is needed, it may even make sense to go for a cryptographically secure hash.

    Edit1: There is also the possibility of adding a "hash valid" flag to each node and simply propagating "false" (or a "hash invalid" flag and propagating "true") up the tree on a node change. That way, it may be possible to avoid a complete recalculation when the tree hash is needed, and to avoid multiple hash calculations whose results are never used, at the risk of a slightly less predictable time to get a hash when needed.

    Edit3: The hash code suggested by Noldorin in the question looks like it would have a chance of collisions if the result of GetHashCode can ever be 0. Essentially, there is no way of distinguishing a tree composed of a single node with "symbol hash" 30 and "value hash" 25 from a two-node tree where the root has a "symbol hash" of 0 and a "value hash" of 30 and the child node has a total hash of 25. The examples are entirely invented; I don't know what the expected hash ranges are, so I can only comment on what I see in the presented code.

    Using 31 as the multiplicative constant is good, in that it will cause any overflow to happen on a non-bit boundary, although I am thinking that, with sufficient children and possibly adversarial content in the tree, the hash contribution from items hashed early MAY be dominated by later hashed items.

    However, if the hash performs decently on expected data, it looks as if it will do the job. It's certainly faster than using a cryptographic hash (as done in the example code listed below).

    Edit2: As for specific algorithms and minimum data structure needed, something like the following (Python, translating to any other language should be relatively easy).

    #!/usr/bin/env python

    import hashlib  # standard-library SHA-1, standing in for Crypto.Hash.SHA


    class Node:
        def __init__(self, parent=None, contents="", children=None):
            self.valid = False
            self.hash = None
            self.parent = parent
            self.contents = contents
            # Avoid sharing one mutable default list between all nodes.
            self.children = children if children is not None else []

        def append_child(self, child):
            child.parent = self
            self.children.append(child)

            self.invalidate()

        def invalidate(self):
            # Mark this node and all of its ancestors as needing a rehash.
            self.valid = False
            if self.parent:
                self.parent.invalidate()

        def gethash(self):
            if self.valid:
                return self.hash

            digester = hashlib.sha1()

            digester.update(self.contents.encode("utf-8"))

            if self.children:
                for child in self.children:
                    digester.update(child.gethash().encode("ascii"))
                self.hash = "1" + digester.hexdigest()
            else:
                self.hash = "0" + digester.hexdigest()

            # Cache the result until the next invalidate().
            self.valid = True
            return self.hash

        def setcontents(self, contents):
            self.contents = contents
            self.invalidate()
    
  • 2020-11-28 04:17

    The usual technique of hashing any sequence is combining the values (or hashes thereof) of its elements in some mathematical way. I don't think a tree would be any different in this respect.

    For example, here is the hash function for tuples in Python (taken from Objects/tupleobject.c in the source of Python 2.6):

    static long
    tuplehash(PyTupleObject *v)
    {
        register long x, y;
        register Py_ssize_t len = Py_SIZE(v);
        register PyObject **p;
        long mult = 1000003L;
        x = 0x345678L;
        p = v->ob_item;
        while (--len >= 0) {
            y = PyObject_Hash(*p++);
            if (y == -1)
                return -1;
            x = (x ^ y) * mult;
            /* the cast might truncate len; that doesn't change hash stability */
            mult += (long)(82520L + len + len);
        }
        x += 97531L;
        if (x == -1)
            x = -2;
        return x;
    }
    

    It's a relatively complex combination with constants experimentally chosen for best results for tuples of typical lengths. What I'm trying to show with this code snippet is that the issue is very complex and very heuristic, and the quality of the results probably depends on the more specific aspects of your data, i.e. domain knowledge may help you reach better results. However, for good-enough results you shouldn't need to look too far. I would guess that taking this algorithm and combining all the nodes of the tree instead of all the tuple elements, plus bringing their position into play, will give you a pretty good algorithm.

    One option for taking the position into account is to use the node's position in an in-order walk of the tree.
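
    A rough Python sketch of that idea, under several assumptions: trees are binary and written as `(value, left, right)` tuples, arithmetic wraps at 64 bits, and the `-1` special cases of the C code are left out. The constants are the ones from `tuplehash` above; the names `in_order` and `tree_hash` are made up for the example.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # emulate a wrapping 64-bit integer


def in_order(node):
    """Yield the values of a binary tree given as (value, left, right) tuples."""
    if node is None:
        return
    value, left, right = node
    yield from in_order(left)
    yield value
    yield from in_order(right)


def tree_hash(root):
    """Combine all nodes the way tuplehash combines tuple elements."""
    values = list(in_order(root))
    remaining = len(values)
    x = 0x345678
    mult = 1000003
    for position, value in enumerate(values):
        remaining -= 1
        # Hash the value together with its in-order position.
        y = hash((position, value)) & MASK64
        x = ((x ^ y) * mult) & MASK64
        mult += 82520 + remaining + remaining
    return (x + 97531) & MASK64


balanced = (2, (1, None, None), (3, None, None))
mirrored = (2, (3, None, None), (1, None, None))
left_leaning = (2, (1, None, None), None)
right_leaning = (1, None, (2, None, None))
```

    With integer values, `tree_hash(balanced)` and `tree_hash(mirrored)` come out different. However, `left_leaning` and `right_leaning` produce the same in-order sequence and therefore the same hash, so in-order position alone does not capture the shape of the tree; mixing in the depth of each node would be one way to address that.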

  • 2020-11-28 04:20

    Writing your own hash function is almost always a bug, because you basically need a degree in mathematics to do it well. Hash functions are incredibly nonintuitive and have highly unpredictable collision characteristics.

    Don't try directly combining hashcodes for Child nodes -- this will magnify any problems in the underlying hash functions. Instead, concatenate the raw bytes from each node in order, and feed this as a byte stream to a tried-and-true hash function. All the cryptographic hash functions can accept a byte stream. If the tree is small, you may want to just create a byte array and hash it in one operation.
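
    A minimal Python sketch of this approach, using SHA-256 from the standard library as the tried-and-true hash. The `Node` class and the function names are assumptions for illustration. The length prefix and child count written before each node are also an addition not mentioned above; they keep the byte stream unambiguous, so that two differently shaped trees cannot serialise to the same bytes.

```python
import hashlib


class Node:
    def __init__(self, data: bytes, children=()):
        self.data = data
        self.children = list(children)


def feed(node, digester):
    """Write one node, then its children in order, into the running digest."""
    digester.update(len(node.data).to_bytes(8, "big"))      # length prefix
    digester.update(node.data)                              # raw node bytes
    digester.update(len(node.children).to_bytes(8, "big"))  # child count
    for child in node.children:
        feed(child, digester)


def tree_digest(root) -> str:
    """Hash the whole tree as a single byte stream."""
    digester = hashlib.sha256()
    feed(root, digester)
    return digester.hexdigest()


siblings = Node(b"a", [Node(b"b"), Node(b"c")])
reordered = Node(b"a", [Node(b"c"), Node(b"b")])
chain = Node(b"a", [Node(b"b", [Node(b"c")])])
```

    Calling `tree_digest` on `siblings`, `reordered`, and `chain` gives three different digests, while rebuilding an identical tree gives the same digest again.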

  • 2020-11-28 04:24

    A simple enumeration (in any deterministic order), together with a hash function that depends on when the node is visited, should work.

    import java.util.ArrayList;

    // Node is assumed to provide hash() for its own data and children() as a collection.
    int hash(Node root) {
      ArrayList<Node> worklist = new ArrayList<Node>();
      worklist.add(root);
      int h = 0;
      int n = 0;
      while (!worklist.isEmpty()) {
        // Taking from the end makes the worklist a stack (depth-first order).
        Node x = worklist.remove(worklist.size() - 1);
        worklist.addAll(x.children());
        h ^= place_hash(x.hash(), n);
        n++;
      }
      return h;
    }

    // Mixes a node's hash with the step at which it was visited.
    int place_hash(int hash, int place) {
      return (Integer.toString(hash) + "_" + Integer.toString(place)).hashCode();
    }
    