Finding the most frequent subtrees in a collection of (parse) trees

Posted by 前提是你 on 2019-12-01 12:12:02

To find the most frequent subtrees in the collection, create a compact form of each subtree, then iterate over every subtree and use a hash table to count occurrences of each compact form. 30 nodes is too big for a perfect hash: a 32-bit hash value would leave only about one bit per node, and you need that much just to indicate whether a node is a sibling or a child.

This problem isn't LCS - the most common sequence isn't related to the longest common subsequence. The most frequent subtree is simply the one that occurs most often.

The worst case should be O(N L^2) for N trees of L nodes each (assuming that testing equality of a subtree containing L nodes is O(L)).
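As a rough illustration of the counting approach described above, here is a minimal Python sketch. It makes several assumptions that are not in the original answer: a tree is represented as a (label, children) tuple, the compact form is a parenthesised string, only complete subtrees (a node together with all of its descendants, children in order) are counted, and labels contain no parentheses or spaces. The names count_subtrees, most_frequent_subtrees and min_nodes are made up for this example.

```python
from collections import Counter

def count_subtrees(tree, counts, min_nodes=1):
    """Build the compact form of `tree` and count every complete subtree in it.

    Assumed representation: tree = (label, children), children a tuple of trees.
    Returns (compact_form, number_of_nodes).
    """
    label, children = tree
    parts, size = [label], 1
    for child in children:
        child_form, child_size = count_subtrees(child, counts, min_nodes)
        parts.append(child_form)
        size += child_size
    form = "(" + " ".join(parts) + ")"
    if size >= min_nodes:  # skip tiny subtrees such as single leaves
        counts[form] += 1
    return form, size

def most_frequent_subtrees(trees, top=3, min_nodes=2):
    """Return the `top` most common compact forms with their counts."""
    counts = Counter()
    for tree in trees:
        count_subtrees(tree, counts, min_nodes)
    return counts.most_common(top)

# Hypothetical toy forest of parse trees.
np1 = ("NP", (("DT:the", ()), ("JJ:big", ()), ("NN:cat", ())))
np2 = ("NP", (("DT:the", ()), ("NN:dog", ())))
s1 = ("S", (np1, ("VP", (("VB:sat", ()),))))
s2 = ("S", (np1, ("VP", (("VB:ran", ()),))))

result = most_frequent_subtrees([s1, s2, np2], top=1)
print(result)
```

Without the min_nodes filter, single leaves would usually dominate the counts. Building the string for every node is what gives the quadratic factor in the tree size for deep trees.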

Although you say that performance isn't yet an issue, I think this is an NP-hard problem, so it may never be possible to make it fast. If I've understood correctly, you can consider this a variant of the longest common subsequence (LCS) problem: flatten your tree into a straight sequence like

(nounphrase)(DOWN)(article:the)(adjective:big)(adjective:black)(noun:cat)(UP)

Then your problem becomes LCS.
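The flattening step could look something like the Python sketch below. The (label, children) tuple representation and the function name flatten are assumptions made for this example; only the DOWN/UP token scheme comes from the answer.

```python
def flatten(tree):
    """Turn a tree into a flat token list, marking descent with DOWN and ascent with UP.

    Assumed representation: tree = (label, children), children a tuple of trees.
    """
    label, children = tree
    if not children:
        return [label]
    tokens = [label, "DOWN"]
    for child in children:
        tokens.extend(flatten(child))
    tokens.append("UP")
    return tokens

noun_phrase = ("nounphrase", (
    ("article:the", ()),
    ("adjective:big", ()),
    ("adjective:black", ()),
    ("noun:cat", ()),
))

tokens = flatten(noun_phrase)
print("".join("(" + t + ")" for t in tokens))
```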

Wikibooks has a Java implementation of LCS.
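For reference, a minimal Python sketch of the textbook dynamic-programming LCS over two token lists follows. This is not the Wikibooks code; the function name lcs and the two sample token sequences are made up for illustration.

```python
def lcs(a, b):
    """Return one longest common subsequence of the sequences a and b."""
    m, n = len(a), len(b)
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover the subsequence itself.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

# Two hypothetical flattened noun phrases.
first = ["nounphrase", "DOWN", "article:the", "adjective:big",
         "adjective:black", "noun:cat", "UP"]
second = ["nounphrase", "DOWN", "article:the", "adjective:black",
          "noun:dog", "UP"]

common = lcs(first, second)
print(common)
```

Note that the result is a common subsequence of tokens, which is not necessarily the flattened form of a well-formed subtree.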

This is a well-known problem in computer science, for which there are efficient solutions.

Here are some relevant references:

Kenji Abe, Shinji Kawasoe, Tatsuya Asai, Hiroki Arimura, Setsuo Arikawa, Optimized Substructure Discovery for Semi-structured Data, Proc. 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-2002), LNAI 2431, Springer-Verlag, 1-14, August 2002.

Mohammed J. Zaki, Efficiently Mining Frequent Trees in a Forest, 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 2002.

Or, if you just want fast code, go here: FREQT (transforming XML to S-expressions shouldn't give you too many problems, and is left as an exercise for the reader).

I found a tool called gSpan very useful in this case. It's available for free download at http://www.cs.ucsb.edu/~xyan/software/gSpan.htm . A C++ version with a MATLAB interface is at http://www.nowozin.net/sebastian/gboost/
