Algorithm to decide cut-off for collapsing this tree?

Submitted by 允我心安 on 2019-12-02 23:21:39

I think I'd need to know more before I can give specific suggestions. But maybe this will help. I'm assuming that each terminal node is a sequence, and each internal node is a PSSM.

The calculation for X is application specific. For example, the X you get if you want to collapse ultraparalogs isn't the same as the X you get when you want to collapse all homologs.

Since genes are continuously being created via duplication and speciation, there isn't a single value for X that will discriminate sequences by evolutionary relationship. Therefore, I don't expect you'll find a satisfying proxy for determining evolutionary relationships between sequences by looking only at cluster statistics.

A more rigorous method would build a gene tree from the gene of each regulatory motif and reconcile it with a species tree. Software exists for this, along with additional heuristics for ortholog / inparalog identification.

If you do this, the internal nodes of your tree will be decorated with the inferred evolutionary event (e.g., duplication, speciation). Then you can walk up the tree collapsing nodes for clades you don't care about.
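As a rough illustration of that walk, here is a minimal sketch. The `Node` class, the event labels ("leaf", "duplication", "speciation") and the rule "collapse a duplication clade once all of its children are leaves" are assumptions made for the example, not the output format of any particular reconciliation tool.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical node type: event is "leaf", "duplication" or "speciation".
    name: str
    event: str = "leaf"
    children: list = field(default_factory=list)


def collapse(node, unwanted=("duplication",)):
    """Post-order walk that merges clades you don't care about.

    A clade is replaced by a single leaf when its root event is in
    `unwanted` and every child has already been reduced to a leaf.
    """
    if node.event == "leaf":
        return node
    kids = [collapse(child, unwanted) for child in node.children]
    if node.event in unwanted and all(k.event == "leaf" for k in kids):
        # Keep the merged member names so the collapsed leaf stays traceable.
        return Node(name="|".join(k.name for k in kids))
    return Node(node.name, node.event, kids)


# Toy tree: a speciation whose left child is a recent duplication.
tree = Node("root", "speciation", [
    Node("dupA", "duplication", [Node("geneA1"), Node("geneA2")]),
    Node("geneB"),
])
collapsed = collapse(tree)
```

Changing `unwanted` (or the condition inside `collapse`) is where the application-specific choice of X would go.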

You could try to use something similar to tree reconciliation, as @Jeff mentioned. But standard tree reconciliation will actually fail here.

Reconciliation first adds branches that represent "losses" of evolutionary characters throughout the target tree, then marks the nodes at which "duplications" of evolutionary characters have occurred. The weighted sum of losses and duplications provides a cost function to optimise.
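For concreteness, the cost function described here can be written as a one-liner. The function name and the default weights of 1.0 are assumptions for illustration; real reconciliation tools choose and expose their weights differently.

```python
def reconciliation_cost(duplications, losses, dup_weight=1.0, loss_weight=1.0):
    """Weighted sum of inferred duplications and losses (weights are illustrative)."""
    return dup_weight * duplications + loss_weight * losses


# Down-weighting losses relative to duplications.
cost = reconciliation_cost(duplications=2, losses=3, dup_weight=2.0, loss_weight=0.5)
```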

But in your case, the problem you want to solve is "break this super-tree into appropriately sized, orthologous sub-trees". This means you don't really want to score losses as much as you would duplications. You want a way to score the tree such that it reveals how many orthologous sub-trees are merged into your super-tree. Thus you can try this scoring approach:

  1. Take a super-tree, count the number of duplicate species, S1.
  2. Collapse all terminal leaves that are paralogues and count the new number of duplicate species, S2.
  3. The difference between S1 and S2 reveals approximately how many sub-trees you have in the super-tree.
  4. To correct for any bias caused by differently sized super-trees, divide by the number of unique species represented in the super-tree, N.

If we call this score the "sub-tree factor" then it equates to:

(S1 - S2) / N

Inferences:

  • If S1 - S2 = S1 then it means your super-tree has approximately one true sub-tree within it, and all multiple species occurrences were just due to recent paralogues.

  • If S1 - S2 = 0 then it means your super-tree has approximately S1 true sub-trees within it.
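The scoring steps above can be sketched as follows. Several things here are assumptions, since the answer does not pin them down: the tree is a nested tuple whose leaves are species names, "number of duplicate species" is read as the number of leaf occurrences beyond the first for each species, and "collapsing paralogues" is read as replacing any clade whose leaves all come from one species with a single leaf.

```python
from collections import Counter


def leaves(tree):
    """Return the species labels of all leaves, left to right."""
    if isinstance(tree, str):
        return [tree]
    out = []
    for child in tree:
        out.extend(leaves(child))
    return out


def collapse_recent_paralogs(tree):
    """Replace every single-species clade with one leaf of that species."""
    if isinstance(tree, str):
        return tree
    species = set(leaves(tree))
    if len(species) == 1:
        return next(iter(species))
    return tuple(collapse_recent_paralogs(child) for child in tree)


def duplicate_count(tree):
    """Count leaf occurrences beyond the first for each species (assumed reading of S)."""
    counts = Counter(leaves(tree))
    return sum(n - 1 for n in counts.values())


def subtree_factor(tree):
    """Return (S1, S2, (S1 - S2) / N) for a super-tree."""
    s1 = duplicate_count(tree)
    s2 = duplicate_count(collapse_recent_paralogs(tree))
    n = len(set(leaves(tree)))
    return s1, s2, (s1 - s2) / n


# Recent paralogues only: the duplicate disappears after collapsing.
recent = ((("human", "human"), "mouse"), "fly")
# An old duplication: two orthologous sub-trees merged into one super-tree.
ancient = (("human", "mouse"), ("human", "mouse"))

recent_score = subtree_factor(recent)
ancient_score = subtree_factor(ancient)
```

In the `recent` tree S1 - S2 equals S1, matching the first inference (about one true sub-tree); in the `ancient` tree S1 - S2 is 0, matching the second.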
