How to compute the Topological Overlap Measure [TOM] for a weighted adjacency matrix in Python?

萝らか妹 提交于 2020-06-25 02:41:21

问题


I'm trying to calculate the weighted topological overlap for an adjacency matrix but I cannot figure out how to do it correctly using numpy. The R function that does the correct implementation is from WGCNA (https://www.rdocumentation.org/packages/WGCNA/versions/1.67/topics/TOMsimilarity). The formula for computing this (I THINK) is detailed in equation 4 which I believe is correctly reproduced below.

Does anyone know how to implement this correctly so it reflects the WGCNA version?

Yes, I know about rpy2 but I'm trying to go lightweight on this if possible.

For starters, my diagonal is not 1 and the values have no consistent error from the original (e.g. not all off by x).

When I computed this in R, I used the following:

> library(WGCNA, quiet=TRUE)
> df_adj = read.csv("https://pastebin.com/raw/sbAZQsE6", row.names=1, header=TRUE, check.names=FALSE, sep="\t")
> df_tom = TOMsimilarity(as.matrix(df_adj), TOMType="unsigned", TOMDenom="min")
# ..connectivity..
# ..matrix multiplication (system BLAS)..
# ..normalization..
# ..done.
# I've uploaded it to this url: https://pastebin.com/raw/HT2gBaZC

I'm not sure where my code is incorrect. The source code for the R version is here but it's using C backend scripts? which is very difficult for me interpret.

Here is my implementation in Python :

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

def get_iris_data():
    iris = load_iris()
    # Iris dataset
    X = pd.DataFrame(iris.data,
                     index = [*map(lambda x:f"iris_{x}", range(150))],
                     columns = [*map(lambda x: x.split(" (cm)")[0].replace(" ","_"), iris.feature_names)])

    y = pd.Series(iris.target,
                           index = X.index,
                           name = "Species")
    return X, y

# Get data
X, y = get_iris_data()

# Create an adjacency network
# df_adj = np.abs(X.T.corr()) # I've uploaded this part to this url: https://pastebin.com/raw/sbAZQsE6
df_adj = pd.read_csv("https://pastebin.com/raw/sbAZQsE6", sep="\t", index_col=0)
A_adj = df_adj.values

# Correct TOM from WGCNA for the A_adj
# See above for code
# https://www.rdocumentation.org/packages/WGCNA/versions/1.67/topics/TOMsimilarity
df_tom__wgcna = pd.read_csv("https://pastebin.com/raw/HT2gBaZC", sep="\t", index_col=0)

# My attempt
A = A_adj.copy()
dimensions = A.shape
assert dimensions[0] == dimensions[1]
d = dimensions[0]

# np.fill_diagonal(A, 0)

# Equation (4) from http://dibernardo.tigem.it/files/papers/2008/zhangbin-statappsgeneticsmolbio.pdf
A_tom = np.zeros_like(A)
for i in range(d):
    a_iu = A[i]
    k_i = a_iu.sum()
    for j in range(i+1, d):
        a_ju = A[:,j]
        k_j = a_ju.sum()
        l_ij = np.dot(a_iu, a_ju)
        a_ij = A[i,j]
        numerator = l_ij + a_ij
        denominator = min(k_i, k_j) + 1 - a_ij
        w_ij = numerator/denominator
        A_tom[i,j] = w_ij
A_tom = (A_tom + A_tom.T)

There is a package called GTOM (https://github.com/benmaier/gtom) but it is not for weighted adjacencies. The author of GTOM also took a look at this problem (which a much more sophisticated/efficient NumPy implementation but it's still not producing the expected results).

Does anyone know how to reproduce the WGCNA implementation?

EDIT: 2019.06.20 I've adapted some of the code from @scleronomic and @benmaier with credits in the doc string. The function is available in soothsayer from v2016.06 and on. Hopefully this will allow people to use topological overlap in Python easier instead of only being able to use R.

https://github.com/jolespin/soothsayer/blob/master/soothsayer/networks/networks.py

import numpy as np
import soothsayer as sy
df_adj = sy.io.read_dataframe("https://pastebin.com/raw/sbAZQsE6")
df_tom = sy.networks.topological_overlap_measure(df_adj)
df_tom__wgcna = sy.io.read_dataframe("https://pastebin.com/raw/HT2gBaZC")
np.allclose(df_tom, df_tom__wgcna)
# True

回答1:


First let's look at the parts of the equation for the case of a binary adjacency matrix a_ij:

  • a_ij: indicates if node i is connected to node j
  • k_i: count of the neighbors of node i (connectivity)
  • l_ij: count of the common neighbors of node i and node j

so w_ij measures how many of the neighbors of the node with the lower connectivity are also neighbors of the other node (ie. w_ij measures "their relative inter-connectedness").

My guess is that they define the diagonal of A to be zero instead of one. With this assumption I can reproduce the values of WGCNA.

A[range(d), range(d)] = 0  # Assumption
L = A @ A  # Could be done smarter by using the symmetry
K = A.sum(axis=1)

A_tom = np.zeros_like(A)
for i in range(d):
    for j in range(i+1, d):  
        numerator = L[i, j] + A[i, j]
        denominator = min(K[i], K[j]) + 1 - A[i, j]
        A_tom[i, j] = numerator / denominator
    
A_tom += A_tom.T
A_tom[range(d), range(d)] = 1  # Set diagonal to 1 by default

A_tom__wgcna = np.array(pd.read_csv("https://pastebin.com/raw/HT2gBaZC", 
                        sep="\t", index_col=0))
print(np.allclose(A_tom, A_tom__wgcna))

An intuition why the diagonal of A should be zero instead of one can be seen for a simple example with a binary A:

 Graph      Case Zero    Case One
   B          A B C D      A B C D  
 /   \      A 0 1 1 1    A 1 1 1 1  
A-----D     B 1 0 0 1    B 1 1 0 1  
 \   /      C 1 0 0 1    C 1 0 1 1  
   C        D 1 1 1 0    D 1 1 1 1  

The given description of equation 4 explains:

Note that w_ij = 1 if the node with fewer connections satisfies two conditions:

  • (a) all of its neighbors are also neighbors of the other node and
  • (b) it is connected to the other node.

In contrast, w_ij = 0 if i and j are un-connected and the two nodes do not share any neighbors.

So the connection between A-D should fulfill this criterion and be `w_14=11.

  • Case Zero:
  • Case One:

What is still missing is that the diagonal values don't match. I set them to one by default. What is the inter-connectedness of a node with itself? A value different than one (or zero, depending on definition) doesn't make sense to me. Neither Case Zero nor Case One result in w_ii=1 in the simple example. In Case Zero it would be necessary that k_i+1 == l_ii, and in Case One it would be necessary that k_i == l_ii+1, which both seems wrong to me.

So to summarize I would set the diagonal of the adjacency matrix to zero, use the given equation and set the diagonal of the result to one by default.



来源:https://stackoverflow.com/questions/56574729/how-to-compute-the-topological-overlap-measure-tom-for-a-weighted-adjacency-ma

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!