Tanimoto coefficient distance measure

Submitted by 北战南征 on 2019-12-23 11:39:48

Question


Can two objects have identical cosine and Tanimoto coefficient distance measures? Here the two measures are:

Tanimoto distance measure, d(x,y) = x.y / ((|x|*|x|) + (|y|*|y|) - x.y)

and

cosine measure, d(x,y) = x.y / (|x| * |y|)

Answer 1:


The Tanimoto similarity coefficient (which is not a true distance measure) is defined by

d(x,y) = x.y / ((|x|*|x|) + (|y|*|y|)- x.y)

for bit vectors x and y.

Now compare that with the cosine similarity coefficient,

 d(x,y) = x.y / (|x| * |y|)

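The numerators are the same and only the denominators differ. If x.y is zero, both coefficients are zero, so the Tanimoto and cosine similarity coefficients are the same (as long as neither vector is all zeros, which would make the denominators zero). They also agree, both equal to 1, when x = y.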
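
For anyone who wants the full condition, here is a short worked derivation. It is a sketch that assumes x and y are nonzero real vectors; the names T and C for the Tanimoto and cosine coefficients are introduced here only for illustration. It suggests the two coefficients agree exactly when x.y = 0 or x = y.

```latex
\begin{align*}
T(x,y) &= \frac{x \cdot y}{|x|^2 + |y|^2 - x \cdot y}, \qquad
C(x,y) = \frac{x \cdot y}{|x|\,|y|} \\
T = C \;&\Longleftrightarrow\; x \cdot y = 0
  \quad\text{or}\quad |x|^2 + |y|^2 - x \cdot y = |x|\,|y| \\
\text{second case:}\quad
x \cdot y &= |x|^2 + |y|^2 - |x|\,|y| \;\le\; |x|\,|y|
  \quad\text{(Cauchy-Schwarz)} \\
&\Longrightarrow\; (|x| - |y|)^2 \le 0 \;\Longrightarrow\; |x| = |y| \\
&\Longrightarrow\; x \cdot y = |x|^2 = |x|\,|y| \;\Longrightarrow\; x = y
\end{align*}
```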

Geometrically, x.y is zero if and only if x and y are perpendicular.

Since x and y are bit vectors (i.e. each component can only be 0 or 1), x.y equalling zero means

x1*y1 + x2*y2 + ... + xn*yn = 0

If any xi*yi = 1*1 = 1, then the whole sum would be positive, since no term can be negative. For the whole sum to be zero, no term xi*yi can equal 1, so they must all equal 0:

x1*y1 = 0
x2*y2 = 0
...
xn*yn = 0

In other words, if xi is 1, then yi must be 0, and if yi is 1, then xi must be 0 (both may be 0 at the same position).

So there are tons of examples where the Tanimoto similarity is equal to the cosine similarity:

x = (0,1,0,1)
y = (1,0,0,0)

for instance.
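
To check this by hand, here is a minimal Python sketch. The function names `dot`, `tanimoto`, and `cosine` are made up for illustration, and it assumes neither vector is all zeros. It evaluates both coefficients on the example above and on an overlapping pair where they differ.

```python
import math

def dot(x, y):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def tanimoto(x, y):
    """Tanimoto coefficient: x.y / (|x|^2 + |y|^2 - x.y)."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

def cosine(x, y):
    """Cosine similarity: x.y / (|x| * |y|)."""
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

# The example above: no position where both vectors have a 1.
x = (0, 1, 0, 1)
y = (1, 0, 0, 0)
print(tanimoto(x, y), cosine(x, y))  # both coefficients are 0.0

# An overlapping pair: the two coefficients no longer agree.
u = (1, 1, 0, 0)
v = (1, 0, 1, 0)
print(tanimoto(u, v), cosine(u, v))
```

For the second pair the Tanimoto value is 1/3 while the cosine value is 1/2, so a shared 1 is enough to separate the two measures unless the vectors are identical.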




Answer 2:


Even though the general form of the Tanimoto distance was presented, you must always remember that, computationally, there is a binary form and a continuous form.

The binary form is:

d(x,y) = n(X ∩ Y) / [ n(X) + n(Y) - n(X ∩ Y) ]

while the continuous form is:

d(x,y) = X.Y / (||X||^2 + ||Y||^2 - X.Y)

The difference is clear. If a coder is working for you, you must instruct them that n(X ∩ Y), n(X), and n(Y) only involve counting the number of ones in the vectors. For the continuous form, ||X|| is the length of the vector X from the origin (also called the norm), i.e. the square root of (X1^2 + X2^2 + ... + Xp^2). Because the formula uses ||X||^2 and ||Y||^2, the square root cancels and only the sums of squares are needed, so taking square roots is unnecessary in either form and would be wasteful for big data mining. For 0/1 vectors the sum of squares is just the count of ones, which makes the binary form the continuous form restricted to bit vectors.
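
The two forms can be sketched in Python as below. This is an illustration, not a reference implementation: the function names are invented, and the continuous version assumes the squared norms used in Answer 1's formula, x.y / ((|x|*|x|) + (|y|*|y|) - x.y), so no square root appears.

```python
def tanimoto_binary(x, y):
    """Binary form: only counts ones. x and y are sequences of 0/1."""
    both = sum(1 for a, b in zip(x, y) if a and b)  # n(X ∩ Y)
    nx = sum(1 for a in x if a)                     # n(X)
    ny = sum(1 for b in y if b)                     # n(Y)
    return both / (nx + ny - both)

def tanimoto_continuous(x, y):
    """Continuous form with squared norms (assumed, see lead-in)."""
    xy = sum(a * b for a, b in zip(x, y))  # X.Y
    xx = sum(a * a for a in x)             # ||X||^2, sum of squares
    yy = sum(b * b for b in y)             # ||Y||^2, sum of squares
    return xy / (xx + yy - xy)

# On bit vectors the two forms give the same value.
bx = (1, 1, 0, 1)
by = (1, 0, 0, 1)
print(tanimoto_binary(bx, by), tanimoto_continuous(bx, by))

# Real-valued vectors need the continuous form.
print(tanimoto_continuous((3.0, 4.0), (4.0, 3.0)))
```

Neither function guards against two all-zero vectors, where the denominator is zero; a caller would have to decide what value to return in that case.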

In summary, always remember that for Tanimoto distance, there are two types: binary and continuous.



Source: https://stackoverflow.com/questions/26876105/tanimoto-coefficient-distance-measure
