Normalizing by max value or by total value?

守給你的承諾、 submitted on 2019-12-05 12:12:25

Notation

Suppose you have two vectors A and B, with x as the normalization constant for A and y as the normalization constant for B. Since you are counting word occurrences, we can assume x > 0 and y > 0.

Cosine distance

For cosine distance, the normalization constants cancel out. The normalized vectors are A/x and B/y, so you get a factor of 1/(xy) in the numerator (from the dot product) and an identical factor of 1/(xy) in the denominator (from the product of the two norms). The two factors cancel, so the result is the same whichever normalization you pick.
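A quick numerical check of the cancellation, as a minimal sketch: the vectors `A` and `B` and the helper names `cosine_similarity` and `scale` are made up for illustration. Cosine distance is 1 minus this similarity, so the same invariance applies to it.

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|)
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def scale(v, c):
    # Divide every component by the normalization constant c (c > 0).
    return [p / c for p in v]

# Hypothetical word-count vectors.
A = [3, 1, 0, 2]
B = [1, 0, 4, 1]

raw = cosine_similarity(A, B)
by_max = cosine_similarity(scale(A, max(A)), scale(B, max(B)))
by_sum = cosine_similarity(scale(A, sum(A)), scale(B, sum(B)))

# All three agree up to floating-point rounding.
print(raw, by_max, by_sum)
```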

Euclidean distance

For Euclidean distance, the constants do not cancel. Take A and B to be 2-d vectors; the n-dimensional case is a straightforward extension. A' and B' are the normalized versions of A and B, respectively.
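A sketch of the expansion, under an assumed component notation that the Notation section does not spell out: A = (x_1, x_2), B = (y_1, y_2), A' = A/x and B' = B/y.

```latex
\begin{aligned}
\mathrm{dist}(A,B)^2
  &= (x_1 - y_1)^2 + (x_2 - y_2)^2 \\
  &= (x_1^2 + x_2^2) + (y_1^2 + y_2^2) - 2\,(x_1 y_1 + x_2 y_2), \\[4pt]
\mathrm{dist}(A',B')^2
  &= \left(\frac{x_1}{x} - \frac{y_1}{y}\right)^2
   + \left(\frac{x_2}{x} - \frac{y_2}{y}\right)^2 \\
  &= \frac{x_1^2 + x_2^2}{x^2} + \frac{y_1^2 + y_2^2}{y^2}
   - \frac{2\,(x_1 y_1 + x_2 y_2)}{x y}.
\end{aligned}
```

Unlike the cosine case, the three terms are scaled by three different factors (1/x^2, 1/y^2 and 1/(xy)), so nothing cancels.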

Comparing the unnormalized dist(A,B) with the normalized dist(A',B'), you can see that the normalization constant you choose (max or sum) determines the weights on x1^2+x2^2, y1^2+y2^2, and the interaction term. As a result, different normalization constants give you different distances.
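The same point can be checked numerically. This is a minimal sketch with made-up 2-d count vectors; `euclidean` and `scale` are illustrative helper names.

```python
import math

def euclidean(a, b):
    # Square root of the summed squared component differences.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def scale(v, c):
    # Divide every component by the normalization constant c (c > 0).
    return [p / c for p in v]

# Hypothetical 2-d word-count vectors.
A = [3, 1]
B = [1, 4]

d_raw = euclidean(A, B)
d_max = euclidean(scale(A, max(A)), scale(B, max(B)))  # normalize by max value
d_sum = euclidean(scale(A, sum(A)), scale(B, sum(B)))  # normalize by total

# The three distances differ, so the choice of constant matters here.
print(d_raw, d_max, d_sum)
```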

Feature Vector

If this is for information retrieval or topic extraction, have you tried TF-IDF? It may be a better measure than raw term counts.
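For reference, a minimal TF-IDF sketch in plain Python. It uses one common variant (term frequency as count divided by document length, inverse document frequency as log(N/df)); libraries often apply smoothed or otherwise different formulas, so treat this as an illustration. The function name `tf_idf` and the toy documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus, for illustration only.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document (here "the") gets weight 0, while terms specific to one document are weighted highest, which is the behaviour raw counts lack.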
