Cosine Similarity

Asked by 我在风中等你 on 2020-12-13 05:14

I calculated the tf/idf values of two documents. The following are the tf/idf values:

    1.txt: 0.0, 0.5
    2.txt: 0.0, 0.5

The documents are like:

3 Answers
  • 2020-12-13 05:44
    sim(a, b) = (a * b) / (|a| * |b|)

    a * b is the dot product of the two tf/idf vectors, and |a|, |b| are their Euclidean norms.

    Some details:

    import math

    def dot(a, b):
        # dot product of two equal-length vectors
        s = 0.0
        for i in range(len(a)):
            s += a[i] * b[i]
        return s

    def norm(a):
        # Euclidean norm (length) of a vector
        s = 0.0
        for i in range(len(a)):
            s += a[i] * a[i]
        return math.sqrt(s)

    def cossim(a, b):
        # cosine similarity: dot product divided by the product of the norms
        return dot(a, b) / (norm(a) * norm(b))
    

    Yes, to some extent a and b must have the same length. But a and b usually have a sparse representation: you only need to store the non-zero entries, which lets you compute the norm and dot product faster (see the sketch below).
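
    For illustration, here is a minimal sketch of that sparse approach (the dict-based representation and the term names are placeholders of my own, not from the answer), applied to the tf/idf vectors from the question:

    import math

    def sparse_cosine(a, b):
        # a and b map term -> tf/idf weight; only non-zero entries are stored
        dot = sum(w * b[t] for t, w in a.items() if t in b)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b)

    # the vectors from the question, with zero entries dropped
    # (the term names are hypothetical; the question only gives the weights)
    doc1 = {"term2": 0.5}   # 1.txt -> [0.0, 0.5]
    doc2 = {"term2": 0.5}   # 2.txt -> [0.0, 0.5]

    print(sparse_cosine(doc1, doc2))  # prints 1.0: the vectors point in the same direction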

  • 2020-12-13 05:52

    A simple Java implementation:

    import java.util.Map;
    import java.util.Set;
    import com.google.common.collect.Sets;   // Guava

    static double cosine_similarity(Map<String, Double> v1, Map<String, Double> v2) {
        // only terms present in both vectors contribute to the dot product
        Set<String> both = Sets.newHashSet(v1.keySet());
        both.retainAll(v2.keySet());
        double scalar = 0, norm1 = 0, norm2 = 0;
        for (String k : both) scalar += v1.get(k) * v2.get(k);
        for (String k : v1.keySet()) norm1 += v1.get(k) * v1.get(k);
        for (String k : v2.keySet()) norm2 += v2.get(k) * v2.get(k);
        return scalar / Math.sqrt(norm1 * norm2);
    }
    
  • 2020-12-13 05:58

    1) Calculate tf-idf (generally better than tf alone, but it completely depends on your data set and requirements).

    From Wikipedia (regarding idf):

    An inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

    2) No, it is not important that both documents have the same number of words.

    3) You can compute tf-idf or cosine similarity in any language nowadays by calling a machine-learning library function. I prefer Python.

    Python code to calculate tf-idf and cosine similarity (using scikit-learn 0.18.2):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    # example dataset
    from sklearn.datasets import fetch_20newsgroups

    # replace with your method to get data
    example_data = fetch_20newsgroups(subset='all').data

    max_features_for_tfidf = 10000
    is_idf = True

    vectorizer = TfidfVectorizer(max_df=0.5, max_features=max_features_for_tfidf,
                                 min_df=2, stop_words='english',
                                 use_idf=is_idf)

    X_Mat = vectorizer.fit_transform(example_data)

    # calculate cosine similarity between samples in X with samples in Y
    cosine_sim = cosine_similarity(X=X_Mat, Y=X_Mat)

    4) You might be interested in truncated Singular Value Decomposition (SVD), sketched below.
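
    As a rough illustration (not part of the original answer), truncated SVD can be applied to the tf-idf matrix X_Mat from the snippet above to reduce its dimensionality before computing cosine similarity; the n_components value is an arbitrary example:

    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    # project the tf-idf matrix into a lower-dimensional space (essentially LSA)
    svd = TruncatedSVD(n_components=100)   # 100 components chosen arbitrarily
    X_reduced = svd.fit_transform(X_Mat)

    # cosine similarity between documents in the reduced space
    cosine_sim_svd = cosine_similarity(X_reduced, X_reduced)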
