Cosine Similarity

Asked by 我在风中等你 on 2020-12-13 05:14

I calculated the tf/idf values of two documents. The following are the tf/idf values:

    1.txt: 0.0, 0.5
    2.txt: 0.0, 0.5

The documents are like:

3 Answers
  • 2020-12-13 05:44
    sim(a, b) = (a * b) / (|a| * |b|)

    a * b is the dot product of the two tf/idf vectors, and |a|, |b| are their Euclidean norms.

    Some details:

    import math

    def dot(a, b):
        # dot product of two equal-length vectors
        s = 0.0
        for i in range(len(a)):
            s += a[i] * b[i]
        return s

    def norm(a):
        # Euclidean norm (length) of a vector
        s = 0.0
        for i in range(len(a)):
            s += a[i] * a[i]
        return math.sqrt(s)

    def cossim(a, b):
        # cosine similarity: dot product divided by the product of the norms
        return dot(a, b) / (norm(a) * norm(b))
    

    Yes, to some extent a and b must have the same length. But a and b usually have a sparse representation: you only need to store the non-zero entries, which lets you compute the norm and dot product faster (see the sketch below).
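
    For illustration, here is a minimal sketch of that sparse approach (the dict-based representation and the term names are placeholders of my own, not from the answer), applied to the tf/idf vectors from the question:

    import math

    def sparse_cosine(a, b):
        # a and b map term -> tf/idf weight; only non-zero entries are stored
        dot = sum(w * b[t] for t, w in a.items() if t in b)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b)

    # the vectors from the question, with zero entries dropped
    # (the term names are hypothetical; the question only gives the weights)
    doc1 = {"term2": 0.5}   # 1.txt -> [0.0, 0.5]
    doc2 = {"term2": 0.5}   # 2.txt -> [0.0, 0.5]

    print(sparse_cosine(doc1, doc2))  # prints 1.0: the vectors point in the same direction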

  • 2020-12-13 05:52

    A simple Java implementation:

    import java.util.Map;
    import java.util.Set;
    import com.google.common.collect.Sets;   // Guava

    static double cosine_similarity(Map<String, Double> v1, Map<String, Double> v2) {
        // only terms present in both vectors contribute to the dot product
        Set<String> both = Sets.newHashSet(v1.keySet());
        both.retainAll(v2.keySet());
        double scalar = 0, norm1 = 0, norm2 = 0;
        for (String k : both) scalar += v1.get(k) * v2.get(k);
        for (String k : v1.keySet()) norm1 += v1.get(k) * v1.get(k);
        for (String k : v2.keySet()) norm2 += v2.get(k) * v2.get(k);
        return scalar / Math.sqrt(norm1 * norm2);
    }
    
  • 2020-12-13 05:58

    1) Calculate tf-idf (generally better than tf alone, but it completely depends on your data set and requirements).

    From Wikipedia (regarding idf):

    An inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

    2) No, it is not important that both documents have the same number of words.

    3) You can compute tf-idf or cosine similarity in any language nowadays by calling a machine-learning library function. I prefer Python.

    Python code to calculate tf-idf and cosine similarity (using scikit-learn 0.18.2):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    # example dataset
    from sklearn.datasets import fetch_20newsgroups

    # replace with your method to get data
    example_data = fetch_20newsgroups(subset='all').data

    max_features_for_tfidf = 10000
    is_idf = True

    vectorizer = TfidfVectorizer(max_df=0.5, max_features=max_features_for_tfidf,
                                 min_df=2, stop_words='english',
                                 use_idf=is_idf)

    X_Mat = vectorizer.fit_transform(example_data)

    # calculate cosine similarity between samples in X with samples in Y
    cosine_sim = cosine_similarity(X=X_Mat, Y=X_Mat)

    4) You might be interested in truncated Singular Value Decomposition (SVD), sketched below.
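
    As a rough illustration (not part of the original answer), truncated SVD can be applied to the tf-idf matrix X_Mat from the snippet above to reduce its dimensionality before computing cosine similarity; the n_components value is an arbitrary example:

    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    # project the tf-idf matrix into a lower-dimensional space (essentially LSA)
    svd = TruncatedSVD(n_components=100)   # 100 components chosen arbitrarily
    X_reduced = svd.fit_transform(X_Mat)

    # cosine similarity between documents in the reduced space
    cosine_sim_svd = cosine_similarity(X_reduced, X_reduced)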
