get cosine similarity between two documents in lucene

后端 未结 7 2139
野性不改
野性不改 2020-11-27 03:13

i have built an index in Lucene. I want without specifying a query, just to get a score (cosine similarity or another distance?) between two documents in the index.

7条回答
  •  谎友^
    谎友^ (楼主)
    2020-11-27 04:13

    If you don't need to store documents to Lucene and just want to calculate similarity between two docs, here's the faster code (Scala, from my blog http://chepurnoy.org/blog/2014/03/faster-cosine-similarity-between-two-dicuments-with-scala-and-lucene/ )

    def extractTerms(content: String): Map[String, Int] = {    
         val analyzer = new StopAnalyzer(Version.LUCENE_46)
         val ts = new EnglishMinimalStemFilter(analyzer.tokenStream("c", content))
         val charTermAttribute = ts.addAttribute(classOf[CharTermAttribute])
    
         val m = scala.collection.mutable.Map[String, Int]()
    
         ts.reset()
         while (ts.incrementToken()) {
             val term = charTermAttribute.toString
             val newCount = m.get(term).map(_ + 1).getOrElse(1)
             m += term -> newCount       
         }
    
         m.toMap
     }
    
    def similarity(t1: Map[String, Int], t2: Map[String, Int]): Double = {
         //word, t1 freq, t2 freq
         val m = scala.collection.mutable.HashMap[String, (Int, Int)]()
    
         val sum1 = t1.foldLeft(0d) {case (sum, (word, freq)) =>
             m += word ->(freq, 0)
             sum + freq
         }
    
         val sum2 = t2.foldLeft(0d) {case (sum, (word, freq)) =>
             m.get(word) match {
                 case Some((freq1, _)) => m += word ->(freq1, freq)
                 case None => m += word ->(0, freq)
             }
             sum + freq
         }
    
         val (p1, p2, p3) = m.foldLeft((0d, 0d, 0d)) {case ((s1, s2, s3), e) =>
             val fs = e._2
             val f1 = fs._1 / sum1
             val f2 = fs._2 / sum2
             (s1 + f1 * f2, s2 + f1 * f1, s3 + f2 * f2)
         }
    
         val cos = p1 / (Math.sqrt(p2) * Math.sqrt(p3))
         cos
     }  
    

    So, to calculate similarity between text1 and text2 just call similarity(extractTerms(text1), extractTerms(text2))

提交回复
热议问题