Calculating cosine similarity by featurizing the text into vectors using tf-idf

mcelikkaya

I had nearly the same problem: 370K rows, and two vectors of 300K and 400K features to compare against each row. I multiply the rows of a test RDD with both of these vectors.

There are two big improvements you can make. One is to pre-calculate the norms, since they never change. The other is to use sparse vectors: if you loop over vector.size you do 300K iterations per row, whereas with a sparse vector you only iterate over the keywords that are actually present (20-30 per row).
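As a minimal sketch of the first point, assuming Spark's mllib linalg API and a hypothetical RDD named tfidfRows of (id, tf-idf SparseVector) pairs, the L2 norm of each row can be computed once up front and carried along with the vector:

import org.apache.spark.mllib.linalg.{SparseVector, Vectors}
import org.apache.spark.rdd.RDD

// tfidfRows: RDD[(Long, SparseVector)] -- hypothetical name for the featurized rows.
// The norm of a vector never changes, so compute it once per row here
// instead of recomputing it inside every similarity comparison.
val rowsWithNorms: RDD[(Long, SparseVector, Double)] =
  tfidfRows.map { case (id, v) => (id, v, Vectors.norm(v, 2.0)) }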

Also, I am afraid this is the most efficient way, because the calculations do not need a shuffle. If you have a good estimate of which score is good enough for you, you can filter by that score at the end and things will be fast.

import org.apache.spark.mllib.linalg.SparseVector

// Returns both the raw dot product and the cosine similarity.
// The norms are passed in as parameters because they never change
// and can be pre-calculated once per vector (see above).
def cosineSimilarity(vectorA: SparseVector, vectorB: SparseVector,
                     normASqrt: Double, normBSqrt: Double): (Double, Double) = {
  var dotProduct = 0.0
  // Iterate only over the non-zero indices of vectorA; indices that
  // vectorB does not contain simply contribute 0.0 to the sum.
  for (i <- vectorA.indices) {
    dotProduct += vectorA(i) * vectorB(i)
  }
  val div = normASqrt * normBSqrt
  if (div == 0)
    (dotProduct, 0.0) // guard against division by zero for empty vectors
  else
    (dotProduct, dotProduct / div)
}
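A hypothetical usage sketch, building on the rowsWithNorms RDD above (referenceVectorA/referenceVectorB and their precomputed norms normA/normB are assumed names for the two big vectors, and minScore is whatever threshold is good enough for your use case): score every row against both reference vectors, then filter by score.

// referenceVectorA, referenceVectorB: SparseVector -- hypothetical names for the two big vectors.
// normA, normB: their norms, computed once on the driver with Vectors.norm(..., 2.0).
val minScore = 0.2 // hypothetical threshold; pick whatever score is enough for you

val matches = rowsWithNorms
  .map { case (id, v, norm) =>
    val (_, scoreA) = cosineSimilarity(v, referenceVectorA, norm, normA)
    val (_, scoreB) = cosineSimilarity(v, referenceVectorB, norm, normB)
    (id, scoreA, scoreB)
  }
  .filter { case (_, scoreA, scoreB) => scoreA >= minScore || scoreB >= minScore }

This is a plain map/filter per partition, so no shuffle is involved.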