Cosine similarity on large data sets

Submitted by 元气小坏坏 on 2019-12-14 03:14:25

Question


I'm currently studying data mining and text comparison, and came across cosine similarity: https://en.wikipedia.org/wiki/Cosine_similarity.

Since I successfully implemented this algorithm to compare two strings, I decided to try a more complex task. I iterated over my DB, which contains about 250k documents, and compared one random document from the DB against every document in it.

Comparing all of these items took 316.35898590088 seconds, that is, over 5 minutes to compare against all 250k documents!

These results raise several questions, and I'd like to ask for some suggestions. For clarity, I'll first describe some details that might be useful.

  • PHP was chosen as the programming language.
  • Documents are stored in MySQL.
  • The implementation consists only of the cosine similarity function itself; there is no stop-word removal or any other preprocessing.

Questions

  • Is there any way to achieve better performance? Where should I start: by tuning the algorithm (e.g., preparing the vectors in advance), by using other technologies, etc.?
  • How and where should I store these comparison results? For example, I want to plot some graphs where I can see all 250k documents by similarity score, so that I can identify which are most similar, and so on.
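On the "preparing vectors in advance" idea: a large share of the 5 minutes is likely spent re-tokenizing and re-vectorizing every document on every comparison. Precomputing each document's term-frequency vector and its norm once turns each pairwise comparison into a single sparse dot product. Below is a minimal sketch of that approach; it is written in Python for brevity (the asker uses PHP, but the same structure ports directly), and the whitespace tokenizer and function names are illustrative assumptions, not the asker's actual code.

```python
import math

def tf_vector(text):
    """Build a term-frequency vector (term -> count) from whitespace-tokenized text.
    NOTE: whitespace tokenization is an illustrative assumption."""
    vec = {}
    for term in text.lower().split():
        vec[term] = vec.get(term, 0) + 1
    return vec

def precompute(docs):
    """Vectorize each document ONCE and cache its Euclidean norm,
    so each later comparison skips tokenization and norm computation."""
    prepared = []
    for text in docs:
        vec = tf_vector(text)
        norm = math.sqrt(sum(w * w for w in vec.values()))
        prepared.append((vec, norm))
    return prepared

def cosine(a, b):
    """Cosine similarity between two precomputed (vector, norm) pairs."""
    (va, na), (vb, nb) = a, b
    if na == 0 or nb == 0:
        return 0.0
    # Iterate over the smaller vector: the sparse dot product only
    # needs terms that occur in both documents.
    if len(va) > len(vb):
        va, vb = vb, va
    dot = sum(w * vb.get(t, 0) for t, w in va.items())
    return dot / (na * nb)
```

With this split, comparing one document against all 250k costs one pass over cached vectors instead of 250k rounds of tokenization; the cached vectors and norms could also be stored back into MySQL alongside the documents.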

Answer 1:


Both PHP and MySQL are about the worst choices you could have made.

Efficient cosine similarity is at the heart of Lucene. The key acceleration technique is compressed inverted indexes. But you really don't want to reimplement them in PHP...
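To make the inverted-index point concrete: instead of scoring the query document against all 250k documents, an inverted index maps each term to the documents containing it, so only documents sharing at least one term with the query ever get touched. The following is a simplified uncompressed sketch in Python of that idea (Lucene's real postings lists are compressed and use TF-IDF weighting, not the raw term frequencies assumed here); all names are hypothetical.

```python
import math
from collections import defaultdict

def prepare(docs):
    """Term-frequency vector and Euclidean norm for each document."""
    prepared = []
    for text in docs:
        vec = defaultdict(int)
        for term in text.lower().split():
            vec[term] += 1
        norm = math.sqrt(sum(w * w for w in vec.values()))
        prepared.append((dict(vec), norm))
    return prepared

def build_index(prepared):
    """Inverted index: term -> postings list of (doc_id, term_weight)."""
    index = defaultdict(list)
    for doc_id, (vec, _norm) in enumerate(prepared):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def top_similar(query_id, prepared, index, k=10):
    """Accumulate dot products only for documents that share a term
    with the query, then normalize and return the top-k matches."""
    qvec, qnorm = prepared[query_id]
    dots = defaultdict(float)
    for term, qweight in qvec.items():
        for doc_id, weight in index[term]:
            if doc_id != query_id:
                dots[doc_id] += qweight * weight
    results = [(doc_id, dot / (qnorm * prepared[doc_id][1]))
               for doc_id, dot in dots.items()]
    results.sort(key=lambda pair: -pair[1])
    return results[:k]
```

Documents with no terms in common with the query are never visited at all, which is where the large speedup over the brute-force 250k-comparison loop comes from. In practice, using an existing engine built on this structure (Lucene, Solr, Elasticsearch) via its PHP client would avoid reimplementing any of it.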



Source: https://stackoverflow.com/questions/31368527/cosines-similarity-on-large-data-sets
