URL path similarity/string similarity algorithm

I know it's not an exact answer to your question, but are you familiar with the k-means algorithm?

I guess even Levenshtein distance could work here; the difficulty, however, is how to compute centroids with that approach.

Perhaps you can divide the input set into disjoint subsets, then for each URL in each subset compute the distance to all the other URLs in the same subset; the URL with the lowest sum of distances is the centroid (in effect, the medoid). Of course, this depends on how big the input set is; for huge sets the quadratic number of distance computations may make this impractical. A minimal sketch of the idea follows below.
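A minimal sketch of that medoid idea, assuming plain (unweighted) Levenshtein distance; `urls` is a hypothetical list of paths in one subset:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def medoid(urls):
    """The 'centroid' of a subset: the URL with the lowest total
    distance to all the other URLs in the same subset (O(n^2))."""
    return min(urls, key=lambda u: sum(levenshtein(u, v) for v in urls))
```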

The good thing about k-means is that you can start with an absolutely random division and then iteratively improve it.

The bad thing about k-means is that you have to specify k before starting. However, during the run (perhaps once things have stabilized after the first couple of iterations), you can measure the intra-similarity of each set, and if it is low, you can divide that set into two subsets and continue with the same algorithm; a rough sketch of the loop is below.
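For completeness, here is a rough sketch of that k-means-style loop. It is really k-medoids, since there is no way to average strings; it reuses `levenshtein` and `medoid` from the sketch above, and `k` and the iteration count are arbitrary assumptions:

```python
import random

def k_medoids(urls, k, iterations=10):
    """k-means-style clustering over strings: random start, then
    alternate between assignment and medoid recomputation."""
    medoids = random.sample(urls, k)
    for _ in range(iterations):
        # Assignment step: attach every URL to its nearest medoid.
        clusters = [[] for _ in medoids]
        for u in urls:
            nearest = min(range(len(medoids)),
                          key=lambda m: levenshtein(u, medoids[m]))
            clusters[nearest].append(u)
        # Update step: drop empty clusters, recompute the medoids.
        clusters = [c for c in clusters if c]
        medoids = [medoid(c) for c in clusters]
    return clusters
```

Measuring the intra-similarity of a cluster (say, the mean distance of its members to the medoid) and splitting a weak cluster in two fits naturally into the update step.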

Levenshtein distance is the best option, but a tuned one. You should use a weighted edit distance and possibly split the path into tokens (words and numbers). For example, version strings like "2.5.6-rc2" and "2.5.6" can be treated as a 0-weight difference, while name tokens like phpMyAdmin and javaMyAdmin give a weight-1 difference.
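A hedged sketch of that idea: split paths into tokens, then run the edit distance over token lists with a custom substitution cost. The weighting scheme is only this answer's suggestion, and the regexes are one possible tokenization, not a standard library feature:

```python
import re

VERSION = re.compile(r'\d+(?:\.\d+)*(?:-\w+)?$')

def tokenize(path):
    # Split a path into version-like tokens and word tokens.
    return re.findall(r'\d+(?:\.\d+)*(?:-\w+)?|[A-Za-z]+', path)

def token_cost(a, b):
    # Substitution weight between two tokens.
    if a == b:
        return 0
    if VERSION.match(a) and VERSION.match(b):
        return 0  # "2.5.6-rc2" vs "2.5.6": version tokens weigh nothing
    return 1      # "phpMyAdmin" vs "javaMyAdmin": name tokens weigh 1

def weighted_distance(p1, p2):
    # Levenshtein over token lists instead of characters,
    # with token_cost as the substitution weight.
    t1, t2 = tokenize(p1), tokenize(p2)
    prev = list(range(len(t2) + 1))
    for i, a in enumerate(t1, 1):
        curr = [i]
        for j, b in enumerate(t2, 1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + token_cost(a, b)))
        prev = curr
    return prev[-1]
```

With this, `weighted_distance("phpMyAdmin-2.5.6-rc2", "phpMyAdmin-2.5.6")` comes out as 0, while `weighted_distance("phpMyAdmin-2.5.6", "javaMyAdmin-2.5.6")` comes out as 1.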

While checking @jakub.gieryluk's suggestion, I accidentally found a solution that satisfies me: the Hobohm clustering algorithm, originally devised to reduce the redundancy of biological sequence data sets.

Tests of the Perl library implemented by Bruno Vecchi gave me really good results. The only problem is that I need a Python implementation, but I believe I can either find one on the Internet or reimplement the code myself.
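For reference, the core of Hobohm's "algorithm 1" is simple enough that a reimplementation looks roughly like this; the `similarity` function and `cutoff` are placeholders, not values taken from Vecchi's library:

```python
def hobohm_select(items, similarity, cutoff):
    """Hobohm algorithm 1: walk the items once and keep one as a
    representative only if it is not too similar to any
    representative selected so far."""
    selected = []
    for item in items:
        if all(similarity(item, s) < cutoff for s in selected):
            selected.append(item)
    return selected
```

For URL paths, a similarity could be derived from a (weighted) edit distance, e.g. one minus the distance normalized by the longer token list.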

The next thing is that I have not checked the active-learning ability of this algorithm yet ;)
