URL path similarity/string similarity algorithm

I know it's not an exact answer to your question, but are you familiar with the k-means algorithm?

I guess even Levenshtein distance could work here; the difficulty, however, is how to compute centroids with that approach.

Perhaps you can divide the input set into disjoint subsets, then for each URL in each subset compute the distance to all the other URLs in the same subset; the URL with the lowest sum of distances is the centroid (in effect, the medoid). Of course, this depends on how big the input set is; for huge sets the quadratic number of distance computations may make this impractical. A minimal sketch of the idea follows below.
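A minimal sketch of that medoid idea, assuming plain (unweighted) Levenshtein distance; `urls` is a hypothetical list of paths in one subset:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def medoid(urls):
    """The 'centroid' of a subset: the URL with the lowest total
    distance to all the other URLs in the same subset (O(n^2))."""
    return min(urls, key=lambda u: sum(levenshtein(u, v) for v in urls))
```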

The good thing about k-means is that you can start with an absolutely random division and then iteratively improve it.

The bad thing about k-means is that you have to specify k before starting. However, during the run (perhaps once things have stabilized after the first couple of iterations), you can measure the intra-similarity of each set, and if it is low, you can divide that set into two subsets and continue with the same algorithm; a rough sketch of the loop is below.
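For completeness, here is a rough sketch of that k-means-style loop. It is really k-medoids, since there is no way to average strings; it reuses `levenshtein` and `medoid` from the sketch above, and `k` and the iteration count are arbitrary assumptions:

```python
import random

def k_medoids(urls, k, iterations=10):
    """k-means-style clustering over strings: random start, then
    alternate between assignment and medoid recomputation."""
    medoids = random.sample(urls, k)
    for _ in range(iterations):
        # Assignment step: attach every URL to its nearest medoid.
        clusters = [[] for _ in medoids]
        for u in urls:
            nearest = min(range(len(medoids)),
                          key=lambda m: levenshtein(u, medoids[m]))
            clusters[nearest].append(u)
        # Update step: drop empty clusters, recompute the medoids.
        clusters = [c for c in clusters if c]
        medoids = [medoid(c) for c in clusters]
    return clusters
```

Measuring the intra-similarity of a cluster (say, the mean distance of its members to the medoid) and splitting a weak cluster in two fits naturally into the update step.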

Levenshtein distance is the best option, but a tuned one. You should use a weighted edit distance and possibly split the path into tokens (words and numbers). For example, version strings like "2.5.6-rc2" and "2.5.6" can be treated as a 0-weight difference, while name tokens like phpMyAdmin and javaMyAdmin give a weight-1 difference.
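A hedged sketch of that idea: split paths into tokens, then run the edit distance over token lists with a custom substitution cost. The weighting scheme is only this answer's suggestion, and the regexes are one possible tokenization, not a standard library feature:

```python
import re

VERSION = re.compile(r'\d+(?:\.\d+)*(?:-\w+)?$')

def tokenize(path):
    # Split a path into version-like tokens and word tokens.
    return re.findall(r'\d+(?:\.\d+)*(?:-\w+)?|[A-Za-z]+', path)

def token_cost(a, b):
    # Substitution weight between two tokens.
    if a == b:
        return 0
    if VERSION.match(a) and VERSION.match(b):
        return 0  # "2.5.6-rc2" vs "2.5.6": version tokens weigh nothing
    return 1      # "phpMyAdmin" vs "javaMyAdmin": name tokens weigh 1

def weighted_distance(p1, p2):
    # Levenshtein over token lists instead of characters,
    # with token_cost as the substitution weight.
    t1, t2 = tokenize(p1), tokenize(p2)
    prev = list(range(len(t2) + 1))
    for i, a in enumerate(t1, 1):
        curr = [i]
        for j, b in enumerate(t2, 1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + token_cost(a, b)))
        prev = curr
    return prev[-1]
```

With this, `weighted_distance("phpMyAdmin-2.5.6-rc2", "phpMyAdmin-2.5.6")` comes out as 0, while `weighted_distance("phpMyAdmin-2.5.6", "javaMyAdmin-2.5.6")` comes out as 1.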

While checking @jakub.gieryluk's suggestion, I accidentally found a solution that satisfies me: the Hobohm clustering algorithm, originally devised to reduce the redundancy of biological sequence data sets.

Tests of the Perl library implemented by Bruno Vecchi gave me really good results. The only problem is that I need a Python implementation, but I believe I can either find one on the Internet or reimplement the code myself.
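For reference, the core of Hobohm's "algorithm 1" is simple enough that a reimplementation looks roughly like this; the `similarity` function and `cutoff` are placeholders, not values taken from Vecchi's library:

```python
def hobohm_select(items, similarity, cutoff):
    """Hobohm algorithm 1: walk the items once and keep one as a
    representative only if it is not too similar to any
    representative selected so far."""
    selected = []
    for item in items:
        if all(similarity(item, s) < cutoff for s in selected):
            selected.append(item)
    return selected
```

For URL paths, a similarity could be derived from a (weighted) edit distance, e.g. one minus the distance normalized by the longer token list.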

The next thing is that I have not checked the active-learning ability of this algorithm yet ;)
