Potential Duplicates Detection, with 3 Severity Levels

迷失自我 2020-12-11 12:48

I want to make a program that detects potential duplicates with 3 severity levels. Consider that my data is in only two columns, but with thousands of rows. Data in second column d

1 Answer
  • 2020-12-11 13:05

    Really, it depends on how you want to define "severity level". Here's one way to do it, not necessarily the best: use the Levenshtein distance.

    Represent each of your items by a one-character attribute symbol, e.g.

    H    helmet
    K    knight
    I    iron
    $    Leather
    ^    Valros
    ¢    Plain
    ╔    Whatever
    etc.
    

    Then convert your Material lists into a string containing sequence of characters representing these attributes:

    HIK = helmet,iron,knight
    ¢H  = plain,helmet
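
A rough sketch of that encoding step, written in Python for brevity rather than VBA. The `SYMBOLS` table and the `encode` helper are made-up names for illustration, and the sketch assumes each material is a comma-separated list whose attributes all appear in the table:

```python
# Hypothetical symbol table: one character per attribute.
# Extend it with whatever attributes occur in your own data.
SYMBOLS = {
    "helmet": "H",
    "knight": "K",
    "iron": "I",
    "leather": "$",
    "valros": "^",
}


def encode(material: str) -> str:
    """Turn a comma-separated attribute list into a string of symbols."""
    # Normalise spacing and case so "Iron " and "iron" map to the same symbol.
    attributes = [a.strip().lower() for a in material.split(",") if a.strip()]
    # An attribute missing from SYMBOLS raises KeyError (assumed acceptable here).
    return "".join(SYMBOLS[a] for a in attributes)


print(encode("helmet,iron,knight"))  # HIK
print(encode("leather,helmet"))      # $H
```

Note that the symbols keep the order of the attributes, so "iron,helmet" and "helmet,iron" encode differently. If order should not matter in your data, sorting the attributes before encoding is one option.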
    

    Then compute the Levenshtein distance between those two strings. That will be your "severity level".

    Debug.Print LevenshteinDistance("HIK","¢H")
    'returns 3
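
For reference, here is one common way to compute the distance, the two-row dynamic-programming variant, as a Python sketch. It is not the VBA port referred to in this answer. The `severity` function and its cut-offs (0, 1, 2) are purely an assumption; pick whatever thresholds suit your data:

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def severity(distance: int) -> Optional[str]:
    """Map a distance to one of three levels (hypothetical cut-offs)."""
    if distance == 0:
        return "high"    # identical attribute strings
    if distance == 1:
        return "medium"
    if distance == 2:
        return "low"
    return None          # too different to flag as a potential duplicate


print(levenshtein("HIK", "$H"))            # 3
print(severity(levenshtein("HIK", "HK")))  # medium
```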
    

    Two implementations of the Levenshtein distance are shown on Wikipedia. And indeed you are in luck: someone on Stack Overflow ported this to VBA.

    In the comments section below, you say you don't like having to represent each of your possible attributes by one-character symbols. Fair enough; I agree this is a bit silly. Workaround: It is, in fact, possible to adapt the Levenshtein Distance algorithm to look not at each character in a string, but at each element of an array instead, and do comparisons based on that. I show how to make this change in my answer to your follow-up question.
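
As an illustration of that idea (not the code from the follow-up answer, which is not reproduced here), the same recurrence works on lists of attributes instead of characters. A Python sketch, where `sequence_distance` and `tokens` are hypothetical names:

```python
from typing import List, Sequence


def tokens(material: str) -> List[str]:
    """Split a comma-separated attribute list into normalised tokens."""
    return [t.strip().lower() for t in material.split(",") if t.strip()]


def sequence_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance where the unit of comparison is a whole
    list element rather than a single character."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # drop item_a
                current[j - 1] + 1,      # add item_b
                previous[j - 1] + cost,  # replace, or keep on a match
            ))
        previous = current
    return previous[-1]


print(sequence_distance(tokens("helmet,iron,knight"), tokens("leather,helmet")))  # 3
print(sequence_distance(tokens("helmet,iron"), tokens("helmet,leather")))         # 1
```

This removes the need for a symbol table, at the cost of comparing whole strings at every step.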
