Approximate String Matching in R

Question

For my research I have to match two data sets containing fund information. Unfortunately, there is no common identifier. The good news is that both data sets contain a document number, although a single document can cover multiple funds. If a document contains multiple funds (e.g. 20), I can only match them by fund name, which sometimes differs slightly between the two data sets. Note that the number of funds per document is identical in both data sets. After searching a bit, I tried this function (found here: agrep: only return best match(es)):

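library(RecordLinkage)  # provides levenshteinSim()

# Return the element(s) of stringVector most similar to string.
ClosestMatch2 = function(string, stringVector){

  # levenshteinSim() returns a similarity (higher = closer), not a distance
  distance = levenshteinSim(string, stringVector);
  # all entries tied for the highest similarity are returned
  stringVector[distance == max(distance)]

}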

This worked fine for most funds; however, I discovered two problems:

Sometimes there are multiple matches
Sometimes I have wrong matches

For example: This function matched "INSTITUTIONAL LARGE CORE FUND" to "Transamerica Partners Institutional Core Bond" instead of "Transamerica Partners Institutional Large Core".
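One possible contributor, assuming the names were compared exactly as quoted above, is letter case: a plain Levenshtein measure treats "A" and "a" as different characters, so an upper-case name ends up far from a mixed-case one even when the words agree. The sketch below uses base R's adist() as a stand-in for the Levenshtein measure (it is not the RecordLinkage function from the question) to compare the raw and lower-cased distances:

```r
# adist() ships with base R (package utils); it is used here only as a
# stand-in for the Levenshtein measure in the question.
query     <- "INSTITUTIONAL LARGE CORE FUND"
candidate <- "Transamerica Partners Institutional Large Core"

# Raw comparison: upper- vs. lower-case letters count as substitutions.
d_raw <- adist(query, candidate)[1, 1]

# Lower-case both strings first so that only real differences are counted.
d_lower <- adist(tolower(query), tolower(candidate))[1, 1]

print(c(raw = d_raw, lowercased = d_lower))
```

If the two data sets use different capitalisation, lower-casing both sides before calling the matching function is a cheap first step to try.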

I have two ideas to circumvent these problems:

I use another matching function to verify the function above, i.e. I only accept a match if both functions yield the same result.
I somehow adapt the function above.
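The first idea could be sketched roughly as follows. This is only an illustration under assumptions: the helper names (lev_sim, word_jaccard, unique_best, agreed_match) are hypothetical, the second measure (Jaccard overlap of words) is my own choice rather than anything from the question, and everything uses base R so that it runs without extra packages.

```r
# Normalised Levenshtein similarity in [0, 1], case-insensitive (base R adist()).
lev_sim <- function(string, candidates) {
  d <- as.vector(adist(tolower(string), tolower(candidates)))
  1 - d / pmax(nchar(string), nchar(candidates))
}

# Jaccard similarity on word sets: shared words / all distinct words.
word_jaccard <- function(string, candidates) {
  tokens <- function(x) {
    w <- unique(strsplit(tolower(x), "[^a-z0-9]+")[[1]])
    w[nzchar(w)]  # drop empty tokens caused by leading separators
  }
  a <- tokens(string)
  vapply(candidates, function(cand) {
    b <- tokens(cand)
    length(intersect(a, b)) / length(union(a, b))
  }, numeric(1), USE.NAMES = FALSE)
}

# Index of the single best candidate, or NA if the maximum is tied.
unique_best <- function(sims) {
  best <- which(sims == max(sims))
  if (length(best) == 1) best else NA_integer_
}

# Accept a match only if both measures pick the same single candidate.
agreed_match <- function(string, candidates) {
  i <- unique_best(lev_sim(string, candidates))
  j <- unique_best(word_jaccard(string, candidates))
  if (is.na(i) || is.na(j) || i != j) return(NA_character_)
  candidates[i]
}
```

For the example above, the word-overlap measure scores "Transamerica Partners Institutional Large Core" higher (3 shared words out of 6 distinct ones) than "Transamerica Partners Institutional Core Bond" (2 out of 7). Names for which agreed_match() returns NA would then need to be checked by hand.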

I would really appreciate your help. Best, Laurenz

Answer 1:

The RecordLinkage package lets you match strings with several approaches (e.g. Levenshtein, but also other measures). It also lets you define thresholds, or even use a classification model, to indicate when a match is acceptable to you.
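A threshold of this kind might look like the sketch below. The function names (base_lev_sim, threshold_match) and the cut-off of 0.8 are assumptions made for illustration, not part of RecordLinkage. The default similarity is a base R stand-in so the code runs without the package; if RecordLinkage is installed, its levenshteinSim() or jarowinkler() functions, which both take two character vectors, could be passed as sim_fun instead.

```r
# Base R stand-in: normalised Levenshtein similarity in [0, 1].
base_lev_sim <- function(string, candidates) {
  d <- as.vector(adist(string, candidates))
  1 - d / pmax(nchar(string), nchar(candidates))
}

# Return the single best candidate, or NA if the best score is tied
# or falls below the (assumed, arbitrary) threshold.
threshold_match <- function(string, candidates,
                            sim_fun = base_lev_sim, threshold = 0.8) {
  # compare in lower case so capitalisation does not affect the score
  sims <- sim_fun(tolower(string), tolower(candidates))
  best <- which(sims == max(sims))
  if (length(best) != 1 || sims[best] < threshold) return(NA_character_)
  candidates[best]
}

# Example call with made-up fund names:
threshold_match("Core Bond Fund", c("Core Bond Fnd", "Large Cap Growth"))
```

A suitable threshold depends on the data; checking a sample of accepted and rejected matches by hand is one way to pick it.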

Source: https://stackoverflow.com/questions/16145064/approximate-string-matching-in-r

Tags

string-matching

levenshtein-distance