Approximate String Matching in R

a 夏天 posted on 2019-12-20 04:21:03

Question


For my research I have to match two data sets containing fund information. Unfortunately, there is no common identifier. The good news is that both data sets have an identifier for the document number; however, a document can contain multiple funds. If a document contains multiple funds (e.g. 20), I can only match them by fund name, which sometimes differs slightly. Note that the number of funds per document is identical in both data sets. After searching a little, I tried this function (found here: agrep: only return best match(es)):

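library(RecordLinkage)  # provides levenshteinSim()

# Return the element(s) of stringVector most similar to string.
# Note: despite its name, `distance` holds a similarity (higher is closer).
ClosestMatch2 = function(string, stringVector){

  distance = levenshteinSim(string, stringVector);
  stringVector[distance == max(distance)]

}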

This worked fine for most funds; however, I discovered two problems:

  1. Sometimes there are multiple matches
  2. Sometimes I have wrong matches

For example: This function matched "INSTITUTIONAL LARGE CORE FUND" to "Transamerica Partners Institutional Core Bond" instead of "Transamerica Partners Institutional Large Core".
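One plausible contributor to this mismatch (an assumption, not something confirmed in the question) is that a character-level measure compares an all-caps name against mixed-case names and also counts the long shared prefix. A minimal sketch of a case-insensitive, word-level comparison on the two candidates above follows. The helper name `token_jaccard` is hypothetical and uses base R only.

```r
# Hypothetical helper: Jaccard similarity on lower-cased word tokens.
token_jaccard <- function(a, b) {
  ta <- unique(strsplit(tolower(a), "[^a-z0-9]+")[[1]])
  tb <- unique(strsplit(tolower(b), "[^a-z0-9]+")[[1]])
  ta <- ta[nzchar(ta)]  # drop empty tokens from leading separators
  tb <- tb[nzchar(tb)]
  n_union <- length(union(ta, tb))
  if (n_union == 0) return(0)
  length(intersect(ta, tb)) / n_union
}

query <- "INSTITUTIONAL LARGE CORE FUND"
candidates <- c("Transamerica Partners Institutional Core Bond",
                "Transamerica Partners Institutional Large Core")

# Shared words: 2 of 7 for the first candidate, 3 of 6 for the second.
scores <- vapply(candidates, function(x) token_jaccard(query, x),
                 numeric(1), USE.NAMES = FALSE)
best <- candidates[which.max(scores)]
print(best)
```

On this pair the word-level score prefers "Transamerica Partners Institutional Large Core", but that is no guarantee it behaves better on the full data set.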

I have two ideas to circumvent these problems:

  1. I use another matching function to verify the function above, i.e. I only accept a match if both functions yield the same result.
  2. I somehow adapt the function above.
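The first idea could be sketched roughly as follows, assuming a case-insensitive normalised Levenshtein similarity (via base R's `adist`) as one measure and a word-overlap score as the second. All function names here are hypothetical, and the choice of second measure is an assumption. The sketch returns `NA` when the two measures disagree or when the best match is tied, which also addresses the multiple-matches problem.

```r
# Measure 1: normalised Levenshtein similarity, ignoring case (base R adist).
lev_sim <- function(string, stringVector) {
  d <- as.vector(utils::adist(string, stringVector, ignore.case = TRUE))
  1 - d / pmax(nchar(string), nchar(stringVector))
}

# Measure 2 (assumed alternative): Jaccard overlap of lower-cased words.
word_overlap <- function(a, b) {
  ta <- unique(strsplit(tolower(a), "[^a-z0-9]+")[[1]])
  tb <- unique(strsplit(tolower(b), "[^a-z0-9]+")[[1]])
  ta <- ta[nzchar(ta)]
  tb <- tb[nzchar(tb)]
  n_union <- length(union(ta, tb))
  if (n_union == 0) return(0)
  length(intersect(ta, tb)) / n_union
}

# Accept a match only if both measures pick the same single candidate.
agreed_match <- function(string, stringVector) {
  s1 <- lev_sim(string, stringVector)
  s2 <- vapply(stringVector, function(x) word_overlap(string, x),
               numeric(1), USE.NAMES = FALSE)
  both <- intersect(which(s1 == max(s1)), which(s2 == max(s2)))
  if (length(both) == 1) stringVector[both] else NA_character_
}

agreed_match("Large Core Fund", c("Small Cap Bond", "LARGE CORE FUND"))
```

Names that come back as `NA` can then be reviewed by hand instead of being matched silently.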

I would really appreciate your help. Best, Laurenz


Answer 1:


The RecordLinkage package allows you to match strings with several approaches (e.g. Levenshtein, but also other measures), and it lets you define thresholds or even use a classification model to indicate when a match is acceptable.
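As a rough illustration of the threshold idea, here is a minimal sketch that uses `levenshteinSim` from RecordLinkage (the same function as in the question) and rejects a best match whose similarity falls below a cut-off. The function name `closest_match_threshold`, the lower-casing step, and the cut-off of 0.8 are assumptions made for illustration and would need tuning on real fund names. The package also provides `jarowinkler` as an alternative measure.

```r
library(RecordLinkage)  # provides levenshteinSim() and jarowinkler()

# Hypothetical helper: return the single best match, or NA if it is too weak.
closest_match_threshold <- function(string, stringVector, threshold = 0.8) {
  # Lower-case both sides so that case differences do not count as edits.
  sim <- levenshteinSim(tolower(string), tolower(stringVector))
  best <- which.max(sim)  # first index on ties
  if (sim[best] >= threshold) stringVector[best] else NA_character_
}

closest_match_threshold("Large Core Fund",
                        c("Small Cap Bond", "LARGE CORE FUND"))
```

A stricter threshold trades wrong matches for more unmatched names, so it may be worth checking a sample of rejected pairs before settling on a value.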



Source: https://stackoverflow.com/questions/16145064/approximate-string-matching-in-r
