How to match a string with a tolerance of one character?

Posted by 二次信任 on 2019-12-12 00:43:48

Question


I have a vector of locations that I am trying to disambiguate against a vector of correct location names. For this example I am using just two disambiguated locations, though:

agrepl('Au', c("Austin, TX", "Houston, TX"), 
max.distance =  .000000001, 
ignore.case = T, fixed = T)
[1] TRUE TRUE

The help page says that max.distance is

Maximum distance allowed for a match. Expressed either as integer, or as a fraction of the pattern length times the maximal transformation cost
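
A note on why such a tiny fraction still tolerates a mismatch: the full entry in ?agrep adds that a fractional value "will be replaced by the smallest integer not less than the corresponding fraction". The arithmetic below is only a sketch of that rounding, assuming the default transformation cost of 1; the variable names are illustrative and not part of any API.

```r
# Sketch of the rounding described in ?agrep: a fractional max.distance is
# scaled by the pattern length (times the maximal transformation cost,
# assumed to be 1 here) and then rounded up to a whole number of edits.
pattern  <- "Au"
fraction <- 0.000000001

allowed_edits <- ceiling(fraction * nchar(pattern) * 1)
allowed_edits            # 1: any positive fraction rounds up to at least one edit

# Only a value of exactly zero rounds to zero edits under the same arithmetic.
zero_edits <- ceiling(0 * nchar(pattern) * 1)
zero_edits               # 0
```

Under that reading, 'Au' with one allowed edit can match a substring such as "ou" in "Houston, TX", which would explain the two TRUE values.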

I am not sure about the mathematical meaning of the Levenshtein distance; my understanding is that the smaller the distance, the stricter the tolerance for mismatches against my vector of disambiguated strings.
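
As a rough illustration of what the distance measures: base R's adist (in utils) reports the generalized Levenshtein distance, that is, the minimal number of insertions, deletions and substitutions needed to turn one string into another. A minimal sketch using the strings from this question (the variable names are made up for the example):

```r
# adist() returns a matrix of edit distances (rows: x, columns: y).
candidates <- c("Austin, TX", "Houston, TX")

# One missing "i": a single insertion turns "Austn, TX" into "Austin, TX".
d_typo <- adist("Austn, TX", candidates)
d_typo[1, 1]   # 1

# "Au" compared with the whole name: eight characters would have to be inserted.
d_short <- adist("Au", candidates, ignore.case = TRUE)
d_short[1, 1]  # 8
```

So a smaller allowed distance does mean a stricter match, provided the whole strings are compared rather than substrings.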

So how would I adjust it to retrieve two FALSE values? Basically, I would like to get TRUE only when there is a difference of one character, as in:

agrepl('Austn, TX', "Austin, TX", 
max.distance =  .000000001, ignore.case = T, fixed = T)
[1] TRUE

Answer 1:


The problem you are having is possibly similar to the one I faced when I started experimenting here. The first argument is a regex pattern only when fixed=FALSE; either way, agrepl matches substrings, so short patterns are very permissive unless they are constrained to match the full string. The help page even has a "Note" about that issue:

Since someone who read the description carelessly even filed a bug report on it, do note that this matches substrings of each element of x (just as grep does) and not whole elements.

With a regex pattern (fixed=FALSE) you do this by flanking the pattern string with "^" and "$", since unlike adist, agrepl has no partial parameter:

> agrepl('^Au$', "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] FALSE
> agrepl('^Austn, TX$', "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] TRUE
> agrepl('^Austn, T$', "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] FALSE

So you need to add those anchors with paste0:

> agrepl( paste0('^', 'Austn, Tx', '$'), "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] TRUE
> agrepl( paste0('^', 'Au', '$'), "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] FALSE

It might be better to use all, which bounds insertions, deletions and substitutions together, rather than just insertions, and you may want to lower the fraction.
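
If whole-string comparison is all that is needed, one alternative to anchoring a regex is to threshold adist directly. This is a different technique from the agrepl calls above, offered only as a sketch; within_edits is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper (not from the original answer): TRUE where a whole
# candidate string is within `max_edits` edits of the input string.
within_edits <- function(x, candidates, max_edits = 1) {
  # adist() compares entire strings by default (fixed = TRUE, partial = FALSE).
  d <- adist(x, candidates, ignore.case = TRUE)
  d <= max_edits
}

correct <- c("Austin, TX", "Houston, TX")

within_edits("Au", correct)         # 1 x 2 matrix: FALSE FALSE
within_edits("Austn, TX", correct)  # 1 x 2 matrix: TRUE  FALSE
```

Note that an exact match has distance 0 and therefore also returns TRUE here; comparing with d == 1 instead would keep only strings that differ by exactly one character.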



Source: https://stackoverflow.com/questions/37558974/how-to-match-a-string-with-a-tolerance-of-one-character
