Find matching strings between two vectors in R

Posted by 佐手 on 2019-11-26 23:31:02

Question


I have two vectors in R. I want to find partial matches between them.

My Data

The first one is from a dataset named muc, which contains 6400 street names. muc$name looks like:

muc$name = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen" , "Altostraße",...)

The other vector is d_vector. It contains around 1400 names.

d_vector = c("Abel", "Abendroth", "von Abercron", "Abetz", "Abicht", "Abromeit", ...)

I want to find all the street names that contain a name from d_vector anywhere in the street name.

First, I made some general adjustments after importing the CSV data (as variable d):

d_vector <- unlist(d$name)
d_vector <- as.vector(as.matrix(d_vector))

What I tried so far

  • First I tried grep, collapsing d_vector into one long string separated by | for a regex search:

result <- unique(grep(paste(d_vector, collapse="|"), muc$Name, value=TRUE, ignore.case = TRUE))
result

But the result returns all the street names.

  • I also tried agrep, which returned an out-of-memory error.

  • When I tried d_vector %in% muc$name, it returned just one TRUE and hundreds of FALSE, which doesn't seem right.

Do you have any suggestions about where my mistake could lie, or which library I could use? I am looking for something like Python's "fuzzywuzzy" for R.


Answer 1:


Simple solution:

streets = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen" , "Altostraße")
streets = tolower(streets) #Lowercase all
names = c("Berber", "Weg")
names = tolower(names)

sapply(names, function (y) sapply(streets, function (x) grepl(y, x)))

#                    berber   weg
#berberichweg         TRUE  TRUE
#otto-klemperer-weg  FALSE  TRUE
#feldmeierbogen      FALSE FALSE
#altostraße          FALSE FALSE
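The logical matrix above can be reduced to the streets that contain at least one name. This is a minimal sketch using the same toy vectors. The argument fixed = TRUE is an addition, not part of the answer, so that names are treated as literal text rather than regular expressions. The variable names names_to_find, hits, matched_streets and matched_pairs are illustrative only.

```r
streets <- tolower(c("Berberichweg", "Otto-Klemperer-Weg",
                     "Feldmeierbogen", "Altostraße"))
names_to_find <- tolower(c("Berber", "Weg"))   # illustrative name, avoids masking base::names

# One column per name, one row per street; fixed = TRUE matches literally
hits <- sapply(names_to_find, function(n) grepl(n, streets, fixed = TRUE))
rownames(hits) <- streets

# Streets that contain at least one of the names
matched_streets <- streets[rowSums(hits) > 0]

# Every (street, name) pair that matched
idx <- which(hits, arr.ind = TRUE)
matched_pairs <- data.frame(street = streets[idx[, "row"]],
                            name   = names_to_find[idx[, "col"]])
```

With 6400 streets and 1400 names the matrix has roughly nine million logical cells, which should still fit comfortably in memory on most machines.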



Answer 2:


In principle, your solution works fine with some dummy data:

streets = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen", 
            "Konrad-Adenauer-Platz", "anotherThing")
patterns = c("weg", "platz")

unique(grep(paste(patterns, collapse="|"), streets, value=TRUE, ignore.case = TRUE))
[1] "Berberichweg"          "Otto-Klemperer-Weg"    "Konrad-Adenauer-Platz"

I suspect something is off with d_vector itself. Check class(d_vector), or run dput(d_vector) and paste the output here.
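One thing worth checking, as a guess rather than a confirmed diagnosis: if d_vector contains an empty string (for example from a blank CSV row), the collapsed pattern gains an empty alternative, and an empty pattern matches every string. The sketch below uses made-up data, and patterns_raw and patterns_clean are hypothetical names.

```r
streets <- c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen")
patterns_raw <- c("Klemperer", "", NA, "  Abel ")   # made-up, deliberately messy

# An empty pattern matches every street, which could explain
# why all street names come back
all_match_empty <- grepl("", streets)

# Trim surrounding whitespace, then drop NA and blank entries
patterns_clean <- trimws(patterns_raw)
patterns_clean <- patterns_clean[!is.na(patterns_clean) & nzchar(patterns_clean)]

result <- unique(grep(paste(patterns_clean, collapse = "|"), streets,
                      value = TRUE, ignore.case = TRUE))
```

If any(!nzchar(d_vector)) or anyNA(d_vector) is TRUE for the real data, cleaning the vector this way before collapsing it is a cheap first test.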

You can also try using sapply and see if that will work:

matches = sapply(patterns, function(p) grep(p, streets, value=TRUE, ignore.case = TRUE))
# $weg
# [1] "Berberichweg"       "Otto-Klemperer-Weg"
# 
# $platz
# [1] "Konrad-Adenauer-Platz"

unique(unlist(matches))
# [1] "Berberichweg"          "Otto-Klemperer-Weg"    "Konrad-Adenauer-Platz"
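The same per-pattern loop can also be tried with base R's agrepl for the approximate matching mentioned in the question. This is a sketch under assumptions, not a tested fix for the out-of-memory error: calling agrepl once per name keeps each pattern short instead of building one huge alternation. The misspelled street name and the choice of max.distance = 1 are made up for illustration.

```r
# "Konrad-Adenaur-Platz" is deliberately misspelled (made-up test data)
streets <- c("Berberichweg", "Otto-Klemperer-Weg", "Konrad-Adenaur-Platz")
patterns <- c("Klemperer", "Adenauer")

# One agrepl() call per name; max.distance = 1 allows a single
# insertion, deletion or substitution
fuzzy_hits <- sapply(patterns, function(p)
  agrepl(p, streets, max.distance = 1, ignore.case = TRUE))
rownames(fuzzy_hits) <- streets

# Streets that approximately contain at least one name
fuzzy_streets <- streets[rowSums(fuzzy_hits) > 0]
```

Short names combined with a generous max.distance are likely to produce many false positives, so the threshold probably needs tuning against the real data.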


Source: https://stackoverflow.com/questions/38371321/find-matching-strings-between-two-vectors-in-r
