Question
I have two vectors in R. I want to find partial matches between them.
My Data
The first one is from a dataset named muc, which contains 6400 street names. muc$name looks like:
muc$name = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen" , "Altostraße",...)
The other vector is d_vector. It contains around 1400 names.
d_vector = c("Abel", "Abendroth", "von Abercron", "Abetz", "Abicht", "Abromeit", ...)
I want to find all street names that contain a name from d_vector anywhere in the street name.
First, I made some general adjustments after importing the CSV data (as variable d):
d_vector <- unlist(d$name)
d_vector <- as.vector(as.matrix(d_vector))
What I tried so far
- Then I tried grep, collapsing d_vector into one long string separated by | for a regex search:
result <- unique(grep(paste(d_vector, collapse="|"), muc$Name, value=TRUE, ignore.case = TRUE))
result
But the result returns all the street names.
I also tried agrep, which returned an "Out of memory" error.
When I tried
d_vector %in% muc$name
it returned just one TRUE and hundreds of FALSE, which doesn't seem right.
Do you have any suggestion as to where my mistake could lie, or which library I could use? I am looking for something like Python's "fuzzywuzzy" for R.
Answer 1:
Simple solution:
streets = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen" , "Altostraße")
streets = tolower(streets) #Lowercase all
names = c("Berber", "Weg")
names = tolower(names)
sapply(names, function (y) sapply(streets, function (x) grepl(y, x)))
#                    berber   weg
# berberichweg         TRUE  TRUE
# otto-klemperer-weg  FALSE  TRUE
# feldmeierbogen      FALSE FALSE
# altostraße          FALSE FALSE
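The matrix above has one row per street and one column per name, so it can be reduced to the list of streets the question asks for. The following is a minimal sketch under a few assumptions: the variable names names_to_find, hits and matched_streets are made up for illustration, and fixed = TRUE is added so that each name is treated as literal text rather than as a regular expression.

```r
streets <- c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen", "Altostraße")
names_to_find <- c("Berber", "Weg")  # hypothetical stand-in for d_vector

# One column per name, one row per street. Lowercasing both sides makes the
# comparison case-insensitive; fixed = TRUE disables regex interpretation.
hits <- sapply(tolower(names_to_find),
               function(n) grepl(n, tolower(streets), fixed = TRUE))
rownames(hits) <- streets

# Keep every street whose row contains at least one TRUE.
matched_streets <- streets[rowSums(hits) > 0]
matched_streets
```

With 6400 streets and about 1400 names the matrix holds roughly nine million logical values, which should still fit in memory on a typical machine.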
Answer 2:
In principle, your solution works fine with some dummy data:
streets = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen",
"Konrad-Adenauer-Platz", "anotherThing")
patterns = c("weg", "platz")
unique(grep(paste(patterns, collapse="|"), streets, value=TRUE, ignore.case = TRUE))
# [1] "Berberichweg" "Otto-Klemperer-Weg" "Konrad-Adenauer-Platz"
I think something is not quite right with d_vector. Try checking class(d_vector) or dput(d_vector) and paste the output here.
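One possible cause, offered only as a guess because the real contents of d_vector are not shown: if the vector contains empty strings or NA values, the collapsed pattern gets an empty alternative (which can match every input) or the literal text "NA" (which, with ignore.case = TRUE, matches any street containing "na"). A minimal sketch of cleaning the vector before building the pattern, using made-up example data:

```r
# Hypothetical stand-in for d_vector, with the kinds of entries that cause trouble.
d_vector <- c("Abel", "", NA, " Abetz ", "Abel")
streets <- c("Abelweg", "Feldmeierbogen", "Nadiweg")

# paste() turns NA into the text "NA" and keeps the empty string,
# so the uncleaned pattern would be "Abel||NA| Abetz |Abel".
dirty_pattern <- paste(d_vector, collapse = "|")

# Trim whitespace, then drop NA values, empty strings and duplicates.
clean <- trimws(as.character(d_vector))
clean <- unique(clean[!is.na(clean) & nzchar(clean)])

pattern <- paste(clean, collapse = "|")
result <- grep(pattern, streets, value = TRUE, ignore.case = TRUE)
result
```

If some names contain regex metacharacters such as a dot or parentheses, they would also need escaping, or a per-name search with fixed = TRUE could be used instead.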
You can also try sapply and see whether that works:
matches = sapply(patterns, function(p) grep(p, streets, value=TRUE, ignore.case = TRUE))
# $weg
# [1] "Berberichweg" "Otto-Klemperer-Weg"
#
# $platz
# [1] "Konrad-Adenauer-Platz"
unique(unlist(matches))
# [1] "Berberichweg" "Otto-Klemperer-Weg" "Konrad-Adenauer-Platz"
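With around 1400 patterns, most entries of such a list would probably be empty. As a small sketch (the extra pattern "allee" is invented to show a pattern with no match), the empty entries can be dropped so that only the patterns that actually matched remain, together with their streets:

```r
streets <- c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen",
             "Konrad-Adenauer-Platz", "anotherThing")
patterns <- c("weg", "platz", "allee")  # "allee" is a made-up non-matching pattern

# simplify = FALSE guarantees a named list, whatever the match counts are.
matches <- sapply(patterns,
                  function(p) grep(p, streets, value = TRUE, ignore.case = TRUE),
                  simplify = FALSE)

# Filter() keeps the elements for which length() is non-zero.
matches <- Filter(length, matches)
names(matches)
```

The names of the remaining list elements show which patterns were found, which can help when checking why an unexpected street ended up in the result.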
Source: https://stackoverflow.com/questions/38371321/find-matching-strings-between-two-vectors-in-r