I am currently exploring the possibility of extracting country names from author affiliations (PubMed articles). My sample data looks like:
Mechanical and Produc
@Andrie's answer is nice, but it misses cities and countries whose names are more than one word, e.g. New Zealand or New York. The second example is a concern, as it would be labelled as a match to York, UK rather than New York, USA.
This alternative should capture those cases a bit better.
library(maps)
library(plyr)
# Load data from package maps
data(world.cities)
# Create test data
aa <- c(
"Mechanical and Production Engineering Department, National University of Singapore.",
"Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
"Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
"Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
)
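# Split each affiliation into its comma-separated fields
saa <- strsplit(aa, split = ", ", fixed = TRUE)
# Keep the fields that exactly match a city name
llply(saa, function(x) x[x %in% world.cities$name])
# Keep the fields that exactly match a country name
llply(saa, function(x) x[x %in% world.cities$country.etc])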
The downside is that any entry without a separate country or city field is not going to return anything, e.g. the National University of Singapore example.
Cities:
[[1]]
character(0)
[[2]]
[1] "Cambridge"
[[3]]
[1] "Cambridge"
[[4]]
[1] "Indianapolis"
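One detail of the test data is worth noting: the last field of each affiliation keeps its trailing period ("U.K.", "UK."), so an exact `%in%` comparison would not match a lookup entry stored as "UK". Below is a minimal sketch of a cleanup step. `known_countries` is a small hand-written, hypothetical stand-in for `world.cities$country.etc`, and `clean_fields` is a hypothetical helper, not part of any package.

```r
# Hypothetical stand-in for unique(world.cities$country.etc); swap in the real column
known_countries <- c("UK", "USA", "Singapore", "New Zealand")

aa <- c(
  "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
  "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK."
)

# Split on ", " and drop a single trailing period from every field
clean_fields <- function(x) {
  fields <- strsplit(x, ", ", fixed = TRUE)[[1]]
  sub("\\.$", "", fields)
}

saa_clean <- lapply(aa, clean_fields)

# Keep the cleaned fields that exactly match a known country
country_hits <- lapply(saa_clean, function(x) x[x %in% known_countries])
```

With this stand-in list, the "UK." entry now matches, while "U.K." becomes "U.K" and still does not. Removing every period (for example with `gsub(".", "", x, fixed = TRUE)`) would catch that case too, at the cost of altering any other field that contains periods.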
That is less of an issue for me than the multi-word city/country problem. Choose whichever is a better fit for your data. Maybe there's a way of combining the two?
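One possible way of combining the two is sketched below, assuming the other approach matches on individual words: try whole comma-separated fields first, and fall back to word-level matching only when no field matches. `known_places` is a hypothetical stand-in for the city and country names in `world.cities`, `find_places` is a hypothetical helper, and the New York affiliation is an invented test string.

```r
# Hypothetical stand-in for c(world.cities$name, world.cities$country.etc)
known_places <- c("Singapore", "Cambridge", "Indianapolis", "York", "New York")

find_places <- function(x, places) {
  # First pass: whole comma-separated fields, trailing period removed
  fields <- sub("\\.$", "", strsplit(x, ", ", fixed = TRUE)[[1]])
  hits <- fields[fields %in% places]
  if (length(hits) > 0) return(hits)

  # Fallback: individual words, with punctuation stripped
  words <- strsplit(gsub("[[:punct:]]", "", x), "[[:space:]]+")[[1]]
  unique(words[words %in% places])
}

# No field matches, so the word-level fallback finds "Singapore"
find_places("Mechanical and Production Engineering Department, National University of Singapore.", known_places)

# The field "New York" matches first, so "York" is never considered
find_places("Department of Medicine, New York, USA.", known_places)
```

Because the field pass runs first, a multi-word name is only split into words when nothing matched as a whole field. That keeps New York from being reported as York while still picking up the Singapore example.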