Extracting city and state information from a Google street address

Submitted by 岁酱吖の on 2020-08-25 06:56:07

Question


I have a data set that contains lat/long information for different point locations, and I would like to know which city and state are associated with each point.

Following this example I used the revgeocode function from ggmap to obtain a street address for each location, producing the data frame below:

df <- structure(list(PointID = c(1787L, 2805L, 3025L, 3027L, 3028L, 
3029L, 3030L, 3031L, 3033L), Latitude = c(38.36648102, 36.19548585, 
43.419774, 43.437222, 43.454722, 43.452643, 43.411949, 43.255479, 
43.261464), Longitude = c(-76.4802046, -94.21554661, -87.960399, 
-88.018333, -87.974722, -87.978542, -87.94149, -87.986433, -87.968612
), Address = structure(c(2L, 8L, 5L, 3L, 9L, 7L, 4L, 1L, 6L), .Label = c("13004 N Thomas Dr, Mequon, WI 53097, USA", 
"2160 Turner Rd, Lusby, MD 20657, USA", "2805 County Rd Y, Saukville, WI 53080, USA", 
"3701-3739 County Hwy W, Saukville, WI 53080, USA", "3907 Echo Ln, Saukville, WI 53080, USA", 
"4823 W Bonniwell Rd, Mequon, WI 53097, USA", "5100-5260 County Rd I, Saukville, WI 53080, USA", 
"7948 W Gibbs Rd, Springdale, AR 72762, USA", "River Park Rd, Saukville, WI 53080, USA"
), class = "factor")), row.names = c(NA, -9L), class = "data.frame", .Names = c("PointID", 
"Latitude", "Longitude", "Address"))

I would like to use R to extract the city/state information from the full street address and create two columns to store this information ("City" and "State").

I'm assuming the stringr package is the way to go, but I'm not sure how to use it. The example above used the following code to extract the zip code from the address column (named "result" in that example). Their data set:

#       ID Longitude  Latitude                                         result
# 1 311175  41.29844 -72.92918 16 Church Street South, New Haven, CT 06519, USA
# 2 292058  41.93694 -87.66984  1632 West Nelson Street, Chicago, IL 60657, USA
# 3  12979  37.58096 -77.47144    2077-2199 Seddon Way, Richmond, VA 23230, USA

And the code to extract the zip code:

library(stringr)
data$zipcode <- substr(str_extract(data$result," [0-9]{5}, .+"),2,6)
data[,-4]

Is it possible to easily modify the above code to get the city and state data?


Answer 1:


You can get the city and state using revgeocode() itself:

df <- cbind(df,do.call(rbind,
               lapply(1:nrow(df),
               function(i) 
               revgeocode(as.numeric(
               df[i,3:2]), output = "more")[c("administrative_area_level_1","locality")])))

df

#   PointID Latitude Longitude                                          Address 
# 1    1787 38.36648 -76.48020             2160 Turner Rd, Lusby, MD 20657, USA 
# 2    2805 36.19549 -94.21555       7948 W Gibbs Rd, Springdale, AR 72762, USA 
# 3    3025 43.41977 -87.96040           3907 Echo Ln, Saukville, WI 53080, USA 
# 4    3027 43.43722 -88.01833       2805 County Rd Y, Saukville, WI 53080, USA 
# 5    3028 43.45472 -87.97472          River Park Rd, Saukville, WI 53080, USA 
# 6    3029 43.45264 -87.97854  5100-5260 County Rd I, Saukville, WI 53080, USA 
# 7    3030 43.41195 -87.94149 3701-3739 County Hwy W, Saukville, WI 53080, USA 
# 8    3031 43.25548 -87.98643         13004 N Thomas Dr, Mequon, WI 53097, USA 
# 9    3033 43.26146 -87.96861       4823 W Bonniwell Rd, Mequon, WI 53097, USA 
#   administrative_area_level_1   locality 
# 1                    Maryland      Lusby 
# 2                    Arkansas Springdale 
# 3                   Wisconsin  Saukville 
# 4                   Wisconsin  Saukville 
# 5                   Wisconsin  Saukville 
# 6                   Wisconsin  Saukville 
# 7                   Wisconsin  Saukville 
# 8                   Wisconsin     Mequon 
# 9                   Wisconsin     Mequon

P.S. You can do everything (including getting the address and/or zip code) in one step. Just add "address" and/or "postal_code" to c("administrative_area_level_1","locality"), which is the vector of variables that you want to extract.




Answer 2:


If you feel like using stringr, you can do this:

library(stringr)
library(data.table)

parse_address <- function(address){

  address <- address %>% 
    str_split(",") %>% 
    .[[1]]
  state <- address %>% 
    .[3] %>% 
    str_replace_all("[^A-Z]","")

  zip <- address %>% 
    .[3] %>% 
    str_replace_all("[^0-9]","")

  city <- address %>% 
    .[2] %>% 
    str_trim()

  street <- address %>% 
    .[1] %>% 
    str_trim()

  data.table(street, city, state, zip)
}

# Address is stored as a factor in df, so convert it to character before parsing
lapply(as.character(df$Address), parse_address) %>% 
  rbindlist()
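
As a possible vectorized variant of the function above, here is a minimal sketch using str_match() from stringr. It assumes every address follows the same "street, city, ST zip, USA" layout as df$Address; the sample vector addr and the data frame parsed are hypothetical names used only for illustration. Rows that do not fit the layout come back as NA rather than being mis-split.

```r
library(stringr)

# Hypothetical sample in the same layout as df$Address,
# plus one string that does not follow the layout
addr <- c("2160 Turner Rd, Lusby, MD 20657, USA",
          "7948 W Gibbs Rd, Springdale, AR 72762, USA",
          "not an address")

# Capture groups: 1 = city, 2 = two-letter state, 3 = five-digit zip.
# str_match() returns a character matrix: column 1 is the full match,
# columns 2 onward are the capture groups.
m <- str_match(addr, ", ([^,]+), ([A-Z]{2}) (\\d{5}), USA$")

parsed <- data.frame(City = m[, 2], State = m[, 3], Zip = m[, 4],
                     stringsAsFactors = FALSE)
parsed
```

To attach the result to the original data, something like cbind(df, parsed) should work once addr is replaced by as.character(df$Address).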



Answer 3:


1) sub: Use sub like this. No packages needed.

The regular expression matches the start (^) followed by the shortest string (the street) up to a comma and space, then the shortest string (representing the city) up to another comma and space, then two characters (representing the state), a space, 5 characters (representing the zip code), a comma, a space, USA and the end of the string. The matches to the parenthesized portions can be referenced via \1, \2 and \3, but within a double-quoted R string each \ must be doubled.

If your zip codes are not all 5 digits, try pat <- "^.*?, (.*?), (..) (.*), USA$" instead.

pat <- "^.*?, (.*?), (..) (.....), USA$"
transform(df, City = sub(pat, "\\1", Address), 
              State = sub(pat, "\\2", Address), 
              Zip = sub(pat, "\\3", Address))

giving:

  PointID Latitude Longitude                                          Address       City State   Zip
1    1787 38.36648 -76.48020             2160 Turner Rd, Lusby, MD 20657, USA      Lusby    MD 20657
2    2805 36.19549 -94.21555       7948 W Gibbs Rd, Springdale, AR 72762, USA Springdale    AR 72762
3    3025 43.41977 -87.96040           3907 Echo Ln, Saukville, WI 53080, USA  Saukville    WI 53080
4    3027 43.43722 -88.01833       2805 County Rd Y, Saukville, WI 53080, USA  Saukville    WI 53080
5    3028 43.45472 -87.97472          River Park Rd, Saukville, WI 53080, USA  Saukville    WI 53080
6    3029 43.45264 -87.97854  5100-5260 County Rd I, Saukville, WI 53080, USA  Saukville    WI 53080
7    3030 43.41195 -87.94149 3701-3739 County Hwy W, Saukville, WI 53080, USA  Saukville    WI 53080
8    3031 43.25548 -87.98643         13004 N Thomas Dr, Mequon, WI 53097, USA     Mequon    WI 53097
9    3033 43.26146 -87.96861       4823 W Bonniwell Rd, Mequon, WI 53097, USA     Mequon    WI 53097
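
One caveat worth illustrating: sub() returns its input unchanged when the pattern does not match, so an address in a different layout would be copied whole into City, State and Zip. The sketch below shows one possible guard using grepl(); the vector addr and the helper extract_group are hypothetical and not part of the answer above.

```r
pat <- "^.*?, (.*?), (..) (.....), USA$"

# Hypothetical input: one address in the expected layout, one that is not
addr <- c("3907 Echo Ln, Saukville, WI 53080, USA", "Lake Michigan")

# TRUE where the whole pattern matches
ok <- grepl(pat, addr)

# Hypothetical helper: return capture group `i`, or NA where there is no match
extract_group <- function(i) {
  ifelse(ok, sub(pat, paste0("\\", i), addr), NA_character_)
}

city  <- extract_group(1)
state <- extract_group(2)
zip   <- extract_group(3)
```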

2) read.pattern: Another possibility is read.pattern from the gsubfn package, with the same pat as above:

library(gsubfn)

cn <- c("City", "State", "Zip")
Address <- as.character(df$Address)
cbind(df, read.pattern(text = Address, pattern = pat, as.is = TRUE, col.names = cn))
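
If gsubfn is not available, a similar single-pass extraction can be sketched in base R with regexec() and regmatches(). This is only an approximation of what read.pattern does, it assumes every address matches the pattern, and it passes perl = TRUE (a choice made here, not in the answer above). The names addr, m, parts and out are hypothetical.

```r
pat <- "^.*?, (.*?), (..) (.....), USA$"

# Hypothetical sample in the same layout as df$Address
addr <- c("13004 N Thomas Dr, Mequon, WI 53097, USA",
          "River Park Rd, Saukville, WI 53080, USA")

# Each list element holds the full match followed by the three capture groups
m <- regmatches(addr, regexec(pat, addr, perl = TRUE))

# Drop the full match and stack the groups into a 3-column character matrix
parts <- do.call(rbind, lapply(m, function(x) x[-1]))

out <- setNames(as.data.frame(parts, stringsAsFactors = FALSE),
                c("City", "State", "Zip"))
out
```

Unlike read.pattern, this keeps Zip as character, which preserves any leading zeros.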


Source: https://stackoverflow.com/questions/45723974/extracting-city-and-state-information-from-a-google-street-address
