Extract a specific key word from a string in R

Posted by 南楼画角 on 2019-12-11 02:40:45

Question


I have a column "place" in my table which contains data about a place that looks like:

{ "id" : "94965b2c45386f87", "name" : "New York", "boundingBoxCoordinates" : [ [ { "longitude" : -79.76259, "latitude" : 40.477383 }, { "longitude" : -79.76259, "latitude" : 45.015851 }, { "longitude" : -71.777492, "latitude" : 45.015851 }, { "longitude" : -71.777492, "latitude" : 40.477383 } ] ], "countryCode" : "US", "fullName" : "New York, USA", "boundingBoxType" : "Polygon", "URL" : "https://api.twitter.com/1.1/geo/id/94965b2c45386f87.json", "accessLevel" : 0, "placeType" : "admin", "country" : "United States" }

From this, I want to extract the country name. I have tried the following code:

loc <- t1$place
loc = gsub('"', '', loc)
loc = gsub(',', '', loc)

to clean up the string and now it looks like this:

"{ id : 00ed6f0947c230f4 name : Caloocan City boundingBoxCoordinates : [ [ { longitude : 120.9607709 latitude : 14.6344661 } { longitude : 120.9607709 latitude : 14.7873208 } { longitude : 121.1015117 latitude : 14.7873208 } { longitude : 121.1015117 latitude : 14.6344661 } ] ] countryCode : PH fullName : Caloocan City National Capital Region boundingBoxType : Polygon URL : https://api.twitter.com/1.1/geo/id/00ed6f0947c230f4.json accessLevel : 0 placeType : city country : Republika ng Pilipinas }"

Now to extract the country name, I want to use the word() function:

word(loc, n, sep=fixed(" : "))

where n is the position of the country name, which I have not counted yet. This function gives the correct output when n=1, but gives an error for any other value of n:

Error in word[loc, "start"] : subscript out of bounds

Why is that happening? The loc variable certainly contains more than one word separated by that delimiter. Or can someone suggest a better way of extracting the country name from that field?

EDIT: t1 is the data frame that holds my entire table. At present I am interested only in the place field of the table, which holds the information in the format shown above. Hence I am loading the place field into a separate variable called "loc" with a basic assignment:

loc <- t1$place

In order to read it as JSON, the place field needs to be delimited by single quotes, which it originally is not. I have 2 million rows in my table, so I really can't add the delimiters manually.


Answer 1:


This looks like a JSON object, so it would be easier to use a JSON parser to extract the data.

So if this is your string value

x <- '{ "id" : "94965b2c45386f87", "name" : "New York", "boundingBoxCoordinates" : [ [ { "longitude" : -79.76259, "latitude" : 40.477383 }, { "longitude" : -79.76259, "latitude" : 45.015851 }, { "longitude" : -71.777492, "latitude" : 45.015851 }, { "longitude" : -71.777492, "latitude" : 40.477383 } ] ], "countryCode" : "US", "fullName" : "New York, USA", "boundingBoxType" : "Polygon", "URL" : "https://api.twitter.com/1.1/geo/id/94965b2c45386f87.json", "accessLevel" : 0, "placeType" : "admin", "country" : "United States" }'

then you can do

library(jsonlite)
# or library(RJSONIO)
# or library(rjson)

fromJSON(x)$country
# [1] "United States"
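
The single quotes in the assignment to x above are only R's syntax for writing a string literal. A character column read from a table already holds strings, so nothing needs to be added to the data by hand. As a minimal sketch of applying the same idea to a whole column, assuming each row holds one JSON object as text: the vector place below is a shortened, made-up stand-in for t1$place, and get_country is a hypothetical helper, not part of jsonlite.

```r
library(jsonlite)

# Hypothetical stand-in for t1$place: one JSON string per row,
# plus a missing value and a truncated (invalid) record.
place <- c(
  '{ "id" : "94965b2c45386f87", "name" : "New York", "countryCode" : "US", "country" : "United States" }',
  '{ "id" : "00ed6f0947c230f4", "name" : "Caloocan City", "countryCode" : "PH", "country" : "Republika ng Pilipinas" }',
  NA,
  '{ "id" : "truncated"'
)

# Hypothetical helper: parse one string and return its "country" field,
# or NA when the value is missing, cannot be parsed, or has no such field.
get_country <- function(s) {
  if (is.na(s)) return(NA_character_)
  parsed <- tryCatch(fromJSON(s), error = function(e) NULL)
  if (is.list(parsed) && is.character(parsed$country) && length(parsed$country) == 1) {
    parsed$country
  } else {
    NA_character_
  }
}

# as.character() guards against the column having been read in as a factor.
countries <- vapply(as.character(place), get_country, character(1), USE.NAMES = FALSE)
countries
```

Rows that are missing or malformed come back as NA instead of stopping the run, which matters when there are millions of rows. Parsing row by row may be slow at that size, so it is worth timing on a sample first.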


Source: https://stackoverflow.com/questions/30268906/extract-a-specific-key-word-from-a-string-in-r
