This question is similar to a previous question, Import all fields (and subfields) of XML as dataframe, but I want to pull out only a subset of the XML data and want to incl
You can use xmlToList to convert the XML to a list, and then plyr to turn that list into a data frame:
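library(XML)
library(plyr)
xD <- xmlParse(xData)      # xData holds the XML from the question
xL <- xmlToList(xD)        # one list element per city
ldply(xL, data.frame)      # one row per city, buildings spread across columns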
> ldply(xL, data.frame)
   .id     name buildings.building.type buildings.building.bname
1 city   London                landmark             Tower Bridge
2 city New York                 station            Grand Central
3 city    Paris                landmark             Eiffel Tower
  buildings.building.type.1 buildings.building.bname.1
1                   station                   Waterloo
2                      <NA>                       <NA>
3                  landmark                     Louvre
You can then pick out the columns and rows you need from this data frame.
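As a rough sketch of that last step, the snippet below keeps only the landmark rows. To stay self-contained it rebuilds the wide data frame by hand from the output printed above instead of calling ldply; the names wide, long and landmarks are my own, and it assumes the type and bname columns appear in matching order.

```r
# Wide data frame copied by hand from the ldply output above (illustration only)
wide <- data.frame(
  .id  = rep("city", 3),
  name = c("London", "New York", "Paris"),
  buildings.building.type    = c("landmark", "station", "landmark"),
  buildings.building.bname   = c("Tower Bridge", "Grand Central", "Eiffel Tower"),
  buildings.building.type.1  = c("station", NA, "landmark"),
  buildings.building.bname.1 = c("Waterloo", NA, "Louvre"),
  stringsAsFactors = FALSE
)

# Find the repeated type/bname columns by name
type_cols  <- grep("\\.type",  names(wide), value = TRUE)
bname_cols <- grep("\\.bname", names(wide), value = TRUE)

# Stack them into one row per (city, building)
long <- data.frame(
  city  = rep(wide$name, times = length(type_cols)),
  type  = unlist(wide[type_cols],  use.names = FALSE),
  bname = unlist(wide[bname_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Keep only the landmarks; the NA padding rows are dropped as well
landmarks <- long[!is.na(long$type) & long$type == "landmark", c("city", "bname")]
landmarks
```

Note that a city with no landmarks (New York here) simply disappears from the result, so this differs from the answers below that keep it with an NA.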
There is a solution using xpathSApply, but writing the XPath here is a little complicated. So here I propose a solution using xmlToDataFrame and some regular expressions to extract the buildings.
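dd <- xmlToDataFrame(doc)
# mark each landmark with a comma, then drop everything from the first "station" on
rr <- gsub('landmark', ',', dd$buildings)
rr <- gsub('station.*', '', rr)
# split on the commas and discard the empty pieces
builds <- lapply(strsplit(rr, ','), function(x) x[nchar(x) > 0])
dd$buildings <- builds
dd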
name buildings
1 London Tower Bridge
2 New York
3 Paris Eiffel Tower, Louvre
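To see what the two gsub calls are doing, here is the same pipeline run on plain strings. The input strings are my assumption of what xmlToDataFrame yields for the buildings column (the child text concatenated with no whitespace); the variable names are illustrative only.

```r
# Assumed contents of dd$buildings: type and bname text run together per city
buildings <- c("landmarkTower BridgestationWaterloo",
               "stationGrand Central",
               "landmarkEiffel TowerlandmarkLouvre")

rr <- gsub("landmark", ",", buildings)   # each landmark name now follows a comma
rr <- gsub("station.*", "", rr)          # cut from the first "station" to the end
builds <- lapply(strsplit(rr, ","), function(x) x[nchar(x) > 0])
builds

# Caveat: a station listed BEFORE a landmark takes the landmark with it,
# because "station.*" removes everything up to the end of the string
reordered <- "stationWaterloolandmarkTower Bridge"
lost <- gsub("station.*", "", gsub("landmark", ",", reordered))
lost
```

So the regex approach relies on stations coming after landmarks within each city, which happens to hold for this data.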
If you're looking to exactly reproduce the desired output you showed in your question, you can convert your XML to a list and then extract the information you want:
xml_list <- xmlToList(xmlParse(xml_data))
First loop through each "building" node and remove those that contain "station":
xml_list <- lapply(xml_list, lapply, function(x) {
  x[!sapply(x, function(y) any(y == "station"))]
})
Then collect the data for each city into a data frame:
xml_list <- lapply(xml_list, function(x) {
  bldgs <- unlist(x$buildings)
  bldgs <- bldgs[bldgs != "landmark"]
  # no landmarks left: use NA so the city still gets a row
  if (length(bldgs) == 0) bldgs <- NA
  data.frame(
    "city" = x$name,
    "landmark" = bldgs,
    stringsAsFactors = FALSE)
})
Then combine information from all cities together:
xml_output <- do.call("rbind", xml_list)
xml_output
city landmark
city London Tower Bridge
city1 New York <NA>
city.1 Paris Eiffel Tower
city.2 Paris Louvre
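The row names in that output are inherited from the list names passed to rbind. If they get in the way, they can be reset afterwards. A minimal sketch, with the per-city data frames typed in by hand (per_city and out are hypothetical names) so that it runs without the XML package:

```r
# Hand-built stand-in for the list of per-city data frames
per_city <- list(
  city = data.frame(city = "London",   landmark = "Tower Bridge",
                    stringsAsFactors = FALSE),
  city = data.frame(city = "New York", landmark = NA_character_,
                    stringsAsFactors = FALSE),
  city = data.frame(city = "Paris",    landmark = c("Eiffel Tower", "Louvre"),
                    stringsAsFactors = FALSE)
)

out <- do.call("rbind", per_city)
rownames(out) <- NULL   # replace the inherited names with plain 1..n
out
```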
Assuming the XML data is in a file called world.xml, read it in and iterate over the cities, extracting the city name and the bname of any associated landmarks:
library(XML)
doc <- xmlParse("world.xml", useInternalNodes = TRUE)
do.call(rbind, xpathApply(doc, "/world/city", function(node) {
  city <- xmlValue(node[["name"]])
  # bname of every building under this city whose type is "landmark"
  xp <- "./buildings/building[./type/text()='landmark']/bname"
  landmark <- xpathSApply(node, xp, xmlValue)
  # covers both NULL and an empty result when a city has no landmarks
  if (length(landmark) == 0) landmark <- NA
  data.frame(city, landmark, stringsAsFactors = FALSE)
}))
The result is:
city landmark
1 London Tower Bridge
2 New York <NA>
3 Paris Eiffel Tower
4 Paris Louvre
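For anyone who wants to run the snippets above, here is one way to create a world.xml to test with. The content is a reconstruction, not the original input: the element names are taken from the XPath expressions and column names in the answers, and the values from the printed outputs.

```r
# Reconstructed test input (assumption: structure inferred from the answers above)
xml_lines <- c(
  "<world>",
  "  <city>",
  "    <name>London</name>",
  "    <buildings>",
  "      <building><type>landmark</type><bname>Tower Bridge</bname></building>",
  "      <building><type>station</type><bname>Waterloo</bname></building>",
  "    </buildings>",
  "  </city>",
  "  <city>",
  "    <name>New York</name>",
  "    <buildings>",
  "      <building><type>station</type><bname>Grand Central</bname></building>",
  "    </buildings>",
  "  </city>",
  "  <city>",
  "    <name>Paris</name>",
  "    <buildings>",
  "      <building><type>landmark</type><bname>Eiffel Tower</bname></building>",
  "      <building><type>landmark</type><bname>Louvre</bname></building>",
  "    </buildings>",
  "  </city>",
  "</world>"
)

# Write the file into the current working directory
writeLines(xml_lines, "world.xml")
```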