How can I extract info from an XML page with R?

Question

I'm trying to get all the info from this page: http://ws.parlament.ch/affairs/19110758/?format=xml

First I download the file to destfile and then parse it with xmlParse().

download.file(url = "http://ws.parlament.ch/affairs/19110758/?format=xml", destfile = destfile)
file <- xmlParse(destfile)

I now want to extract all the information I need, for example the title and the ID number. I tried something like this:

title <- xpathSApply(file, "//h2", xmlValue)

But this only gives me an error: unable to find an inherited method for function ‘saveXML’ for signature ‘"XMLDocument"’

Next thing I tried is this:

library(plyr)

test <- ldply(xmlToList(file), function(x) { data.frame(x[!names(x) == "id"]) })

This gives me a data.frame with some info, but I lose fields such as id (which is the most important one).

I'd like to get a data.frame with one row per affair, containing all the information of that affair, such as id, updated, additionalIndexing, affairType, etc.

With this, it works (example for id):

infofile <- xmlRoot(file)

nodes <- getNodeSet(file, "//affair/id")
id <- as.numeric(lapply(nodes, function(x) xmlSApply(x, xmlValue)))
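To get one row per affair, one option is to collect the text of each direct child element of the affair node into a named vector and turn that into a data.frame. This is a minimal sketch, assuming the document parses as XML with affair nodes whose children are simple text fields. The inline sample is a made-up stand-in for the real response (the updated value is invented), and affair_to_row is a hypothetical helper name:

```r
library(XML)

# Hypothetical stand-in for the real response; the 'updated' value is invented.
sample_xml <- "<affair><id>19110758</id><updated>2014-01-01</updated><title>Heilmittelwesen. Gesetzgebung</title></affair>"
doc <- xmlParse(sample_xml, asText = TRUE)

# Turn one <affair> node into a one-row data.frame:
# each direct child element becomes a column named after the element.
affair_to_row <- function(node) {
  kids <- getNodeSet(node, "./*")                      # element children only
  values <- vapply(kids, xmlValue, character(1))       # text of each child
  names(values) <- vapply(kids, xmlName, character(1)) # element names as column names
  as.data.frame(as.list(values), stringsAsFactors = FALSE)
}

rows <- lapply(getNodeSet(doc, "//affair"), affair_to_row)
affairs <- do.call(rbind, rows)
affairs$id <- as.numeric(affairs$id)
```

Nested children (for example an affairType with its own sub-elements) would be flattened into a single string by xmlValue, so they may need separate handling.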

Answer 1:

This will get you to your XML:

library(XML)
library(RCurl)
library(httr)

srcXML <- getURL("http://ws.parlament.ch/affairs/19110758/?format=xml", 
            .opts=c(user_agent("Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"),
              verbose()))

myXMLFile <- xmlTreeParse(substr(srcXML,4,nchar(srcXML)))

I would have used just GET() from httr, but it doesn't seem to pass the user agent along well (I need to test it when I'm not behind a proxy to be sure what the specific error is). I also used substr() because there are a few stray characters at the front that make the xmlTreeParse() call fail.
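Instead of hard-coding the offset of 4, one could drop everything that precedes the first `<`. This is a minimal base-R sketch, assuming the stray leading characters are something like a byte-order mark (the answer does not say what they are); the sample string is made up:

```r
# Hypothetical response body with three junk characters in front of the markup.
raw_body <- "???<affair><id>19110758</id></affair>"

# Remove any characters that come before the first '<'.
clean_body <- sub("^[^<]*", "", raw_body)
```

The cleaned string can then be handed to the parser in place of the substr() result.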

Answer 2:

It is an HTML file, not an XML file. You need to use htmlParse:

destfile <- tempfile() # make this example copy-pasteable
download.file(url = "http://ws.parlament.ch/affairs/19110758/?format=xml", destfile = destfile)
file <- htmlParse(destfile)
title <- xpathSApply(file, '//h2')
xmlValue(title[[1]])
# [1] "Heilmittelwesen. Gesetzgebung"
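As a variation, xmlValue can be passed straight to xpathSApply so that the text of every match comes back as a character vector. This is a small sketch on an inline HTML string, which is a made-up stand-in for the downloaded page:

```r
library(XML)

# Hypothetical stand-in for the downloaded page.
page <- "<html><body><h2>Heilmittelwesen. Gesetzgebung</h2></body></html>"
doc <- htmlParse(page, asText = TRUE)

# Apply xmlValue to each matching node and collect the results.
titles <- xpathSApply(doc, "//h2", xmlValue)
```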

Source: https://stackoverflow.com/questions/22717412/how-can-i-extract-info-from-xml-page-with-r

Tags

xml

xml-parsing