Hi, I am using the XML package in R to scrape HTML pages. The page of interest is http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta and on that page there is a sequence
@brucezepplin, I feel your frustration. @Mathias Muller, I worked with what you wrote and ran the following:
test <- "http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta"
doc <- htmlTreeParse(test, asText = TRUE, useInternalNodes = TRUE)
xpathSApply(doc, "//div[@id = 'viewercontent1']", xmlValue)
xpathSApply(doc, "//div[@id = 'viewercontent1']//span[@id = 'gi_225903367_1']", xmlValue)
xpathSApply(doc, "//div[@id = 'viewercontent1']/gi/span", xmlValue)
First, when I looked at "doc" it only showed a couple of header lines, not the full page.
But the first xpath returned list(), so at least it was functioning. The next two returned NULL. There is another element before the desired span nodes, as well as a >gi.
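One thing that might explain why "doc" looked so short, offered as a guess rather than a diagnosis: with asText = TRUE, htmlTreeParse treats its first argument as the markup itself, so the URL string is parsed as if it were the page and nothing is fetched. The sketch below shows that, and then runs the same kind of XPath against a small made-up HTML string. Note that toy_html and its PLACEHOLDERSEQ text are hypothetical stand-ins, not the real NCBI markup, and the variable names are only for illustration.

```r
library(XML)

test <- "http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta"

# With asText = TRUE the first argument is taken to be the markup itself,
# so the parser only ever sees the URL string; nothing is downloaded.
doc_from_string <- htmlTreeParse(test, asText = TRUE, useInternalNodes = TRUE)
n_divs <- length(getNodeSet(doc_from_string, "//div[@id = 'viewercontent1']"))
n_divs  # no matching div, because the "page" is just the URL text

# Hypothetical stand-in markup (NOT the real NCBI page; the sequence
# text is a placeholder) to show the same XPath on an actual HTML string.
toy_html <- paste0(
  "<html><body><div id='viewercontent1'>",
  "<span id='gi_225903367_1'>PLACEHOLDERSEQ</span>",
  "</div></body></html>"
)
toy_doc <- htmlTreeParse(toy_html, asText = TRUE, useInternalNodes = TRUE)
toy_seq <- xpathSApply(
  toy_doc,
  "//div[@id = 'viewercontent1']//span[@id = 'gi_225903367_1']",
  xmlValue
)
toy_seq

# To parse the live page instead, drop asText so the URL is read as a
# location (commented out here because it needs network access):
# doc <- htmlTreeParse(test, useInternalNodes = TRUE)
```

If the XPath works on the toy string but still comes back empty on the live page, that would suggest the sequence is not in the HTML that gets downloaded, but I have not confirmed that.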
In short, this is not an answer, but perhaps it will make it easier for someone else to provide a solution.