R not accepting xpath query

不知归路 2021-01-28 12:51

Hi, I am using the XML package in R to scrape HTML pages. The page of interest is http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta and on that page there is a sequence

4 answers
  •  悲&欢浪女
    2021-01-28 13:31

    @brucezepplin, I feel your frustration. @Mathias Muller, I worked with what you wrote and ran the following:

    library(XML)

    test <- "http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta"
    # Note: asText = TRUE makes the parser treat `test` as HTML content, not as a URL to fetch
    doc <- htmlTreeParse(test, asText = TRUE, useInternalNodes = TRUE)
    # Three attempts at selecting the sequence nodes
    xpathSApply(doc, "//div[@id = 'viewercontent1']", xmlValue)
    xpathSApply(doc, "//div[@id = 'viewercontent1']//span[@id = 'gi_225903367_1']", xmlValue)
    xpathSApply(doc, "//div[@id = 'viewercontent1']/gi/span", xmlValue)
    

    First, when I looked at "doc" it only showed a couple of header lines, not the full page.
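One possible reason `doc` looks so small: with `asText = TRUE`, `htmlTreeParse` treats its first argument as the HTML itself rather than as an address to download, so the parser only ever sees the URL string. Here is a minimal sketch to check this without any network access (the names `url_string`, `doc_text`, and `div_nodes` are only illustrative):

```r
library(XML)

url_string <- "http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta"

# asText = TRUE: the string is parsed as HTML content; nothing is downloaded
doc_text <- htmlTreeParse(url_string, asText = TRUE, useInternalNodes = TRUE)

# Count the <div> elements in the parsed tree
div_nodes <- getNodeSet(doc_text, "//div")
length(div_nodes)

# Serialize the tree to see what was actually parsed
cat(saveXML(doc_text))
```

If the page is meant to be fetched, dropping `asText = TRUE` lets the parser treat the string as a location instead. Whether the sequence is present in the HTML that comes back is a separate question.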

    But the first xpath returned list(), so at least it was functioning. The next two returned NULL. There is a

     before the desired span nodes as well as a >gi.
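To illustrate why the third query cannot match: in XPath, `/gi/span` asks for a child element named `gi`, while `>gi` on the page is plain text. The sketch below uses a small made-up HTML string as a stand-in for the page (the `pre` wrapper, the header text, and the sequence fragments are all assumptions, not copied from the real page) and compares that query with ones that use the descendant axis `//`:

```r
library(XML)

# Hypothetical stand-in for the page; structure and sequence text are invented
html <- paste0(
  '<html><body><div id="viewercontent1"><pre>&gt;gi|225903367 example header\n',
  '<span id="gi_225903367_1">MKTAYIAKQR</span>',
  '<span id="gi_225903367_2">QISFVKSHFS</span>',
  '</pre></div></body></html>'
)
doc <- htmlParse(html, asText = TRUE)

# Looks for an element named <gi> directly under the div: none exists
no_match <- xpathSApply(doc, "//div[@id = 'viewercontent1']/gi/span", xmlValue)

# Descendant axis: finds the span at any depth below the div
by_id <- xpathSApply(doc,
  "//div[@id = 'viewercontent1']//span[@id = 'gi_225903367_1']", xmlValue)

# Collect every span under the div and join the pieces into one string
all_spans <- xpathSApply(doc, "//div[@id = 'viewercontent1']//span", xmlValue)
sequence <- paste(all_spans, collapse = "")
```

This only shows that the queries behave as expected on a document that really contains the nodes. It does not establish that the live page delivers them in its static HTML.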

    In short, this is not an answer, but perhaps it will make it easier for someone else to provide a solution.
