R not accepting xpath query

不知归路 2021-01-28 12:51

Hi, I am using the XML package in R to scrape HTML pages. The page of interest is http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta and on that page there is a sequence I want to extract. When I query it with the XPath "//*[@id="viewercontent1"]/pre", R throws an "unexpected symbol" error.

4 Answers
  •  陌清茗 2021-01-28 13:43

    This gets the list, although I don't know whether it's 100% correct since I don't work with FASTA files. It looks like lapply(dat, cat) may need to be called on the dat result below; see the sketch after the code.

    > library(RCurl)
    > library(XML)
    > url <- getURL("http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta")
    > dat <- readHTMLList(url)
    > length(dat)
    # [1] 39
    > object.size(dat)
    # 42704 bytes
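
    A rough sketch of that lapply(dat, cat) step (untested here; which list element actually holds the FASTA text is not verified):

    > # print every element of dat so you can spot the one holding the sequence
    > invisible(lapply(dat, cat, sep = "\n"))
    > # then pull out just that element by its index, e.g. dat[[5]]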
    

    The whole list is not very big, so I'd recommend bringing the whole thing into R. Then you have all the relevant data, and you don't have to spend the whole day trying to regex an HTML document. The unexpected symbol might be triggered because you wrote //*, and that * may need escaping, possibly as //[*].

    Edit: the error you got was due to double quotation marks nested inside other double quotation marks. In R the XPath should be quoted as "//*[@id='viewercontent1']/pre", with single quotes around the attribute value.
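
    A minimal sketch of that quoting fix (it assumes the html fetched with getURL above is still in url, and that the served page really puts the sequence in that node):

    > doc <- htmlParse(url, asText = TRUE)
    > seqtxt <- xpathSApply(doc, "//*[@id='viewercontent1']/pre", xmlValue)
    > cat(seqtxt, sep = "\n")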

    Yes, XML can be fussy, but that's generally because (1) it's the internet, and (2) the parser expects certain things to be in the HTML code and sometimes they aren't. My professor wrote both RCurl and XML, and he recommends falling back to RCurl::getURL to fetch the document whenever XML::readHTMLTable or any of the other read* functions have trouble reading it directly; a rough sketch of that pattern follows.
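
    Something like this, as a rough sketch of the fallback (readHTMLTable is only an example target here; for this page readHTMLList is the one used above):

    > raw <- getURL("http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta")
    > doc <- htmlParse(raw, asText = TRUE)
    > tbls <- readHTMLTable(doc)   # or readHTMLList(doc), or xpathSApply(doc, ...)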

    The output you're seeing isn't strange: it's an empty result, which is what you'd expect from the functions that assign attributes.
