Using R to scrape tables when URL does not change

Submitted by 爱⌒轻易说出口 on 2019-12-25 03:26:40

Question


I'm relatively new to scraping in R and have had great luck using "rvest", but I've run into an issue I cannot solve.

The website I am trying to scrape keeps the same URL no matter which page of the table you are on. For example, the main page is www.blah.com and holds one main table, which has 10 further "next" pages of the same table, just continuing in order (I apologize for not linking to the actual page; I cannot due to work restrictions).

So, if I'm on page 1 of the table, the URL is www.blah.com. If I'm on page 2 of the table the URL is www.blah.com and so on... The URL never changes.

Here is my code so far. I'm using a combination of rvest and PhantomJS. The code works perfectly, but it only retrieves page 1 of the table, not the corresponding 10 "next" pages:

library(rvest)

url <- "http://www.blah.com"

# Write a PhantomJS script that loads the page and prints its rendered HTML
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
   console.log(page.content); // page source
   phantom.exit();
});", url), con = "scrape.js")

# Run PhantomJS and capture the rendered page to a file
system("phantomjs scrape.js > scrape.html")

page <- read_html("scrape.html")
page %>% html_nodes("td:nth-child(4)") %>% html_text()

And, this is the HTML code for page 2 of the table from the website (all other pages of the table are identical except for replacing the 2 with 3 and so on up the list):

<li><a href="#" id="p_2">2</a></li>
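Since those "next" links are plain "#" anchors handled by JavaScript, one possible workaround is to have PhantomJS click the link for each page before dumping the rendered HTML, then parse each dump with rvest. Below is a minimal sketch along those lines; the page count of 10, the 2-second wait for the table to re-render, and the "p_%d" id pattern are assumptions based on the snippet above and may need adjusting for the real site.

```r
library(rvest)

url <- "http://www.blah.com"
n_pages <- 10  # assumed number of table pages -- adjust to the real count

results <- lapply(seq_len(n_pages), function(i) {
  # For page 1 just load; for later pages, click the "#p_<i>" link first
  click_js <- if (i == 1) "" else
    sprintf("document.getElementById('p_%d').click();", i)

  writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
  page.evaluate(function () { %s });
  window.setTimeout(function () {
    console.log(page.content); // rendered source after the click
    phantom.exit();
  }, 2000); // crude wait for the table to refresh -- tune as needed
});", url, click_js), con = "scrape.js")

  system("phantomjs scrape.js > scrape.html")
  read_html("scrape.html") %>% html_nodes("td:nth-child(4)") %>% html_text()
})

all_values <- unlist(results)
```

A fixed `setTimeout` is fragile; if the site exposes some "loading" indicator, polling for its disappearance inside the PhantomJS script would be more robust than a hard-coded delay.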

Thanks so much for any advice/help you can give!

Source: https://stackoverflow.com/questions/28438010/using-r-to-scrape-tables-when-url-does-not-change
