Scraping html table and its href Links in R

≯℡__Kan透↙ submitted on 2020-01-13 18:33:32

Question


I am trying to download a table that contains text and links. I can successfully download the table with the link text "Pass". However, instead of the text, I would like to capture the actual href URL.

library(dplyr)
library(rvest)
library(XML)
library(httr)
library(stringr)

link <- "http://www.qimedical.com/resources/method-suitability/"

qi_webpage <- read_html(link)

qi_table <- html_nodes(qi_webpage, 'table')
qi <- html_table(qi_table, header = TRUE)[[1]]
qi <- qi[,-1]

The code above gives a nice data frame. However, the last column only contains the text "Pass", whereas I would like to have the link associated with it. I have tried the following to add the links, but they do not correspond to the correct rows:

qi_get <- GET("http://www.qimedical.com/resources/method-suitability/")
qi_html <- htmlParse(content(qi_get, as="text"))

qi.urls <- xpathSApply(qi_html, "//*/td[7]/a", xmlAttrs, "href")
qi.urls <- qi.urls[1,]

qi <- mutate(qi, "MSTLink" = (ifelse(qi$`Study Protocol(click to download certification)` == "Pass", (t(qi.urls)), "")))

I know little about HTML, CSS, etc., so I am not sure what I am missing to accomplish this properly.

Thanks!!


Answer 1:


You're looking for a (anchor) elements inside table cells (td), and you want the value of each one's href attribute. Here is one way, which returns a character vector with all the URLs for the PDF downloads:

qi_webpage %>%
  html_nodes(xpath = "//td/a") %>% 
  html_attr("href")
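
That vector has one entry per link, so it may not line up with the table rows if some rows have no link, which appears to be the mismatch described in the question. A minimal sketch of one way to keep links aligned with rows is shown below. It runs on a small made-up table rather than the real page, and it assumes the link sits in the last cell of each row (td[7] on the page in the question). The html string, the tbl variable, and the file paths are hypothetical.

```r
library(rvest)

# Hypothetical stand-in for the real page: only some rows contain a link.
html <- '<table>
  <tr><th>Drug</th><th>Result</th></tr>
  <tr><td>A</td><td><a href="/files/a.pdf">Pass</a></td></tr>
  <tr><td>B</td><td>Pending</td></tr>
  <tr><td>C</td><td><a href="/files/c.pdf">Pass</a></td></tr>
</table>'
page <- read_html(html)

# Select data rows only: tr elements that contain at least one td,
# which skips the header row made of th cells.
rows <- html_nodes(page, xpath = "//table[1]//tr[td]")

# html_node() returns exactly one result per row (a missing node when
# there is no match), so html_attr() yields NA for rows without a link.
links <- rows %>%
  html_node(xpath = "./td[last()]/a") %>%
  html_attr("href")

# Attach the links to the parsed table; both have one entry per data row.
tbl <- html_table(html_nodes(page, "table"), header = TRUE)[[1]]
tbl$MSTLink <- links
```

The key difference from html_nodes() is that html_node() keeps a placeholder for rows with no match, so the result has the same length as the number of rows. If the href values on the real page are relative, they would still need to be combined with the site's base URL before downloading.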


Source: https://stackoverflow.com/questions/43926349/scraping-html-table-and-its-href-links-in-r
