Read HTML Table Into Data Frame with Hyperlinks in R

Submitted by 白昼怎懂夜的黑 on 2019-12-25 10:57:06

Question


I am trying to read an HTML table from a publicly accessible website into a data frame in R. The final column of the table contains hyperlinks, and I would like to read these hyperlinks into the table rather than the text that is displayed on the webpage. I've reviewed several posts here on StackOverflow and on other sites and have gotten almost there, but I haven't been able to read the hyperlinks themselves.

The table I'm trying to read is here: http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey.

The final column contains hyperlinks that point to the actual data in *.ZIP file format for download. I've managed to read the table into R as text, but I can't figure out how to resolve the hyperlinks in the final column.

Here's what I have so far:

library(XML)
webURL <- 'http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey'
page <- htmlParse( webURL )
# Collect every <table> node from the parsed page
tableNodes <- getNodeSet( page, "//table" )
# Read the third table on the page into a data frame (cell text only)
myTable <- readHTMLTable( tableNodes[[3]] )

However, this contains the text in the final column, not the hyperlink. How do I replace the word "zip" in the final column of this table in R with the values for the corresponding hyperlink in each row?
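
One way to approach this with the XML package already in use is to query the anchor elements in the last cell of each row and take their href attribute. The sketch below runs against a small inline table instead of the live page. The HTML, its column names, and the download paths are made-up stand-ins, and the real page's markup may need a different XPath.

```r
library(XML)

# Hypothetical stand-in for the report table; not the real ERCOT markup
html <- '<table>
  <tr><th>Name</th><th>File</th></tr>
  <tr><td>report_a_csv.zip</td><td><a href="/download?id=1">zip</a></td></tr>
  <tr><td>report_b_xml.zip</td><td><a href="/download?id=2">zip</a></td></tr>
</table>'

doc <- htmlParse(html, asText = TRUE)
tbl <- getNodeSet(doc, "//table")[[1]]

# Displayed text of every cell, as in the question
myTable <- readHTMLTable(tbl, header = TRUE, stringsAsFactors = FALSE)

# href of each link found in the last cell of each row
hrefs <- xpathSApply(tbl, ".//tr/td[last()]//a", xmlGetAttr, "href")

# Overwrite the final column only when there is exactly one link per row
if (length(hrefs) == nrow(myTable)) {
  myTable[[ncol(myTable)]] <- hrefs
}
```

The length check guards against rows that have no link or more than one, in which case the hrefs would not line up with the table rows.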


Answer 1:


This code lets you target either the XML files or the CSV files. For each file you get the file name as well as the URL, so you can iterate over both and save the downloads under names you'll recognize later.

library(rvest)
library(dplyr)

pg <- read_html("http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey")

csv_fils <- html_nodes(pg, xpath=".//td[contains(@class, 'labelOptional_ind') and contains(., 'csv')]/..")

# tibble() replaces the deprecated data_frame(); dplyr re-exports it
tibble(
  fil_name = html_nodes(csv_fils, "td.labelOptional_ind") %>% html_text(),
  url = html_nodes(csv_fils, xpath=".//td[4]/div/a") %>% html_attr("href")
) -> csv_df

glimpse(csv_df)
## Observations: 1,560
## Variables: 2
## $ fil_name <chr> "cdr.00012300.0000000000000000.20170729.094015151.LMPSROSNODENP6788_20170729_094011_csv.zip", "cdr...
## $ url      <chr> "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923018", "/misdownload/servlets/mirD...

xml_fils <- html_nodes(pg, xpath=".//td[contains(@class, 'labelOptional_ind') and contains(., 'xml')]/..")

# tibble() replaces the deprecated data_frame(); dplyr re-exports it
tibble(
  fil_name = html_nodes(xml_fils, "td.labelOptional_ind") %>% html_text(),
  url = html_nodes(xml_fils, xpath=".//td[4]/div/a") %>% html_attr("href")
) -> xml_df

glimpse(xml_df)
## Observations: 1,560
## Variables: 2
## $ fil_name <chr> "cdr.00012300.0000000000000000.20170729.094015016.LMPSROSNODENP6788_20170729_094011_xml.zip", "cdr...
## $ url      <chr> "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923015", "/misdownload/servlets/mirD...
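
The url column holds site-relative paths, so they have to be resolved against the page address before anything can be downloaded. Here is a minimal sketch, assuming the xml2 package (installed alongside rvest) and using the first row of the glimpse() output above as sample data. download_all is a hypothetical helper; it is only defined here, not run, since it needs network access.

```r
library(xml2)

# Address of the page the links were scraped from (query string omitted)
base_url <- "http://mis.ercot.com/misapp/GetReports.do"

# First row copied from the glimpse() output above
sample_df <- data.frame(
  fil_name = "cdr.00012300.0000000000000000.20170729.094015151.LMPSROSNODENP6788_20170729_094011_csv.zip",
  url = "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923018",
  stringsAsFactors = FALSE
)

# Resolve the relative paths against the page address
sample_df$full_url <- url_absolute(sample_df$url, base_url)

# Hypothetical helper: save each file under its listed name
download_all <- function(df, dest_dir = ".") {
  for (i in seq_len(nrow(df))) {
    download.file(df$full_url[i],
                  destfile = file.path(dest_dir, df$fil_name[i]),
                  mode = "wb")  # binary mode so the zip is not corrupted
  }
  invisible(df)
}
```

Calling download_all(sample_df, tempdir()) would then attempt to fetch the file into a temporary directory.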



Answer 2:


I find using the rvest package easier than XML.

Here is a solution to obtain a list of the links:

webURL <- 'http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey'

library(rvest)

page <- read_html(webURL)
# Collect the href of every <a> element on the page, not only those in the table
links <- page %>% html_nodes("a") %>% html_attr("href")
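
Because this collects every anchor on the page, the result may include navigation links as well as the downloads. A small sketch of narrowing it down follows, assuming the download links can be recognized by the mirDownload path seen in the first answer's output. The inline HTML is a made-up stand-in for the real page.

```r
library(rvest)

# Hypothetical page fragment: one unrelated link plus two download links
html <- '<html><body>
  <a href="/help">Help</a>
  <table>
    <tr><td>a_csv.zip</td><td><a href="/misdownload/servlets/mirDownload?doclookupId=1">zip</a></td></tr>
    <tr><td>a_xml.zip</td><td><a href="/misdownload/servlets/mirDownload?doclookupId=2">zip</a></td></tr>
  </table>
</body></html>'

page <- read_html(html)
links <- page %>% html_nodes("a") %>% html_attr("href")

# Keep only hrefs that contain the download servlet path
download_links <- links[grepl("mirDownload", links, fixed = TRUE)]
```

Matching on a fixed substring keeps the filter simple. If the site uses several download paths, a regular expression or a more specific CSS selector may be a better fit.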


Source: https://stackoverflow.com/questions/45385500/read-html-table-into-data-frame-with-hyperlinks-in-r
