Scraping a complex HTML table into a data.frame in R

不问归期 提交于 2019-11-29 04:33:07

Maybe like this

library(XML)
library(rvest)
html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
# [1] "Wilson, JamesJames Wilson"       "Jay, JohnJohn Jay†"              "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."     
# [5] "Rutledge, JohnJohn Rutledge"     "Iredell, JamesJames Iredel

removeNodes(getNodeSet(html, "//table/tr/td[2]/span"))
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
# [1] "James Wilson"    "John Jay†"       "William Cushing" "John Blair, Jr." "John Rutledge"   "James Iredell" 

You could use rvest

library(rvest)

html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")%>%   
  html_nodes("span+ a") %>% 
  html_text()

It's not perfect so you might want to refine the css selector but it gets you fairly close.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!