I have been scraping various pages of basketball-reference.com for a while now in R with the XML package, using readHTMLTable(), without any issues, but now I have run into one. When I try to scrape the splits section of a player's page, it returns only the first row of the table rather than all of them.
For example:
library(XML)
URL = "http://www.basketball-reference.com/players/j/jamesle01/splits/"
tablefromURL = readHTMLTable(URL)
table = tablefromURL[[1]]
This gives me only the first row of the table, but I want all of the rows. I think the problem is that the table contains multiple header rows, but I'm not sure how to fix that.
Thanks
Why not try the rvest library? You can accomplish this with:
library(rvest)
dd <- html_session("http://www.basketball-reference.com/players/j/jamesle01/splits/") %>%
  html_node("table#stats") %>%
  html_table()
It's still a bit messy with the headers mixed in the data, but it does extract the entire table.
Tested with
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
other attached packages:
[1] rvest_0.2.0
loaded via a namespace (and not attached):
[1] httr_0.6.1 magrittr_1.5 stringr_0.6.2
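The header rows that remain mixed into the data can be dropped after scraping. The following is only a sketch on a small made-up data frame (dd_toy and its values are hypothetical stand-ins for the scraped dd); it assumes a repeated header row is one where every cell equals its column name, which may not hold exactly for the two-level header on the real page.

```r
# Toy stand-in for the scraped data frame `dd` (hypothetical values):
# the header row reappears in the middle of the data.
dd_toy <- data.frame(
  Split = c("", "Place", "Split", "Result"),
  Value = c("Total", "Home", "Value", "Win"),
  G     = c("871", "441", "G", "572"),
  stringsAsFactors = FALSE
)

# Treat a row as a repeated header when every cell matches its column name.
is_header <- apply(dd_toy, 1, function(row) all(row == names(dd_toy)))

# Keep only the real data rows and renumber them.
clean <- dd_toy[!is_header, , drop = FALSE]
rownames(clean) <- NULL

# With the header text gone, the count column can be converted to integer.
clean$G <- as.integer(clean$G)
```

On the real table the same filter would be applied to dd instead of dd_toy, possibly with a looser test (for example, matching only the first column) if the header cells do not line up exactly.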
You can filter on the table bodies:
library(XML)
appURL <- "http://www.basketball-reference.com/players/j/jamesle01/splits/"
doc <- htmlParse(appURL)
appTables <- doc['//table/tbody']
appTables would be a list containing the 12 tables sans headers. To retrieve the headers, you can get them from the thead:
myHeaders <- unlist(doc["//thead/tr[2]/th", fun = xmlValue])
myTables <- lapply(appTables, readHTMLTable, header = myHeaders)
You can put the data in one big table using something like:
bigTable <- do.call(rbind, myTables)
> head(bigTable)
Split Value G GS MP FG FGA 3P 3PA FT FTA ORB TRB AST STL BLK TOV PF PTS FG% 3P% FT%
1 Total 871 870 34364 8582 17289 1184 3462 5553 7432 1049 6239 6011 1483 698 2906 1615 23901 .496 .342 .747
2 Place Home 441 440 17167 4201 8307 567 1627 2805 3706 507 3133 3082 711 387 1413 744 11774 .506 .348 .757
3 Road 430 430 17197 4381 8982 617 1835 2748 3726 542 3106 2929 772 311 1493 871 12127 .488 .336 .738
4 All-Star Pre 569 568 22349 5544 11167 759 2205 3576 4791 655 4051 3966 967 456 1940 1087 15423 .496 .344 .746
5 Post 302 302 12015 3038 6122 425 1257 1977 2641 394 2188 2045 516 242 966 528 8478 .496 .338 .749
6 Result Win 572 571 22196 5783 11094 772 2154 3749 4931 677 4241 4132 1032 496 1793 1016 16087 .521 .358 .760
TS% USG% ORtg DRtg MP PTS TRB AST
1 .581 31.9 116 103 39.5 27.4 7.2 6.9
2 .592 30.9 118 102 38.9 26.7 7.1 7.0
3 .571 32.8 114 105 40.0 28.2 7.2 6.8
4 .581 31.7 116 103 39.3 27.1 7.1 7.0
5 .582 32.2 117 104 39.8 28.1 7.2 6.8
6 .606 31.7 122 99 38.8 28.1 7.4 7.2
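Note that the combined table has repeated column names (MP, PTS, TRB and AST each appear twice, once as totals and once per game), and the scraped values arrive as text rather than numbers. A possible cleanup step, sketched here on a made-up data frame (toy is a hypothetical stand-in for bigTable), uses base R's make.unique() and type.convert():

```r
# Toy stand-in for bigTable (hypothetical data): duplicated column
# names and every value stored as text.
toy <- data.frame(
  c("Home", "Road"), c("441", "430"), c("17167", "17197"), c("38.9", "40.0"),
  stringsAsFactors = FALSE
)
names(toy) <- c("Value", "G", "MP", "MP")

# Make the names unique: the second "MP" becomes "MP.1".
names(toy) <- make.unique(names(toy))

# Convert columns that look numeric; text columns stay character.
toy[] <- lapply(toy, type.convert, as.is = TRUE)

str(toy)
```

After this step the per-game columns can be addressed unambiguously (for example toy$MP.1), and arithmetic on the numeric columns works as expected.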
Have a look at the htmltab package (https://github.com/crubba/htmltab). I developed this package for more complex HTML tables where readHTMLTable() is of little use.
devtools::install_github("crubba/htmltab")
library(htmltab)
htmltab(doc = "http://www.basketball-reference.com/players/j/jamesle01/splits/", header = 1:2)
Source: https://stackoverflow.com/questions/27831970/scraping-basketball-reference-com-in-r-xml-package-not-fully-working