Scraping basketball-reference.com in R (XML package not fully working)

I have been scraping various pages of basketball-reference.com in R for a while with the XML package's readHTMLTable() without any issues, but now I have hit one. When I try to scrape the splits section of a player's page, it returns only the first row of the table instead of all of them.

For example:

library(XML)
URL <- "http://www.basketball-reference.com/players/j/jamesle01/splits/"
tablefromURL <- readHTMLTable(URL)
table <- tablefromURL[[1]]

This gives me only the first row of the table, but I want all of the rows. I think the problem is that the table contains multiple header rows, but I'm not sure how to fix that.

Thanks

Why not try the rvest package? You can accomplish this with:

library(rvest)
# html_session() is the name used in rvest 0.2.0; rvest >= 1.0.0 renamed it to session()
dd <- html_session("http://www.basketball-reference.com/players/j/jamesle01/splits/") %>%
    html_node("table#stats") %>%
    html_table()

It's still a bit messy with the headers mixed in the data, but it does extract the entire table.
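One way to tidy up the mixed-in header rows is sketched below. It runs on a small toy data frame rather than the live page, and it assumes the repeated headers show up as rows whose Value column holds the literal text "Value"; the real layout returned by html_table() may differ, so check it before relying on this.

```r
# Toy stand-in for the scraped table (an assumption, not the real output).
dd <- data.frame(
  Split = c("", "Place", "", "Split", "Result"),
  Value = c("Total", "Home", "Road", "Value", "Win"),
  G     = c("871", "441", "430", "G", "572"),
  stringsAsFactors = FALSE
)

# Flag rows that merely repeat the column names, then drop them.
is_header <- dd$Value == "Value"
clean <- dd[!is_header, ]

# The stat columns come back as character, so convert them to numeric.
clean$G <- as.numeric(clean$G)

# Renumber the rows after filtering.
rownames(clean) <- NULL
```

The same filter-then-convert pattern can be applied to the remaining stat columns once you have confirmed which rows are repeated headers.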

Tested with

R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

other attached packages:
[1] rvest_0.2.0

loaded via a namespace (and not attached):
[1] httr_0.6.1    magrittr_1.5  stringr_0.6.2

You can filter on the table bodies:

library(XML)
appURL <- "http://www.basketball-reference.com/players/j/jamesle01/splits/"
doc <- htmlParse(appURL)
appTables <- doc['//table/tbody']

appTables is then a list containing the 12 table bodies without their headers. You can retrieve the headers from the thead:

myHeaders <- unlist(doc["//thead/tr[2]/th", fun = xmlValue])
myTables <- lapply(appTables, readHTMLTable, header = myHeaders)

You can put the data in one big table using something like:

bigTable <- do.call(rbind, myTables)
> head(bigTable)
     Split Value   G  GS    MP   FG   FGA   3P  3PA   FT  FTA  ORB  TRB  AST  STL BLK  TOV   PF   PTS  FG%  3P%  FT%
1          Total 871 870 34364 8582 17289 1184 3462 5553 7432 1049 6239 6011 1483 698 2906 1615 23901 .496 .342 .747
2    Place  Home 441 440 17167 4201  8307  567 1627 2805 3706  507 3133 3082  711 387 1413  744 11774 .506 .348 .757
3           Road 430 430 17197 4381  8982  617 1835 2748 3726  542 3106 2929  772 311 1493  871 12127 .488 .336 .738
4 All-Star   Pre 569 568 22349 5544 11167  759 2205 3576 4791  655 4051 3966  967 456 1940 1087 15423 .496 .344 .746
5           Post 302 302 12015 3038  6122  425 1257 1977 2641  394 2188 2045  516 242  966  528  8478 .496 .338 .749
6   Result   Win 572 571 22196 5783 11094  772 2154 3749 4931  677 4241 4132 1032 496 1793 1016 16087 .521 .358 .760
   TS% USG% ORtg DRtg   MP  PTS TRB AST
1 .581 31.9  116  103 39.5 27.4 7.2 6.9
2 .592 30.9  118  102 38.9 26.7 7.1 7.0
3 .571 32.8  114  105 40.0 28.2 7.2 6.8
4 .581 31.7  116  103 39.3 27.1 7.1 7.0
5 .582 32.2  117  104 39.8 28.1 7.2 6.8
6 .606 31.7  122   99 38.8 28.1 7.4 7.2
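In the combined table above, the names MP, PTS, TRB and AST each appear twice (once as totals, once as per-game averages), and values read from HTML typically arrive as text. A minimal sketch of one way to handle both, using a toy data frame built from a few of the values shown above; the "_1" suffix is an arbitrary choice, not something the XML package produces:

```r
# Toy subset of the combined table, with duplicated column names kept as-is.
bigTable <- data.frame(
  Value = c("Total", "Home"),
  MP    = c("34364", "17167"),
  PTS   = c("23901", "11774"),
  MP    = c("39.5", "38.9"),
  PTS   = c("27.4", "26.7"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Make the duplicated names unique: the second MP becomes MP_1, and so on.
names(bigTable) <- make.unique(names(bigTable), sep = "_")

# Convert each column to the most specific type that fits its contents.
bigTable[] <- lapply(bigTable, type.convert, as.is = TRUE)
```

After this, bigTable$MP holds the totals and bigTable$MP_1 the per-game values, both numeric, while Value stays character.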

Have a look at the htmltab package (https://github.com/crubba/htmltab). I developed this package for more complex HTML tables where readHTMLTable() is of little use.

devtools::install_github("crubba/htmltab")
library(htmltab)
htmltab(doc = "http://www.basketball-reference.com/players/j/jamesle01/splits/", header = 1:2)

Source: https://stackoverflow.com/questions/27831970/scraping-basketball-reference-com-in-r-xml-package-not-fully-working

Tags: xml, screen-scraping