Extracting an HTML table from a website in R

闹比i 2020-12-20 04:41

Hi, I am trying to extract the table from the Premier League website.

I am using the rvest package, and the code I am using in th…

2 answers

刺人心 (original poster)

2020-12-20 05:18

Since the data is loaded with JavaScript, grabbing the HTML with rvest will not get you what you want, but if you use PhantomJS as a headless browser within RSelenium, it's not all that complicated (by RSelenium standards):

library(RSelenium)
library(rvest)

# initialize browser and driver with RSelenium
ptm <- phantom()
rd <- remoteDriver(browserName = 'phantomjs')
rd$open()

# grab source for page
rd$navigate('https://fantasy.premierleague.com/a/entry/767830/history')
html <- rd$getPageSource()[[1]]

# clean up
rd$close()
ptm$stop()

# parse with rvest
df <- html %>% read_html() %>% 
    html_node('#ismr-event-history table.ism-table') %>% 
    html_table() %>% 
    setNames(gsub('\\S+\\s+(\\S+)', '\\1', names(.))) %>%    # clean column names
    setNames(gsub('\\s', '_', names(.)))

str(df)
## 'data.frame':    20 obs. of  10 variables:
##  $ Gameweek                : chr  "GW1" "GW2" "GW3" "GW4" ...
##  $ Gameweek_Points         : int  34 47 53 51 66 66 65 63 48 90 ...
##  $ Points_Bench            : int  1 6 9 7 14 2 9 3 8 2 ...
##  $ Gameweek_Rank           : chr  "2,406,373" "2,659,789" "541,258" "905,524" ...
##  $ Transfers_Made          : int  0 0 2 0 3 2 2 0 2 0 ...
##  $ Transfers_Cost          : int  0 0 0 0 4 4 4 0 0 0 ...
##  $ Overall_Points          : chr  "34" "81" "134" "185" ...
##  $ Overall_Rank            : chr  "2,406,373" "2,448,674" "1,914,025" "1,461,665" ...
##  $ Value                   : chr  "£100.0" "£100.0" "£99.9" "£100.0" ...
##  $ Change_Previous_Gameweek: logi  NA NA NA NA NA NA ...
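
To see what the two setNames() calls in the pipeline do, here is a small base-R sketch on made-up header strings. The raw headers below are an assumption (an abbreviation followed by the full label); the actual header text on the page may differ.

```r
# Hypothetical raw column names, assumed to be "<abbreviation> <full label>"
raw_names <- c("GW Gameweek", "GP Gameweek Points", "PB Points Bench")

# Step 1: replace each "<token> <token>" pair with its second token,
# which drops the leading abbreviation
step1 <- gsub('\\S+\\s+(\\S+)', '\\1', raw_names)

# Step 2: replace any remaining whitespace with underscores
clean_names <- gsub('\\s', '_', step1)

print(clean_names)
```

Because gsub() replaces every non-overlapping match, labels with more words than these may lose a word in step 1, so it is worth checking the result against the real headers.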

As always, more cleaning is necessary, but overall it's in pretty good shape without too much work. (If you're using the tidyverse, df %>% mutate_if(is.character, parse_number) will do pretty well; mutate_if comes from dplyr and parse_number from readr.) The arrows are images, which is why the last column is all NA, but you can calculate those values from the other columns anyway.
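
For anyone not using the tidyverse, the same cleanup can be sketched in base R. The sample rows are copied from the str() output above and only three columns are shown. to_number is a hypothetical helper, not part of any package, and taking the change as the difference in Overall_Rank between consecutive gameweeks is an assumption about what the arrows represent.

```r
# Sample rows taken from the str() output above (three columns only)
df <- data.frame(
  Gameweek       = c("GW1", "GW2", "GW3", "GW4"),
  Overall_Points = c("34", "81", "134", "185"),
  Overall_Rank   = c("2,406,373", "2,448,674", "1,914,025", "1,461,665"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip everything except digits and the decimal
# point, then convert to numeric
to_number <- function(x) as.numeric(gsub("[^0-9.]", "", x))

# Apply the helper to every character column
chr_cols <- vapply(df, is.character, logical(1))
df[chr_cols] <- lapply(df[chr_cols], to_number)

# Assumed meaning of the arrow column: rank change versus the previous
# gameweek; the first gameweek has no predecessor, hence NA
df$Change_Previous_Gameweek <- c(NA, diff(df$Overall_Rank))

str(df)
```

A negative change here means the rank number went down, that is, the team moved up the table.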
