Extracting an HTML table from a website in R

闹比i 2020-12-20 04:41

Hi, I am trying to extract the table from the Premier League website.

The package I am using is the rvest package, and the code I am using in th

2 answers
  •  刺人心 (original poster)
    2020-12-20 05:18

    Since the data is loaded with JavaScript, grabbing the HTML with rvest will not get you what you want, but if you use PhantomJS as a headless browser within RSelenium, it's not all that complicated (by RSelenium standards):

    library(RSelenium)
    library(rvest)
    
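    # initialize browser and driver with RSelenium
    # (note: recent RSelenium releases mark phantom() as defunct;
    #  wdman::phantomjs() is the suggested replacement there)
    ptm <- phantom()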
    rd <- remoteDriver(browserName = 'phantomjs')
    rd$open()
    
    # grab source for page
    rd$navigate('https://fantasy.premierleague.com/a/entry/767830/history')
    html <- rd$getPageSource()[[1]]
    
    # clean up
    rd$close()
    ptm$stop()
    
    # parse with rvest
    df <- html %>% read_html() %>% 
        html_node('#ismr-event-history table.ism-table') %>% 
        html_table() %>% 
        setNames(gsub('\\S+\\s+(\\S+)', '\\1', names(.))) %>%    # clean column names
        setNames(gsub('\\s', '_', names(.)))
    
    str(df)
    ## 'data.frame':    20 obs. of  10 variables:
    ##  $ Gameweek                : chr  "GW1" "GW2" "GW3" "GW4" ...
    ##  $ Gameweek_Points         : int  34 47 53 51 66 66 65 63 48 90 ...
    ##  $ Points_Bench            : int  1 6 9 7 14 2 9 3 8 2 ...
    ##  $ Gameweek_Rank           : chr  "2,406,373" "2,659,789" "541,258" "905,524" ...
    ##  $ Transfers_Made          : int  0 0 2 0 3 2 2 0 2 0 ...
    ##  $ Transfers_Cost          : int  0 0 0 0 4 4 4 0 0 0 ...
    ##  $ Overall_Points          : chr  "34" "81" "134" "185" ...
    ##  $ Overall_Rank            : chr  "2,406,373" "2,448,674" "1,914,025" "1,461,665" ...
    ##  $ Value                   : chr  "£100.0" "£100.0" "£99.9" "£100.0" ...
    ##  $ Change_Previous_Gameweek: logi  NA NA NA NA NA NA ...
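
The two `gsub()` calls that clean the column names can be tried in isolation. This is only a sketch: the `raw` headers below are hypothetical, on the assumption that each scraped header is an abbreviation, some whitespace, and then the full label. The actual header text on the page is not shown in this answer.

```r
# Hypothetical raw headers (an assumption, not taken from the page):
# an abbreviation, whitespace, then the full label.
raw <- c("GW Gameweek", "GP Gameweek Points", "PB\n Points Bench")

# Step 1: replace each "token, whitespace, token" match with its second
# token, which drops the leading abbreviation.
step1 <- gsub("\\S+\\s+(\\S+)", "\\1", raw)

# Step 2: turn the remaining whitespace into underscores.
clean <- gsub("\\s", "_", step1)

print(clean)
```

Under that assumption the result is names of the form `Gameweek_Points`, matching the `str()` output above.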
    

    As always, more cleaning is necessary, but overall it's in pretty good shape without too much work. (If you're using the tidyverse, df %>% mutate_if(is.character, parse_number) will do pretty well; parse_number comes from readr.) The arrows are images, which is why the last column is all NA, but you can calculate those values anyway.
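
For anyone not using the tidyverse, the same cleanup can be roughly sketched in base R. The sample rows are copied from the `str()` output above (the real table has 20 rows); `to_number()` is a hypothetical helper written for this example, not part of any package, and using the difference in `Overall_Rank` to stand in for the arrow column is an assumption about what the arrows show.

```r
# First four rows of a few character columns, copied from the str() output.
df <- data.frame(
  Gameweek       = c("GW1", "GW2", "GW3", "GW4"),
  Gameweek_Rank  = c("2,406,373", "2,659,789", "541,258", "905,524"),
  Overall_Points = c("34", "81", "134", "185"),
  Overall_Rank   = c("2,406,373", "2,448,674", "1,914,025", "1,461,665"),
  Value          = c("£100.0", "£100.0", "£99.9", "£100.0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip everything except digits, dots and minus
# signs, then convert to numeric.
to_number <- function(x) as.numeric(gsub("[^0-9.-]", "", x))

# Apply the helper to every character column.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], to_number)

# Rebuild the arrow column as the change in overall rank; a negative
# value means the rank number got smaller, i.e. the rank improved.
df$Change_Previous_Gameweek <- c(NA, diff(df$Overall_Rank))

str(df)
```

Note that this also turns `Gameweek` into a plain number (`"GW1"` becomes `1`), just as `parse_number` would.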
