Using rvest package when HTML table has two headers

╄→гoц情女王★ posted on 2019-11-30 20:25:52

Question


I am using the following code to scrape an HTML table on AFL player data:

library(rvest)

website <- read_html("https://afltables.com/afl/stats/teams/adelaide/2017_gbg.html")
table   <- website %>%
           html_nodes("table") %>%
           .[(1)] %>%
           html_table()

The resulting table is shown as 34 obs. of 27 variables, yet nrow(table) and ncol(table) both return NULL. Is that because the data frame has two rows of headers? I want to do calculations on individual columns, but the following gives an error:

table[,1]
# Error in table[, 1] : incorrect number of dimensions

Why does it produce this error, and how can I solve it?


Answer 1:



library(rvest)
#> Loading required package: xml2

website <- read_html("https://afltables.com/afl/stats/teams/adelaide/2017_gbg.html")

This page contains several tables, one per link displayed above the printed table on the main page. Calling html_table() on the result of html_nodes("table") gives you all the tables in a list at once.

all_tables <- website %>%
  html_nodes("table") %>%
  html_table()

str(all_tables, 1)
#> List of 23
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:
#>  $ :'data.frame':    34 obs. of  27 variables:

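You can then select the table you want, but the headers are still not right: the statistic name is used as the name of every column, and the real column names (Player, R1, R2, ...) end up in the first data row.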

head(all_tables[[1]])
#>          Disposals Disposals Disposals Disposals Disposals Disposals
#> 1           Player        R1        R2        R3        R4        R5
#> 2     Atkins, Rory        19        19        19        23        29
#> 3  Beech, Jonathon                                                  
#> 4     Betts, Eddie        18        13        16        22        12
#> 5      Brown, Luke        18        12        13         9        15
#> 6 Cameron, Charlie        23        17        16        16        13
#>   Disposals Disposals Disposals Disposals Disposals Disposals Disposals
#> 1        R6        R7        R8        R9       R10       R11       R12
#> 2        23        20        21        28        37        14        25
#> 3                                                                    15
#> 4        16        13         9        16        14        12        11
#> 5        17        13        20        25        16        12          
#> 6        13        14        10        18        13         8        13
#>   Disposals Disposals Disposals Disposals Disposals Disposals Disposals
#> 1       R14       R15       R16       R17       R18       R19       R20
#> 2        28        15        23        18        19        16        16
#> 3        12        11                                                  
#> 4        14        11        13        16         8                  16
#> 5        10        15        14        17        11        10        20
#> 6        15                  10        20         6         9        17
#>   Disposals Disposals Disposals Disposals Disposals Disposals Disposals
#> 1       R21       R22       R23        QF        PF        GF       Tot
#> 2        27        21        21        16        22        17       536
#> 3                                                                    38
#> 4         7        16        12        13        13         7       318
#> 5        17        17         9        20        10        13       353
#> 6        13        10        10        15        19        16       334
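
For a single table in the state shown above, the fix can be sketched in base R. This is a minimal illustration on a small hand-built data frame that copies a few cells from the output above; the object names raw, stat_name and fixed are made up for the example and are not part of rvest.

```r
# Hand-built stand-in for all_tables[[1]]: every column is named "Disposals"
# and the real column names sit in the first data row.
raw <- data.frame(
  c("Player", "Atkins, Rory", "Betts, Eddie"),
  c("R1", "19", "18"),
  c("R2", "19", "13"),
  stringsAsFactors = FALSE
)
names(raw) <- rep("Disposals", 3)

# Keep the statistic name in case it is needed later.
stat_name <- names(raw)[1]

# Promote the first row to column names, then drop that row.
fixed <- raw[-1, , drop = FALSE]
names(fixed) <- unlist(raw[1, ], use.names = FALSE)
rownames(fixed) <- NULL

print(stat_name)
print(fixed)
```

The same two steps (rename from a row, then drop it) are what the purrr version applies to every table in the list.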

With some manipulation of the list and its tables using purrr and dplyr, you can fix every table that has two header rows:

all_tables <- website %>%
  html_nodes("table") %>%
  # do not let rvest handle the header automatically
  html_table(header = FALSE)

library(purrr)
#> 
#> Attaching package: 'purrr'
#> The following object is masked from 'package:rvest':
#> 
#>     pluck
library(dplyr) # needed for slice()
all_tables_res <- all_tables %>%
  # name each list element after the value in its first row, first column
  # pluck is a purrr function acting like x[[1]][1, 1] here
  lmap(~ set_names(.x, nm = pluck(.x, 1, 1, 1))) %>%
  # for each table, use the second row as the header,
  # then drop the first and second rows
  map(~ set_names(.x, nm = unlist(.x[2, ], use.names = FALSE)) %>% slice(-c(1, 2)))
str(all_tables_res, 1)
#> List of 23
#>  $ Disposals              :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Kicks                  :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Marks                  :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Handballs              :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Goals                  :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Behinds                :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Hit Outs               :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Tackles                :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Rebounds               :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Inside 50s             :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Clearances             :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Clangers               :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Frees                  :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Frees Against          :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Brownlow Votes         :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Contested Possessions  :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Uncontested Possessions:Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Contested Marks        :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Marks Inside 50        :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ One Percenters         :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Bounces                :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ Goal Assists           :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:
#>  $ % Played               :Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of  27 variables:

You can now call up any table from the page by name.

head(all_tables_res$Goals)
#> # A tibble: 6 x 27
#>             Player    R1    R2    R3    R4    R5    R6    R7    R8    R9
#>              <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1     Atkins, Rory     3     1     -     2     1     -     1     1     -
#> 2  Beech, Jonathon                                                      
#> 3     Betts, Eddie     4     3     3     6     3     1     3     2     3
#> 4      Brown, Luke     -     1     -     -     1     -     -     -     -
#> 5 Cameron, Charlie     2     1     -     1     2     2     2     -     4
#> 6     Crouch, Brad                             -     -     -     -     1
#> # ... with 17 more variables: R10 <chr>, R11 <chr>, R12 <chr>, R14 <chr>,
#> #   R15 <chr>, R16 <chr>, R17 <chr>, R18 <chr>, R19 <chr>, R20 <chr>,
#> #   R21 <chr>, R22 <chr>, R23 <chr>, QF <chr>, PF <chr>, GF <chr>,
#> #   Tot <chr>
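
Note that every column comes back as <chr>, so the values have to be converted before doing calculations on them. The following is a minimal base R sketch on a hand-built data frame that copies three rows from the Goals output above. It assumes that both blank cells and "-" should be treated as missing; whether "-" really means zero on this site is not confirmed here. The helper to_number is made up for the example.

```r
# Hand-built stand-in for a few rows and columns of all_tables_res$Goals.
goals <- data.frame(
  Player = c("Atkins, Rory", "Beech, Jonathon", "Betts, Eddie"),
  R1 = c("3", "", "4"),
  R2 = c("1", "", "3"),
  R3 = c("-", "", "3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: treat "" and "-" as missing, then convert to numeric.
# (Assumption: swap NA for 0 if "-" is known to mean "no goals".)
to_number <- function(x) {
  x[x %in% c("", "-")] <- NA
  as.numeric(x)
}

# Convert every column except the player name.
stat_cols <- setdiff(names(goals), "Player")
goals[stat_cols] <- lapply(goals[stat_cols], to_number)

# Column-wise calculations now work.
round_totals <- colSums(goals[stat_cols], na.rm = TRUE)
print(round_totals)
```

Replacing the markers explicitly, rather than relying on as.numeric() to coerce them with a warning, keeps the conversion rule visible and easy to change.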



Answer 2:


Firstly, and unrelated to your question: don't use table as a name for your objects, because table() is already a base R function. Masking it is considered bad practice, and I've been told that it will come back and bite you in the butt somewhere down the line.

Moving on to the question: you are struggling with the type of object that html_table() gives you. It returns a list, which contains a regular data.frame. nrow() and ncol() return NULL for that list because a list has no dimensions; it simply has one element, the data.frame. By selecting that first (and only) element of the list, you get to the data frame you're actually interested in. This data frame has 27 columns and 34 rows:

library(rvest)

website <- read_html("https://afltables.com/afl/stats/teams/adelaide/2017_gbg.html")
scraped <- website %>%
  html_nodes("table") %>%
  .[1] %>%
  html_table() %>%
  `[[`(1)   # Select the first element of the list, like scraped[[1]]
ncol(scraped)
# 27
nrow(scraped)
# 34
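
The difference between the list and the data frame inside it can be reproduced without any scraping. This is a small base R sketch; the data frame df and the list wrapped are made up for the illustration, with two values copied from the Disposals table.

```r
# A small data frame wrapped in a list, mimicking what html_table()
# returns when it is given a set of nodes.
df <- data.frame(Player = c("Atkins, Rory", "Betts, Eddie"), R1 = c(19, 18))
wrapped <- list(df)

# A list has no dim attribute, so nrow() and ncol() return NULL.
print(nrow(wrapped))
print(ncol(wrapped))

# Single brackets return a shorter list; double brackets return the element.
print(class(wrapped[1]))
print(class(wrapped[[1]]))
print(dim(wrapped[[1]]))
```

This is also why table[, 1] fails in the question: two-index subsetting needs an object with dimensions, and a plain list has none.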


Source: https://stackoverflow.com/questions/46618599/using-rvest-package-when-html-table-has-two-headers
