Scraping data using rvest and a specific error

Submitted by 江枫思渺然 on 2019-12-11 15:19:48

Question


I have this data scraping function:

espn_team_stats <- function(team, side, season) {

 # Libraries
 library(tidyverse)
 library(rvest)

 # Using expand.grid() to run all combinations of the links above
 url_factors <- expand.grid(side = c("batting", "fielding"), 
             team = c("ari", "atl", "bal", "bos", "chc", "chw", "cws",
                     "cin", "cle", "det", "fla", "mia", "hou", "kan",
                     "laa", "lad", "mil", "min", "nyy", "nym", "oak",
                     "phi", "pit", "sd", "sf", "sea", "stl", "tb",
                     "tex", "tor", "was", "wsh"), 
             season = c(2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 
                       2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 
                       2018))

 # URL vectors (Need to put all three url vectors in dataframe)
 team_url <- paste0("http://www.espn.com/mlb/team/stats/", 
                    url_factors$side, "/_/name/",
                    url_factors$team, "/year/", 
                    url_factors$season, "/")
 team_url <- toString(team_url)

 # Building the data table
 team_page <- team_url %>%
   read_html %>%
   html_node("#my-players-table > div.mod-container.mod-table > 
           div.mod-content > table:nth-child(1)") %>%
   html_table(header = T)

 # Setting tables
 team_tables <- team_page
 team_tables$Year <- c(side, team, season)

 return(team_tables)
}

espn_team_stats(bal, batting, 2018)

I keep getting the following error:

Error in open.connection(x, "rb") : HTTP error 414. # URLs are too long?
Called from: open.connection(x, "rb")

The function will not run. I suspect something is wrong with my expand.grid() call, combined with the way team_url is pasted together.

Examples of the URLs I am trying to build:

http://www.espn.com/mlb/team/stats/batting/_/name/ari/year/2017
http://www.espn.com/mlb/team/stats/fielding/_/name/ari/year/2017
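
That suspicion can be checked offline, without requesting any page. The snippet below is a minimal sketch that uses a shortened team and season list (an assumption made only to keep the output small) and inspects what `toString()` does to the URL vector:

```r
# Shortened grid (assumption: 3 teams, 2 seasons) to keep the demo small
url_factors <- expand.grid(
  side = c("batting", "fielding"),
  team = c("ari", "atl", "bal"),
  season = 2017:2018,
  stringsAsFactors = FALSE
)

# paste0() is vectorised, so this yields one URL per grid row
team_url <- paste0("http://www.espn.com/mlb/team/stats/",
                   url_factors$side, "/_/name/",
                   url_factors$team, "/year/",
                   url_factors$season, "/")
length(team_url)    # 12 separate URLs

# toString() glues them into ONE comma-separated string
collapsed <- toString(team_url)
length(collapsed)   # 1
nchar(collapsed)    # grows with every row of the grid
```

With the full grid from the function (2 sides, 32 team codes, 17 seasons), the collapsed string contains 1088 URLs, and `read_html()` tries to fetch it as a single address.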

Answer 1:


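Since you pass in the values you want, there is no need to build a grid. In the original function, toString(team_url) collapses every URL produced from expand.grid() into a single comma-separated string, so read_html() requests one enormous address and the server responds with HTTP 414 (URI Too Long). The arguments also need to be passed as quoted strings. Simply do: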

 library(rvest)  # provides read_html(), html_node(), html_table() and %>%

 espn_team_stats <- function(team, side, season) {

   # Build a single URL from the supplied arguments
   team_url <- paste0("http://www.espn.com/mlb/team/stats/", side,
                      "/_/name/", team, "/year/", season, "/")

   # Building the data table
   team_tables <- team_url %>%
     read_html() %>%
     html_node("#my-players-table > div.mod-container.mod-table > div.mod-content > table:nth-child(1)") %>%
     html_table(header = TRUE)

   # Record which season the table belongs to
   team_tables$Year <- season

   return(team_tables)
 }

espn_team_stats("bal", "batting", 2018)
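
If several team, side and season combinations are still wanted, one option is to keep the grid but request each URL separately instead of collapsing them. The following is only a sketch under assumptions: `build_team_urls()` and `scrape_all()` are hypothetical helper names, the one-second pause is an arbitrary choice, and `scraper` is expected to be a single-page function such as `espn_team_stats()` above.

```r
# Hypothetical helper: one row (and one URL) per combination
build_team_urls <- function(sides, teams, seasons) {
  combos <- expand.grid(side = sides, team = teams, season = seasons,
                        stringsAsFactors = FALSE)
  combos$url <- paste0("http://www.espn.com/mlb/team/stats/",
                       combos$side, "/_/name/", combos$team,
                       "/year/", combos$season, "/")
  combos
}

combos <- build_team_urls(c("batting", "fielding"), c("bal", "bos"), 2017:2018)
nrow(combos)  # 8 combinations, each with its own URL

# Hypothetical wrapper: calls the single-page scraper once per row.
# Assumes scraper(team, side, season) returns a data frame.
scrape_all <- function(combos, scraper = espn_team_stats, pause = 1) {
  lapply(seq_len(nrow(combos)), function(i) {
    Sys.sleep(pause)  # wait between requests
    tryCatch(
      scraper(combos$team[i], combos$side[i], combos$season[i]),
      error = function(e) NULL  # return NULL for pages that fail
    )
  })
}
```

Calling `scrape_all(combos)` would then return a list with one element per combination, with `NULL` in place of any page that could not be read (for example, a team code that did not exist in a given season).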


Source: https://stackoverflow.com/questions/51489815/scraping-data-using-rvest-and-a-specific-error
