Question
I'm trying to scrape game logs of every MLB player dating back to 2000 from baseball-reference.com using R. I've read a lot of material that is helpful but not extensive enough for my purposes. The URL for, say, Curtis Granderson's 2016 game logs is https://www.baseball-reference.com/players/gl.fcgi?id=grandcu01&t=b&year=2016.
If I have a list of player IDs and years I know I should be able to loop through them somehow with a function similar to this one that grabs attendance by year:
library(XML)  # provides readHTMLTable

fetch_attendance <- function(year) {
  url <- paste0("http://www.baseball-reference.com/leagues/MLB/", year,
                "-misc.shtml")
  data <- readHTMLTable(url, stringsAsFactors = FALSE)
  data <- data[[1]]
  data$year <- year
  data
}
But, again, I'm struggling to create a more extensive function that does the job. Any help is much appreciated. Thank you!
Answer 1:
To generate a list of player IDs, you can do something like the following:
library(rvest)

# Read the player index page
scraping_MLB <- read_html("https://www.baseball-reference.com/players/")

# Select the player links once; names and hrefs both come from them
player_links <- scraping_MLB %>%
  html_nodes(xpath = '//*[@id="content"]/ul') %>%
  html_nodes("div") %>%
  html_nodes("a")

# The link text holds the player names
player_name1 <- html_text(player_links)
player_name2 <- lapply(player_name1, function(x) strsplit(x, split = ","))
player_name <- setNames(do.call(rbind.data.frame, player_name2), "Players_Name")

# Strip the leading path and the trailing extension from each href, leaving the ID
player_id1 <- html_attr(player_links, "href")
player_id <- setNames(as.data.frame(player_id1), "Players_ID")
player_id$Players_ID <- sub("(\\/.*\\/.*\\/)(\\w+)(..*)", "\\2", player_id$Players_ID)

player_df <- cbind(player_name, player_id)
head(player_df)
Once you have the list of all player IDs, you can loop through them by generalizing this URL: https://www.baseball-reference.com/players/gl.fcgi?id=grandcu01&t=b&year=2016.
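As a rough illustration of what "generalizing" the URL could look like, here is a minimal sketch. The helper name build_gamelog_url is hypothetical, the query layout is copied from the example URL above, and "examppl01" is a made-up placeholder ID rather than a real player.

```r
# Hypothetical helper: build the game-log URL for one player ID and year.
# The query layout (id, t=b, year) is copied from the example URL.
build_gamelog_url <- function(player_id, year) {
  paste0("https://www.baseball-reference.com/players/gl.fcgi?id=",
         player_id, "&t=b&year=", year)
}

# Every combination of two IDs and two seasons.
# "grandcu01" comes from the question; "examppl01" is a placeholder.
grid <- expand.grid(player_id = c("grandcu01", "examppl01"),
                    year = 2015:2016,
                    stringsAsFactors = FALSE)

# paste0 is vectorized, so one call builds all the URLs
grid$url <- build_gamelog_url(grid$player_id, grid$year)
print(grid$url)
```

Because expand.grid produces one row per ID/year pair, the same data frame can later drive the actual requests.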
(Edit note: added this code snippet after a clarification from OP)
You can start with the sample code below and then optimize it using mapply or something similar:
# Fetch the batting game logs of the first four players in player_df for 2000-2016
library(rvest)

players_stat <- list()
j <- 1
for (i in seq_len(min(4, nrow(player_df)))) {
  for (year in 2000:2016) {
    scrapped_page <- read_html(paste0(
      "https://www.baseball-reference.com/players/gl.fcgi?id=",
      as.character(player_df$Players_ID[i]), "&t=b&year=", year))
    tables <- html_nodes(scrapped_page, "table")
    if (length(tables) >= 1) {
      # Find the table whose id attribute is "batting_gamelogs"
      tab <- html_attrs(tables)
      batting_gamelogs <- which(sapply(tab, function(x) x["id"]) == "batting_gamelogs")
      # Skip pages that have tables but no batting game log
      if (length(batting_gamelogs) == 0) next
      scrapped_data <- html_table(tables[[batting_gamelogs[1]]], fill = TRUE)
      scrapped_data$Year <- year
      scrapped_data$Players_Name <- player_df$Players_Name[i]
      players_stat[[j]] <- scrapped_data
      names(players_stat)[j] <- paste0(player_df$Players_ID[i], "_", year)
      j <- j + 1
    }
  }
}
players_stat
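One possible way to replace the nested loops with Map (a close relative of mapply) is sketched below. The function names fetch_gamelog and collect_gamelogs, the pause argument, and the fake_fetch stand-in are all hypothetical. The CSS selector table#batting_gamelogs assumes the table id used in the loop above is still what the page serves. The demo at the end uses the stand-in fetcher, so it runs without touching the network.

```r
# Sketch only: fetch one player-season. Returns NULL when the page cannot be
# read or has no table with id "batting_gamelogs" (an assumption about the page).
fetch_gamelog <- function(player_id, year) {
  url <- paste0("https://www.baseball-reference.com/players/gl.fcgi?id=",
                player_id, "&t=b&year=", year)
  page <- tryCatch(rvest::read_html(url), error = function(e) NULL)
  if (is.null(page)) return(NULL)
  node <- rvest::html_node(page, "table#batting_gamelogs")
  if (inherits(node, "xml_missing")) return(NULL)
  out <- rvest::html_table(node, fill = TRUE)
  out$Year <- year
  out$Players_ID <- player_id
  out
}

# Apply fetch_one to every ID/year combination, waiting `pause` seconds
# before each call, and keep only the combinations that returned data.
collect_gamelogs <- function(ids, years, fetch_one = fetch_gamelog, pause = 0) {
  grid <- expand.grid(id = ids, year = years, stringsAsFactors = FALSE)
  logs <- Map(function(id, year) {
    Sys.sleep(pause)
    fetch_one(id, year)
  }, grid$id, grid$year)
  names(logs) <- paste0(grid$id, "_", grid$year)
  Filter(Negate(is.null), logs)  # drop player-seasons with no data
}

# Offline demo with a stand-in fetcher that has no data for 2001
fake_fetch <- function(id, year) {
  if (year == 2001) return(NULL)
  data.frame(Players_ID = id, Year = year)
}
demo <- collect_gamelogs(c("a01", "b01"), 2000:2001, fetch_one = fake_fetch)
print(names(demo))
```

For the real site, the call would be something like collect_gamelogs(player_df$Players_ID[1:4], 2000:2016, pause = 2), where the two-second pause is only a guess at a polite delay. If all the returned tables share the same columns, do.call(rbind, ...) on the result should stack them into one data frame.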
Hope this helps!
Source: https://stackoverflow.com/questions/45534808/web-scraping-looping-through-list-of-ids-and-years-in-r