rvest

Using rvest package when HTML table has two headers

╄→гoц情女王★ Submitted on 2019-11-30 20:25:52
Question: I am using the following code to scrape an HTML table of AFL player data:

    library(rvest)
    website <- read_html("https://afltables.com/afl/stats/teams/adelaide/2017_gbg.html")
    table <- website %>% html_nodes("table") %>% .[1] %>% html_table()

The resulting table shows as 34 obs. of 27 variables; however, nrow(table) and ncol(table) both return NULL. Is it correct that this is because there are two rows of headers in the data frame? I want to be able to do calculations based on individual columns.
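The NULL is most likely not caused by the headers: html_table() returns a list of data frames, and nrow()/ncol() return NULL for a list. A minimal sketch of extracting the data frame and preparing columns for calculations (fill = TRUE and the coercion step are my assumptions about this table, not from the question):

    library(rvest)

    website <- read_html("https://afltables.com/afl/stats/teams/adelaide/2017_gbg.html")

    tbls <- website %>% html_nodes("table") %>% .[1] %>% html_table(fill = TRUE)
    df <- tbls[[1]]        # html_table() returns a list; take the first data frame

    nrow(df); ncol(df)     # now report real dimensions

    # The second header row can leave stat columns as character; coerce them
    # before computing (assumes column 1 holds player names)
    df[-1] <- lapply(df[-1], function(x) suppressWarnings(as.numeric(x)))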

Scraping a webpage with React JS in R

别来无恙 Submitted on 2019-11-30 18:34:04
Question: I'm trying to scrape the page below: https://metro.zakaz.ua/uk/?promotion=1 It is a page with React content. I can scrape the first page with this code:

    library(rvest)
    library(jsonlite)

    url <- "https://metro.zakaz.ua/uk/?promotion=1"
    read_html(url) %>%
      html_nodes("script") %>%
      .[[8]] %>%
      html_text() %>%
      fromJSON() %>%
      .$catalog %>% .$items %>%
      data.frame()

As a result I have all items from the first page, but I don't know how to scrape the other pages. This JS code moves to another page, if that can help: document.querySelectorAll('.catalog-pagination')[0]
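One way is to request each page directly and reuse the same JSON extraction. A sketch, with two loud assumptions: that the site accepts a "page" query parameter, and that the embedded JSON stays at script index 8 on every page (both may be wrong; inspect the network tab to confirm the real paging URL):

    library(rvest)
    library(jsonlite)

    get_items <- function(page) {
      # "page=" is an assumed query parameter, not confirmed from the site
      url <- paste0("https://metro.zakaz.ua/uk/?promotion=1&page=", page)
      read_html(url) %>%
        html_nodes("script") %>%
        .[[8]] %>%            # fragile: the script position may differ per page
        html_text() %>%
        fromJSON() %>%
        .$catalog %>% .$items %>%
        data.frame()
    }

    all_items <- do.call(rbind, lapply(1:5, get_items))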

Scraping tables on multiple web pages with rvest in R

十年热恋 Submitted on 2019-11-30 16:58:24
Question: I am new to web scraping and am trying to scrape tables on multiple web pages. Here is the site: http://www.baseball-reference.com/teams/MIL/2016.shtml I am able to scrape a table on one page rather easily using rvest. There are multiple tables, but I only want to scrape the first one. Here is my code:

    library(rvest)

    url4 <- "http://www.baseball-reference.com/teams/MIL/2016.shtml"
    Brewers2016 <- url4 %>%
      read_html() %>%
      html_nodes(xpath = '//*[@id="div_team_batting"]/table[1]') %>%
      html_table()
    Brewers2016 <- as.data.frame(Brewers2016)

The problem is that I want to scrape the first table on each of several pages, not just this one.
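Since the season pages share the same URL pattern, the same XPath can be looped over a vector of URLs. A sketch, assuming (my assumption, not verified for every season) that each page exposes the batting table under the same div_team_batting id:

    library(rvest)

    years <- 2014:2016
    urls  <- sprintf("http://www.baseball-reference.com/teams/MIL/%d.shtml", years)

    get_batting <- function(u) {
      u %>%
        read_html() %>%
        html_nodes(xpath = '//*[@id="div_team_batting"]/table[1]') %>%
        html_table() %>%
        .[[1]]               # html_table() returns a list; take the table itself
    }

    batting <- lapply(urls, get_batting)
    names(batting) <- years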

Using tryCatch and rvest to deal with 404 and other crawling errors

为君一笑 Submitted on 2019-11-30 09:17:19
Question: When retrieving the h1 title using rvest, I sometimes run into 404 pages. This stops the process and returns the error below. See the example.

    Error in open.connection(x, "rb") : HTTP error 404.

    Data <- data.frame(Pages = c(
      "http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
      "http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
      "http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html",
      "http://boingboing.net/2016/06/16/omar-mateen-posted-to-facdddebook.html"))

Code used to retrieve the h1:

    library(rvest)
    sapply(Data$Pages, function(url) {
      url %>% as.character() %>% read_html() %>% html_node("h1") %>% html_text()
    })
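A minimal sketch of wrapping the request in tryCatch so that a 404 becomes NA instead of aborting the whole sapply (the helper name and the NA fallback are my choices):

    library(rvest)

    get_h1 <- function(url) {
      tryCatch(
        url %>% as.character() %>% read_html() %>% html_node("h1") %>% html_text(),
        error = function(e) NA_character_   # 404s and other failures yield NA
      )
    }

    titles <- sapply(Data$Pages, get_h1)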

Using 'rvest' to extract links

拥有回忆 Submitted on 2019-11-30 08:24:19
Question: I am trying to scrape data from Yelp. One step is to extract the link for each restaurant. For example, when I search for restaurants in NYC and get some results, I want to extract the links of all 10 restaurants Yelp recommends on page 1. Here is what I have tried:

    library(rvest)
    page <- read_html("http://www.yelp.com/search?find_loc=New+York,+NY,+USA")
    page %>% html_nodes(".biz-name span") %>% html_attr('href')

But the code always returns NA. Can anyone help me with that? Thanks!

Answer 1: library
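A plausible fix (my assumption about Yelp's markup at the time, not taken from the truncated answer): the href attribute lives on the enclosing <a class="biz-name"> anchor rather than on the span inside it, so selecting the span returns NA. A sketch:

    library(rvest)

    page <- read_html("http://www.yelp.com/search?find_loc=New+York,+NY,+USA")

    # Select the anchor itself; the span carries no href attribute
    links <- page %>% html_nodes("a.biz-name") %>% html_attr("href")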

Web scraping of key stats in Yahoo! Finance with R

自作多情 Submitted on 2019-11-30 07:52:26
Question: Is anyone experienced in scraping data from the Yahoo! Finance key statistics page with R? I am familiar with scraping data directly from HTML using read_html(), html_nodes(), and html_text() from the rvest package. However, this web page (MSFT key stats) is a bit complicated, and I am not sure whether all the stats are kept in XHR, JS, or Doc. I am guessing the data is stored in JSON. If anyone knows a good way to extract and parse the data for this web page with R, kindly answer my question, great thanks in advance.
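A sketch of one common approach from that era, assuming (my assumption; the page has since changed) that Yahoo embedded the page state as JSON inside a script beginning with root.App.main:

    library(rvest)
    library(jsonlite)
    library(stringr)

    url <- "https://finance.yahoo.com/quote/MSFT/key-statistics"
    scripts <- read_html(url) %>% html_nodes("script") %>% html_text()

    # Locate the script carrying the embedded app state and cut out the JSON object
    raw   <- scripts[str_detect(scripts, fixed("root.App.main"))]
    json  <- str_match(raw, regex("root\\.App\\.main\\s*=\\s*(\\{.*\\});",
                                  dotall = TRUE))[, 2]
    state <- fromJSON(json)

    # The key statistics sat somewhere under state$context$dispatcher$stores
    # (the path is an assumption based on how the page used to be organized)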

What's my user agent when I parse website with rvest package in R?

狂风中的少年 Submitted on 2019-11-30 07:39:47
Question: Since it is easy in R, I am using the rvest package to parse HTML and extract information from websites. I am wondering what my User-Agent is (if there is any) during the request, since a User-Agent is normally assigned by the browser, and whether there is a way to set it somehow. My code that opens a session and extracts information from the HTML is below:

    library(rvest)
    se <- html_session("http://www.wp.pl") %>%
      html_nodes("[data-st-area=Glonews-mozaika] li:nth-child(7) a") %>%
      html_attr(name = "href")

Answer 1: I used https://httpbin.org/user-agent to find out:

    library(rvest)
    se <- html_session("https://httpbin.org/user-agent")
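Continuing that idea: httpbin echoes the User-Agent header back, and html_session() forwards httr config objects to the underlying GET(), so a custom agent can be set with httr::user_agent(). A sketch (the example UA string is made up):

    library(rvest)
    library(httr)

    # Inspect the default User-Agent rvest/httr sends
    se <- html_session("https://httpbin.org/user-agent")
    content(se$response)      # echoes the header; exact string varies by version

    # Set a custom one by passing an httr config into the session
    se2 <- html_session("https://httpbin.org/user-agent",
                        user_agent("my-scraper/0.1 (me@example.com)"))
    content(se2$response)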