rvest

Using rvest package when HTML table has two headers

╄→гoц情女王★ Submitted on 2019-11-30 20:25:52
Question: I am using the following code to scrape an HTML table of AFL player data:

    library(rvest)
    website <- read_html("https://afltables.com/afl/stats/teams/adelaide/2017_gbg.html")
    table <- website %>% html_nodes("table") %>% .[1] %>% html_table()

The resulting table shows as 34 obs. of 27 variables; however, nrow(table) and ncol(table) both return NULL. Is it correct that this is because there are two rows of headers in the data frame? I want to be able to do calculations based on individual columns.
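The NULL is most likely not caused by the headers: html_table() returns a list of data frames, and nrow()/ncol() return NULL for a list. A minimal sketch of extracting the data frame and preparing columns for calculations (fill = TRUE and the coercion step are my assumptions about this table, not from the question):

    library(rvest)

    website <- read_html("https://afltables.com/afl/stats/teams/adelaide/2017_gbg.html")

    tbls <- website %>% html_nodes("table") %>% .[1] %>% html_table(fill = TRUE)
    df <- tbls[[1]]        # html_table() returns a list; take the first data frame

    nrow(df); ncol(df)     # now report real dimensions

    # The second header row can leave stat columns as character; coerce them
    # before computing (assumes column 1 holds player names)
    df[-1] <- lapply(df[-1], function(x) suppressWarnings(as.numeric(x)))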

Scraping a webpage with React JS in R

别来无恙 Submitted on 2019-11-30 18:34:04
Question: I'm trying to scrape the page below: https://metro.zakaz.ua/uk/?promotion=1 It is a page with React content. I can scrape the first page with this code:

    library(rvest)
    library(jsonlite)

    url <- "https://metro.zakaz.ua/uk/?promotion=1"
    read_html(url) %>%
      html_nodes("script") %>%
      .[[8]] %>%
      html_text() %>%
      fromJSON() %>%
      .$catalog %>% .$items %>%
      data.frame()

As a result I have all items from the first page, but I don't know how to scrape the other pages. This JS code moves to another page, if that can help: document.querySelectorAll('.catalog-pagination')[0]
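One way is to request each page directly and reuse the same JSON extraction. A sketch, with two loud assumptions: that the site accepts a "page" query parameter, and that the embedded JSON stays at script index 8 on every page (both may be wrong; inspect the network tab to confirm the real paging URL):

    library(rvest)
    library(jsonlite)

    get_items <- function(page) {
      # "page=" is an assumed query parameter, not confirmed from the site
      url <- paste0("https://metro.zakaz.ua/uk/?promotion=1&page=", page)
      read_html(url) %>%
        html_nodes("script") %>%
        .[[8]] %>%            # fragile: the script position may differ per page
        html_text() %>%
        fromJSON() %>%
        .$catalog %>% .$items %>%
        data.frame()
    }

    all_items <- do.call(rbind, lapply(1:5, get_items))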

Scraping tables on multiple web pages with rvest in R

十年热恋 Submitted on 2019-11-30 16:58:24
Question: I am new to web scraping and am trying to scrape tables on multiple web pages. Here is the site: http://www.baseball-reference.com/teams/MIL/2016.shtml I am able to scrape a table on one page rather easily using rvest. There are multiple tables, but I only want to scrape the first one. Here is my code:

    library(rvest)

    url4 <- "http://www.baseball-reference.com/teams/MIL/2016.shtml"
    Brewers2016 <- url4 %>%
      read_html() %>%
      html_nodes(xpath = '//*[@id="div_team_batting"]/table[1]') %>%
      html_table()
    Brewers2016 <- as.data.frame(Brewers2016)

The problem is that I want to scrape the first table on each of several pages, not just this one.
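Since the season pages share the same URL pattern, the same XPath can be looped over a vector of URLs. A sketch, assuming (my assumption, not verified for every season) that each page exposes the batting table under the same div_team_batting id:

    library(rvest)

    years <- 2014:2016
    urls  <- sprintf("http://www.baseball-reference.com/teams/MIL/%d.shtml", years)

    get_batting <- function(u) {
      u %>%
        read_html() %>%
        html_nodes(xpath = '//*[@id="div_team_batting"]/table[1]') %>%
        html_table() %>%
        .[[1]]               # html_table() returns a list; take the table itself
    }

    batting <- lapply(urls, get_batting)
    names(batting) <- years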

Using tryCatch and rvest to deal with 404 and other crawling errors

为君一笑 Submitted on 2019-11-30 09:17:19
Question: When retrieving the h1 title using rvest, I sometimes run into 404 pages. This stops the process and returns the error below. See the example.

    Error in open.connection(x, "rb") : HTTP error 404.

    Data <- data.frame(Pages = c(
      "http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
      "http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
      "http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html",
      "http://boingboing.net/2016/06/16/omar-mateen-posted-to-facdddebook.html"))

Code used to retrieve the h1:

    library(rvest)
    sapply(Data$Pages, function(url) {
      url %>% as.character() %>% read_html() %>% html_node("h1") %>% html_text()
    })
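A minimal sketch of wrapping the request in tryCatch so that a 404 becomes NA instead of aborting the whole sapply (the helper name and the NA fallback are my choices):

    library(rvest)

    get_h1 <- function(url) {
      tryCatch(
        url %>% as.character() %>% read_html() %>% html_node("h1") %>% html_text(),
        error = function(e) NA_character_   # 404s and other failures yield NA
      )
    }

    titles <- sapply(Data$Pages, get_h1)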

Using 'rvest' to extract links

拥有回忆 Submitted on 2019-11-30 08:24:19
Question: I am trying to scrape data from Yelp. One step is to extract the link for each restaurant. For example, when I search for restaurants in NYC and get some results, I want to extract the links of all 10 restaurants Yelp recommends on page 1. Here is what I have tried:

    library(rvest)
    page <- read_html("http://www.yelp.com/search?find_loc=New+York,+NY,+USA")
    page %>% html_nodes(".biz-name span") %>% html_attr('href')

But the code always returns NA. Can anyone help me with that? Thanks!

Answer 1: library
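A plausible fix (my assumption about Yelp's markup at the time, not taken from the truncated answer): the href attribute lives on the enclosing <a class="biz-name"> anchor rather than on the span inside it, so selecting the span returns NA. A sketch:

    library(rvest)

    page <- read_html("http://www.yelp.com/search?find_loc=New+York,+NY,+USA")

    # Select the anchor itself; the span carries no href attribute
    links <- page %>% html_nodes("a.biz-name") %>% html_attr("href")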

Web scraping of key stats in Yahoo! Finance with R

自作多情 Submitted on 2019-11-30 07:52:26
Question: Is anyone experienced in scraping data from the Yahoo! Finance key statistics page with R? I am familiar with scraping data directly from HTML using read_html(), html_nodes(), and html_text() from the rvest package. However, this web page (MSFT key stats) is a bit complicated, and I am not sure whether all the stats are kept in XHR, JS, or Doc. I am guessing the data is stored in JSON. If anyone knows a good way to extract and parse the data for this web page with R, kindly answer my question, great thanks in advance.
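A sketch of one common approach from that era, assuming (my assumption; the page has since changed) that Yahoo embedded the page state as JSON inside a script beginning with root.App.main:

    library(rvest)
    library(jsonlite)
    library(stringr)

    url <- "https://finance.yahoo.com/quote/MSFT/key-statistics"
    scripts <- read_html(url) %>% html_nodes("script") %>% html_text()

    # Locate the script carrying the embedded app state and cut out the JSON object
    raw   <- scripts[str_detect(scripts, fixed("root.App.main"))]
    json  <- str_match(raw, regex("root\\.App\\.main\\s*=\\s*(\\{.*\\});",
                                  dotall = TRUE))[, 2]
    state <- fromJSON(json)

    # The key statistics sat somewhere under state$context$dispatcher$stores
    # (the path is an assumption based on how the page used to be organized)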

What's my user agent when I parse website with rvest package in R?

狂风中的少年 Submitted on 2019-11-30 07:39:47
Question: Since it is easy in R, I am using the rvest package to parse HTML and extract information from websites. I am wondering what my User-Agent is (if there is any) during the request, since a User-Agent is normally assigned by the browser, and whether there is a way to set it somehow. My code that opens a session and extracts information from the HTML is below:

    library(rvest)
    se <- html_session("http://www.wp.pl") %>%
      html_nodes("[data-st-area=Glonews-mozaika] li:nth-child(7) a") %>%
      html_attr(name = "href")

Answer 1: I used https://httpbin.org/user-agent to find out:

    library(rvest)
    se <- html_session("https://httpbin.org/user-agent")
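Continuing that idea: httpbin echoes the User-Agent header back, and html_session() forwards httr config objects to the underlying GET(), so a custom agent can be set with httr::user_agent(). A sketch (the example UA string is made up):

    library(rvest)
    library(httr)

    # Inspect the default User-Agent rvest/httr sends
    se <- html_session("https://httpbin.org/user-agent")
    content(se$response)      # echoes the header; exact string varies by version

    # Set a custom one by passing an httr config into the session
    se2 <- html_session("https://httpbin.org/user-agent",
                        user_agent("my-scraper/0.1 (me@example.com)"))
    content(se2$response)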