rvest | 易学教程

Specifying column class in html_table(rvest)

Question: I am using html_table from rvest to read a two-column concordance table from the website below. Both columns contain values with leading zeros that I want to preserve, so I want the columns to be of class character. I use the following code: library(rvest) library(data.table) df <- list() for (j in 1:25) { url <- paste('http://unstats.un.org/unsd/cr/registry/regso.asp?Ci=70&Lg=1&Co=&T=0&p=', j, '&prn=yes', sep='') webpage <- read_html(url) table <- html_nodes(webpage,

Map a tbl of hyperlinks into read_html

Question: I have a tibble with one column that stores a hyperlink in each row. I want to map over these links using map_dfr, passing them one after another through read_html(.x[.x]) %>% html_node(".body-copy-lg") %>% html_text. When I do so, I always end up with the error: Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : Expecting a single string value: [type=character; extent=3]. This tells me that read_html is basically saying: "Hey stop

Triggering doPostBack javascript with RSelenium to scrape a multi-page table

Question: I am struggling to scrape data from a table that spans several pages. The pages are linked via javascript. The data I am interested in is based on the website's search function: url <- "http://aims.niassembly.gov.uk/plenary/searchresults.aspx?tb=0&tbv=0&tbt=All%20Members&pt=7&ptv=7&ptt=Petition%20of%20Concern&mc=0&mcv=0&mct=All%20Categories&mt=0&mtv=0&mtt=All%20Types&sp=1&spv=0&spt=Tabled%20Between&ss=jc7icOHu4kg=&tm=2&per=1&fd=01/01/2011&td=17/04/2018&tit=0&txt=0&pm=0&it=0&pid=1

Could not read webpage with read_html using the rvest package in R

Question: I'm trying to scrape the location of product reviewers from Amazon. For example, from this webpage, https://www.amazon.com/gp/profile/amzn1.account.AH55KF4JK5IKKJ77MPOLHOR4YAQQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8, I need to get "HAINESVILLE, ILLINOIS, United States". I use the rvest package for web scraping. Here is what I did: library(rvest) url='https://www.amazon.com/gp/profile/amzn1.account.AH55KF4JK5IKKJ77MPOLHOR4YAQQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8' page = read_html(url) I got an error like the one below: Error in

Scraping location data in rvest

Question: I'm currently trying to scrape latitude/longitude data from a list of URLs I have using rvest. Each URL has an embedded Google map with a specific location, but the URLs themselves don't show the path that the API is taking. When looking at the page source, I see that the part I'm after is here: <script type="text/javascript" src="http://maps.google.com/maps/api/js?sensor=false"> </script> <script type="text/javascript"> function initialize() { var myLatlng = new google.maps.LatLng(43.805170,

How to convert an HTML R object to character?

Question: Here's my reproducible example: library(rvest) page <- html("http://google.com") class(page) page > as.character(page) Error in as.vector(x, "character") : cannot coerce type 'externalptr' to vector of type 'character' How can I convert page from an html class to a character vector so I can store it somewhere? The html functions like html_text or html_attr don't give me the whole source. I would like to store it so I can later re-load it with html(). Thanks. Answer 1: To save directly to a text

Scrape a URL with several tables with Rvest

Question: I am trying to learn how to do some scraping using the rvest package. I'm using this URL to load the information, and I am trying to get the information in the table marked as "advanced" in the URL: When I try to load the information, all I'm able to get is the first table. When I inspect the page using Google Chrome, I see that the numbers in the table are marked as class="right". So this is what I tried: library(rvest) library(stringr) url = url("https://www.basketball-reference.com/players/l

Scrape and Loop with Rvest

Question: I have reviewed several answers to similar questions on SO related to this topic, but none of them seem to work for me. (loop across multiple urls in r with rvest) (Harvest (rvest) multiple HTML pages from a list of urls) I have a list of URLs and I wish to grab the table from each and append it to a master dataframe. ## get all urls into one list page<- (0:2) urls <- list() for (i in 1:length(page)) { url<- paste0("https://www.mlssoccer.com/stats/season?page=",page[i]) urls[[i]] <- url } #

Mangling of French unicode when webscraping with rvest

Question: I'm looking at scraping a French website using the rvest package. library(rvest) url <- "https://www.vins-bourgogne.fr/nos-vins-nos-terroirs/tous-les-bourgognes/toutes-les-appellations-de-bourgogne-a-votre-portee,2378,9172.html?&args=Y29tcF9pZD0xMzg2JmFjdGlvbj12aWV3RnVsbExpc3RlJmlkPSZ8" s <- read_html(url) s %>% html_nodes('#resultatListeAppellation .lien') %>% html_text() I expect to see: Aloxe-Corton (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE) Auxey-Duresses (Appellation Village,