rvest

Save response from web-scraping as csv file

Submitted by 余生颓废 on 2019-12-10 18:59:05
Question: I downloaded a file from a website with rvest. How can I save the response as a CSV file? Step 1: Monkey-patch the rvest package as in this thread: How to submit login form in Rvest package w/o button argument. library(tidyverse) library(rvest) library(R.utils) # monkey-patch submit_form custom.submit_request <- function (form, submit = NULL) { is_submit <- function(x) { if (!exists("type", x) | is.null(x$type)){ return(F); } tolower(x$type) %in% c("submit", "image", "button") } submits <-
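A minimal sketch of the saving step, assuming the login form has already been submitted and the server's reply body is the CSV itself. The URL and credentials below are placeholders, not from the question:

```r
# Sketch: submit a login form, then save the downloaded response to disk.
# Assumes the response body is the CSV file; URL/credentials are hypothetical.
library(rvest)
library(httr)

session <- html_session("https://example.com/login")   # hypothetical URL
form    <- html_form(session)[[1]]
form    <- set_values(form, user = "me", password = "secret")
resp    <- submit_form(session, form)

# Write the raw bytes verbatim, preserving the CSV exactly as sent:
writeBin(content(resp$response, as = "raw"), "download.csv")

# Or round-trip through a data frame if you want to inspect/clean it first:
df <- read.csv(text = content(resp$response, as = "text", encoding = "UTF-8"))
write.csv(df, "download.csv", row.names = FALSE)
```

Writing the raw bytes is the safer default, since re-parsing can alter quoting or encoding.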

rvest package - Is it possible for html_text() to store an NA value if it does not find an attribute?

Submitted by 为君一笑 on 2019-12-10 18:39:14
Question: As the title states, I'm curious whether the html_text() function from the rvest package can store an NA value when it cannot find an attribute on a specific page. I'm currently running a scrape over 199 pages (which works fine; tested on a few variables already). Currently, when I search for a value that is present on only some (136) of the 199 pages, html_text() returns a vector of just 136 strings. This is not useful because without NAs I am unable to determine
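The usual fix is to iterate over the parent nodes and call html_node() (singular) per parent: unlike html_nodes(), which silently drops non-matches, html_node() returns a missing node whose html_text() is NA, so the result stays aligned. A self-contained sketch:

```r
# Sketch: html_node() yields one result per parent, filling misses with NA,
# so the output vector stays the same length as the number of pages/items.
library(rvest)

doc <- read_html("<div class='item'><span class='price'>10</span></div>
                  <div class='item'></div>")

items  <- html_nodes(doc, ".item")               # one node per item
prices <- html_text(html_node(items, ".price"))  # NA where .price is absent
prices
#> [1] "10" NA
```

Applied to the question: select the 199 page-level containers first, then html_node() the optional field inside each.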

Rvest extract option value and text from select

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-10 18:26:47
Question: Rvest select option. I think it is easiest to explain with a reproducible example. Website: http://www.verema.com/vinos/portada. I want to get the types of wines (Tipos de vinos); in the HTML code this is: <select class="campo select" id="producto_tipo_producto_id" name="producto[tipo_producto_id]"> <option value="">Todos</option> <option value="211">Tinto</option> <option value="213">Blanco</option> <option value="215">Rosado</option> <option value="216">Espumoso</option> <option value="217">Dulces y
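A sketch of one way to pull both the value attribute and the label of each option, keyed off the select element's id shown in the question:

```r
# Sketch: select the <option> children of the wine-type dropdown, then read
# the "value" attribute and the visible text in parallel.
library(rvest)

pg   <- read_html("http://www.verema.com/vinos/portada")
opts <- html_nodes(pg, "select#producto_tipo_producto_id option")

wines <- data.frame(
  value = html_attr(opts, "value"),
  text  = trimws(html_text(opts)),
  stringsAsFactors = FALSE
)
# e.g. value "211" pairs with text "Tinto"
```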

add new field to form with rvest

Submitted by 南笙酒味 on 2019-12-10 17:49:34
Question: I'm trying to download [the full] dynamically expanded [holdings] table using rvest, but am getting an "Unknown field names" error. s <- html_session("http://innovatoretfs.com/etf/?ticker=ffty") f <- html_form(s)[[1]] # the following line fails: f.new <- set_values(f, `__EVENTTARGET` = "ctl00$BodyPlaceHolder$ViewHoldingsLinkButton") ## subsequent lines are not tested ## doc <- submit_form(s, f.new) tabs <- xml_find_all(doc, "//table") holdings <- html_table(tabs, fill = T, trim = T)[[5]] I'm not
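set_values() rejects names the parsed form doesn't already contain, so one workaround is to add the hidden field to the form's field list directly before submitting. This pokes at rvest internals (the "input" field structure rvest builds when parsing a page) and may break across versions, so treat it as a sketch:

```r
# Sketch: inject a hidden __EVENTTARGET field into the parsed form, then
# submit. The field structure mimics what rvest creates for real <input>
# elements; this relies on pre-1.0 rvest internals.
library(rvest)

s <- html_session("http://innovatoretfs.com/etf/?ticker=ffty")
f <- html_form(s)[[1]]

f$fields[["__EVENTTARGET"]] <- structure(
  list(name  = "__EVENTTARGET",
       type  = "hidden",
       value = "ctl00$BodyPlaceHolder$ViewHoldingsLinkButton"),
  class = "input"
)

doc      <- submit_form(s, f)
tabs     <- html_nodes(doc, "table")
holdings <- html_table(tabs[[5]], fill = TRUE, trim = TRUE)
```

The table index 5 is carried over from the question and should be re-checked against the response.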

harvesting data via drop down list in R

Submitted by 只愿长相守 on 2019-12-10 11:57:27
Question: I am trying to harvest data from this website: http://www.lkcr.cz/seznam-lekaru-426.html (it's in Czech). I need to go through every possible combination of "Okres" (region) and "Obor" (specialization). I tried rvest, but it does not seem to find any dropdown list; html_form returns a list of length 0. Therefore, as I am still a newbie in R, how can I "ask" the webpage to show me a new combination of pages? Thank you, JH. Answer 1: I'd use the following: library(rvest) library(dplyr) library
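When html_form() comes back empty, a common fallback is to read the two dropdowns' option values once and then request each combination directly with httr::POST. The form parameter names below are placeholders I have not verified against the page source; check them in the browser's network tab first:

```r
# Sketch: enumerate every (region, specialization) pair and POST each one.
# The body parameter names are guesses and must be confirmed against the
# actual request the site sends.
library(rvest)
library(httr)

url <- "http://www.lkcr.cz/seznam-lekaru-426.html"
pg  <- read_html(url)

okres <- html_attr(html_nodes(pg, "select[name*='Okres'] option"), "value")
obor  <- html_attr(html_nodes(pg, "select[name*='Obor'] option"),  "value")

combos <- expand.grid(okres = okres, obor = obor, stringsAsFactors = FALSE)

results <- lapply(seq_len(nrow(combos)), function(i) {
  resp <- POST(url,
               body = list(filterOkresId = combos$okres[i],   # hypothetical names
                           filterObor    = combos$obor[i]),
               encode = "form")
  html_table(content(resp), fill = TRUE)
})
```

Adding a Sys.sleep() inside the loop is polite when the combination count is large.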

rvest table scraping including links

Submitted by 元气小坏坏 on 2019-12-10 11:33:38
Question: I would like to scrape some table data from Wikipedia. Some of the table columns include links to other articles that I'd like to preserve. I've tried this approach, which didn't preserve the URLs. Looking at the html_table() function description, I didn't find any option for including those. Is there another package or way to do this? library("rvest") url <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes" simp <- url %>% html() %>% html_nodes(xpath='//*[@id="mw-content-text"]/table[3
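One approach: let html_table() handle the cell text, then pull the anchors out of the same table node separately with html_attr("href") and join the two results. The table selector below is an assumption; the question's own XPath picks a specific table index:

```r
# Sketch: scrape the table text and the cell links from the same node,
# then combine. "table.wikitable" and the [[1]] index are assumptions --
# substitute the XPath from the question for the exact table.
library(rvest)

url  <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
page <- read_html(url)
tbl  <- html_nodes(page, "table.wikitable")[[1]]

episodes <- html_table(tbl, fill = TRUE)     # text only, links dropped

links <- html_nodes(tbl, "td a")
link_df <- data.frame(
  title = html_text(links),
  url   = html_attr(links, "href"),
  stringsAsFactors = FALSE
)
# merge(episodes, link_df, by.x = "Title", by.y = "title") to reattach URLs
```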

rvest - scrape 2 classes in 1 tag

Submitted by 非 Y 不嫁゛ on 2019-12-10 06:47:17
Question: I am new to rvest. How do I extract elements with two class names, or with only one class name, in a tag? This is my code and issue: doc <- paste("<html>", "<body>", "<span class='a1 b1'> text1 </span>", "<span class='b1'> text2 </span>", "</body>", "</html>" ) library(rvest) read_html(doc) %>% html_nodes(".b1") %>% html_text() # output: text1, text2 # what I want: text2 # I also want to extract only elements with 2 class names read_html(doc) %>% html_nodes(".a1 .b1") %>% html_text() # Output that i
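The catch is CSS syntax: ".a1 .b1" (with a space) means "a .b1 descendant of an .a1", while ".a1.b1" (no space) means "an element with both classes". For "only .b1", the :not() pseudo-class works in rvest's CSS support. A self-contained sketch:

```r
# Sketch: chained vs. spaced class selectors in rvest.
library(rvest)

doc <- read_html("<span class='a1 b1'>text1</span><span class='b1'>text2</span>")

# Both classes on the same element: no space between the class selectors.
html_text(html_nodes(doc, ".a1.b1"))        # "text1"

# Only b1, excluding elements that also carry a1:
html_text(html_nodes(doc, ".b1:not(.a1)"))  # "text2"
```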

How to post within a rvest html_session?

Submitted by 南笙酒味 on 2019-12-09 23:04:48
Question: How can I POST "within" an html session? After I open a session via a <- rvest::html_session(url), I tried: library(httr) POST(path, add_headers(setNames(as.character(headers(a)), names(headers(a)))), set_cookies(setNames(cookies(a)$value, cookies(a)$name)), body = list(...), encode = "json") But this handles my request as if I were not logged in. Any suggestions? I am looking for something like POST(session, path, body, ...). Answer 1: OK, after some digging into it I solved it by using: x %>%
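The answer is truncated, but the usual route here is rvest's unexported request_POST(), which issues the POST through the session so the login cookies are reused automatically. Being internal (pre-1.0 rvest), it can change without notice; the URLs below are placeholders:

```r
# Sketch: POST through an existing rvest session via the unexported
# rvest:::request_POST(). Internal API -- may break across rvest versions.
# URLs are hypothetical.
library(rvest)
library(httr)

a <- html_session("https://example.com/login")   # hypothetical URL
# ... log in with html_form()/set_values()/submit_form() here ...

resp <- rvest:::request_POST(a, "https://example.com/api/endpoint",
                             body = list(key = "value"),
                             encode = "json")
status_code(resp$response)
```

This gives exactly the POST(session, path, body, ...) shape the question asks for.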

unable to install rvest package

Submitted by 感情迁移 on 2019-12-09 15:53:52
Question: I need to install the rvest package for R version 3.1.2 (2014-10-31). I get these errors: checking whether the C++ compiler supports the long long type... no *** stringi cannot be built. Upgrade your C++ compiler's settings ERROR: configuration failed for package ‘stringi’ * removing ‘/usr/local/lib64/R/library/stringi’ ERROR: dependency ‘stringi’ is not available for package ‘stringr’ * removing ‘/usr/local/lib64/R/library/stringr’ ERROR: dependency ‘stringi’ is not available for package ‘httr’ *
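The root failure is stringi refusing to build against an old C++ compiler; everything after that is a dependency cascade. Two common workarounds, sketched below: point R at a newer g++ via ~/.R/Makevars, or pass stringi's configure flag that relaxes the C++ requirement (check the flag name against your stringi version's INSTALL notes, as it has varied):

```r
# Sketch: retry stringi with its configure flag for older compilers
# (flag name per stringi's install docs -- verify for your version),
# then install rvest, which pulls the rest of the chain.
install.packages("stringi", configure.args = "--disable-cxx11")
install.packages("rvest")
```

If the flag is unavailable, installing a newer gcc/g++ and setting CXX in ~/.R/Makevars is the more durable fix.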

Scraping linked HTML webpages by looping the rvest::follow_link() function

Submitted by 倖福魔咒の on 2019-12-09 07:00:04
Question: How can I loop the rvest::follow_link() function to scrape linked webpages? Use case: identify all Lego Movie cast members, follow all Lego Movie cast member links, and grab a table of each movie (+ year) for all cast members. The required selectors are below: library(rvest) lego_movie <- html("http://www.imdb.com/title/tt1490017/") lego_movie <- lego_movie %>% html_nodes(".itemprop , .character a") %>% html_text() # follow cast links (".itemprop .itemprop") # grab tables of all movies and
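A sketch of the loop: collect the cast-member link texts once from a session, then lapply over them, following each link by its text and pulling the tables from the resulting page. The selectors are the ones from the question; which of the returned tables is the filmography is left to verify:

```r
# Sketch: loop follow_link() over every cast-member name. Selectors come
# from the question; the filmography table index must be checked per page.
library(rvest)

lego <- html_session("http://www.imdb.com/title/tt1490017/")

cast_names <- lego %>%
  html_nodes(".itemprop .itemprop") %>%
  html_text()

filmographies <- lapply(cast_names, function(nm) {
  member <- follow_link(lego, nm)        # follows the link whose text is nm
  html_table(member, fill = TRUE)        # all tables; pick the filmography
})
```

follow_link() restarts from the original session each iteration, so the loop never drifts away from the cast page.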