rvest

How can I POST a simple HTML form in R?

微笑、不失礼 submitted on 2019-12-30 00:39:11
Question: I'm relatively new to R programming and I'm trying to put some of what I'm learning in the Johns Hopkins Data Science track to practical use. Specifically, I would like to automate the process of downloading historical bond prices from the US Treasury website. Using both Firefox and R, I was able to determine that the US Treasury website uses a very simple HTML POST form to specify a single date for the quotes of interest. It then returns a table of secondary market information for all
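
A common approach is to replay the form's POST request with httr and parse the response with rvest. The sketch below is minimal; the action URL and field names are assumptions, so inspect the form in the browser's developer tools to find the real ones.

    library(httr)
    library(rvest)

    # Hypothetical action URL and field names; copy the real ones from
    # the form's markup or the browser's network tab.
    resp <- POST(
      "https://www.treasurydirect.gov/GA-FI/FedInvest/securityPriceDate",
      body = list(month = "12", day = "30", year = "2019"),
      encode = "form"
    )

    # Parse the returned page and pull out the quote table.
    quotes <- read_html(resp) %>% html_node("table") %>% html_table()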

Package “rvest” for web scraping an https site with a proxy

最后都变了- submitted on 2019-12-28 06:46:29
Question: I want to scrape an https website, but I failed. Here is my code:

    require(rvest)
    url <- "https://www.sunnyplayer.com/de/"
    content <- read_html(url)

But I get an error in the console: "Error in open.connection(x, "rb") : Timeout was reached". How can I fix this problem?

Answer 1: The same thing happens to me behind a proxy. To get around this, use download.file and specify a download location. You can then parse the file with read_html.

    download.file(url, destfile = 'C://whatever.html')
    content <- read_html('C:
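
A self-contained version of that workaround, with a temporary file standing in for the hard-coded path, might look like this (a sketch; download.file may itself need method or proxy options depending on the environment):

    # Fetch the page to disk first, then parse the local copy.
    library(rvest)

    url <- "https://www.sunnyplayer.com/de/"
    tmp <- tempfile(fileext = ".html")
    download.file(url, destfile = tmp, quiet = TRUE)
    content <- read_html(tmp)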

Scraping with rvest - fill with NAs when a tag is not present

和自甴很熟 submitted on 2019-12-28 03:03:05
Question: I want to parse this HTML and extract these elements from it: a) the p tag with class "normal_encontrado"; b) the div with class "price". Sometimes the p tag is not present for a product. If this is the case, an NA should be added to the vector collecting the text from these nodes. The idea is to have two vectors of the same length and then join them to make a data.frame. Any ideas? The HTML part:

    <html>
    <head></head>
    <body>
    <div class="product_price" id="product_price_186251">
    <p class=
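
The usual fix is to query each product container separately with html_node(), which returns a missing node when there is no match, so html_text() yields NA and the two vectors stay aligned. A sketch, assuming each product sits in its own div.product_price container and the price div lives alongside the p tag:

    library(rvest)

    page     <- read_html("products.html")  # hypothetical local file
    products <- html_nodes(page, "div.product_price")

    # html_node() (singular) per container returns a missing node when
    # the tag is absent, and html_text() turns that into NA.
    normal <- products %>% html_node("p.normal_encontrado") %>% html_text(trim = TRUE)
    price  <- products %>% html_node("div.price") %>% html_text(trim = TRUE)

    df <- data.frame(normal, price, stringsAsFactors = FALSE)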

Read HTML Table Into Data Frame with Hyperlinks in R

笑着哭i submitted on 2019-12-25 10:57:12
Question: I am trying to read an HTML table from a publicly accessible website into a data frame in R. The final column of the table contains hyperlinks, and I would like to read these hyperlinks into the table rather than the text that is displayed on the webpage. I've reviewed several posts here on StackOverflow and on other sites and have gotten almost there, but I haven't been able to read the hyperlinks themselves. The table I'm trying to read is here: http://mis.ercot.com/misapp/GetReports.do
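
One approach (a sketch; the selectors are assumptions about the page's markup) is to read the table for its text as usual, then separately pull the href attributes from the final column's anchor tags:

    library(rvest)

    url  <- "http://mis.ercot.com/misapp/GetReports.do"  # truncated in the question; use the full address
    page <- read_html(url)

    # Text content of the table as usual.
    tbl <- page %>% html_node("table") %>% html_table()

    # href attributes from the <a> tags in the last column.
    links <- page %>% html_nodes("table tr td:last-child a") %>% html_attr("href")

    # Replace the displayed text with the hyperlinks themselves.
    tbl[[ncol(tbl)]] <- links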

How to use rvest for web crawling correctly?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-25 10:02:42
Question: I am trying to crawl the page http://www.funda.nl/en/koop/leiden/ to get the maximum page number it can show, which is 29. Following some online tutorials, I located where 29 sits in the HTML code and wrote this R code:

    url <- read_html("http://www.funda.nl/en/koop/leiden/")
    url %>%
      html_nodes("#pagination-number.pagination-last") %>%
      html_attr("data-pagination-page") %>%
      as.numeric()

However, what I get is numeric(0). If I remove as.numeric(), I get character(0). How is this done?

Answer 1: I believe that
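
numeric(0)/character(0) means the selector matched nothing. A hedged sketch of one common fix: pull the page numbers out of the pagination links instead. The selector and URL pattern below are assumptions, and if the pagination is rendered by JavaScript, rvest alone will never see it:

    library(rvest)
    library(stringr)

    page <- read_html("http://www.funda.nl/en/koop/leiden/")

    # Collect every pagination link and extract a trailing page number
    # from URLs such as ".../p29/" (hypothetical pattern).
    hrefs <- page %>% html_nodes(".pagination a") %>% html_attr("href")
    pages <- as.numeric(str_extract(hrefs, "\\d+(?=/?$)"))
    max(pages, na.rm = TRUE)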

Web Scraping with rvest and R

妖精的绣舞 submitted on 2019-12-25 08:13:15
Question: I am trying to scrape the total assets of a particular fund, in this case ADAFX, from http://www.morningstar.com/funds/xnas/adafx/quote.html. But the result is always character (empty); what am I doing wrong? I have used rvest before with mixed results, so I figured it was time to get expert help from the community of trusted gurus (that's you).

    library(rvest)
    Symbol.i <- "ADAFX"
    url <- paste("http://www.morningstar.com/funds/xnas/", Symbol.i, "/quote.html", sep = "")
    tryCatch(NetAssets.i <- url %>% read
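
An empty character result usually means the node is not in the raw HTML at all; pages like this typically inject the figures with JavaScript, so the value never reaches rvest. A sketch of the corrected scraping call (the CSS selector is hypothetical); if the node is truly absent from the page source, a headless browser such as RSelenium, or the underlying XHR endpoint, is needed instead:

    library(rvest)

    Symbol.i <- "ADAFX"
    url <- paste0("http://www.morningstar.com/funds/xnas/", Symbol.i, "/quote.html")

    NetAssets.i <- tryCatch(
      read_html(url) %>%
        html_node("span.total-assets") %>%  # hypothetical selector
        html_text(trim = TRUE),
      error = function(e) NA_character_
    )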

Using R to scrape tables when URL does not change

爱⌒轻易说出口 submitted on 2019-12-25 03:26:40
Question: I'm relatively new to scraping in R and have had great luck with "rvest", but I've run into an issue I cannot solve. The website I am trying to scrape keeps the same URL no matter which page of the table you are on. For example, the main webpage is www.blah.com with one main table on it, and there are 10 further "next" pages of the same table, each just the next in sequence (I apologize for not linking to the actual page, as I cannot due to work issues). So, if I'm on page 1 of the table, the URL is www
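
When the visible URL never changes, the extra pages are usually fetched by a background POST (or XHR) request; find that request in the browser's network tab and replay it with httr. Everything in the sketch below (endpoint, field name, page count) is hypothetical:

    library(httr)
    library(rvest)

    get_page <- function(i) {
      resp <- POST(
        "https://www.blah.com/table-data",  # hypothetical endpoint
        body = list(page = i),              # hypothetical form field
        encode = "form"
      )
      read_html(resp) %>% html_node("table") %>% html_table()
    }

    # Fetch all 10 pages and stack them into one data frame.
    full_table <- do.call(rbind, lapply(1:10, get_page))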

Looping in RSelenium and Scraping

社会主义新天地 submitted on 2019-12-25 01:34:54
Question: I'm trying to scrape data from a website using RSelenium. I am able to navigate through the drop-downs individually, but when I run them in a loop I get an error. Also, after selecting all the values in the drop-downs, I want to store the name of the facility and the contact details in a table, which I'm not able to do so far.

    rm(list = ls())
    setwd("D:\\work_codes\\kvk\\data")
    getwd()
    library(RSelenium)
    library(rvest)
    library(XML)
    library(RCurl)
    library(magrittr)
    library(stringr)
    rd <- rsDriver()
    remDr <- rd[["client
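
Loops over drop-downs commonly fail because the option elements go stale after each selection; re-querying them on every pass usually fixes it. A hedged sketch (the URL and all CSS selectors are hypothetical):

    library(RSelenium)

    rd    <- rsDriver(browser = "firefox")
    remDr <- rd[["client"]]
    remDr$navigate("http://example.com/facilities")  # hypothetical URL

    n <- length(remDr$findElements(using = "css", "#state option"))
    results <- vector("list", n)

    for (i in seq_len(n)) {
      # Re-find the options each pass to avoid stale-element errors.
      opts <- remDr$findElements(using = "css", "#state option")
      opts[[i]]$clickElement()
      Sys.sleep(2)  # give the page time to update

      name    <- remDr$findElement(using = "css", ".facility-name")$getElementText()[[1]]
      contact <- remDr$findElement(using = "css", ".facility-contact")$getElementText()[[1]]
      results[[i]] <- data.frame(name, contact, stringsAsFactors = FALSE)
    }

    facilities <- do.call(rbind, results)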

Scrape first class node but not child using rvest

半腔热情 submitted on 2019-12-25 01:14:28
Question: There are many questions on this, but I couldn't find the answer I'm looking for. I want to extract a specific text with the class .quoteText, which my code does, but it also extracts all of the child nodes within .quoteText:

    url <- "https://www.goodreads.com/quotes/search?page=1&q=simone+de+beauvoir&utf8=%E2%9C%93"

    quote_text <- function(html){
      path <- read_html(html)
      path %>%
        html_nodes(".quoteText") %>%
        html_text(trim = TRUE) %>%
        str_trim(side = "both") %>%
        unlist()
    }

    quote_text(url)

with the result
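
A hedged sketch of one way to keep only the element's own text and skip its children: select the direct text nodes with XPath instead of calling html_text(), which flattens everything beneath the element:

    library(rvest)
    library(xml2)

    url  <- "https://www.goodreads.com/quotes/search?page=1&q=simone+de+beauvoir&utf8=%E2%9C%93"
    page <- read_html(url)

    quotes <- page %>%
      html_nodes(".quoteText") %>%
      xml_find_first("./text()") %>%  # first direct text child only
      xml_text(trim = TRUE)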