rvest

Using rvest to scrape a website - Selecting html node?

Submitted by 与世无争的帅哥 on 2019-12-08 13:57:32
Question: I have a question about my latest rvest scrape. I want to scrape this page (and some other stocks as well): http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1 I need a list of the market capitalization, which is the first box in the second line. This list should contain approx. 50-100 stocks. I am using rvest for that.

```r
library(rvest)
html <- read_html("http://www.finviz.com/quote.ashx?t=A")
cast <- html_nodes(html, "table-dark-row")
```

The problem is, I cannot get around html_nodes. Any idea about
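A likely culprit (a sketch, not a verified fix): `html_nodes()` takes a CSS selector, and `"table-dark-row"` matches a tag named `table-dark-row`, not the class. A leading dot selects by class:

```r
library(rvest)

html <- read_html("http://www.finviz.com/quote.ashx?t=AA")
# ".table-dark-row" (leading dot) selects elements whose class is
# table-dark-row; "table-dark-row" alone would match a tag of that name.
cells <- html %>%
  html_nodes(".table-dark-row td") %>%
  html_text()
# If the usual label/value layout holds, the value follows its label:
market_cap <- cells[which(cells == "Market Cap") + 1]
```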

Scrape with a loop and avoid 404 error

Submitted by 喜夏-厌秋 on 2019-12-08 12:11:05
Question: I am trying to scrape Wikipedia for certain astronomy-related definitions for my project. The code works pretty well, but I am not able to avoid 404s. I tried tryCatch. I think I am missing something here. I am looking for a way to overcome 404s while running a loop. Here is my code:

```r
library(rvest)
library(httr)
library(XML)
library(tm)

topic <- c("Neutron star", "Black hole", "sagittarius A")
for (i in topic) {
  site <- paste("https://en.wikipedia.org/wiki/", i)
  site <- read_html(site)
  stats <- xmlValue
```
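One common way to survive a 404 inside a loop (a sketch using the topics from the question, not the asker's full solution): wrap `read_html()` in `tryCatch()` so a failing page yields `NULL` instead of aborting. Note also that `paste()` with its default separator puts a space into the URL; `paste0()` plus underscores matches Wikipedia's title format.

```r
library(rvest)

topic <- c("Neutron star", "Black hole", "sagittarius A")
pages <- list()
for (i in topic) {
  # paste0() avoids the stray space; Wikipedia titles use underscores
  url <- paste0("https://en.wikipedia.org/wiki/", gsub(" ", "_", i))
  pages[[i]] <- tryCatch(
    read_html(url),
    error = function(e) {
      message("Skipping ", i, ": ", conditionMessage(e))
      NULL   # a 404 becomes a NULL entry instead of stopping the loop
    }
  )
}
```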

Web scraping a password-protected website using R

Submitted by 孤街醉人 on 2019-12-08 11:31:42
Question: I would like to web-scrape Yammer data using R, but in order to do so I first have to log in to this page (which is the authentication for an app that I created): https://www.yammer.com/dialog/authenticate?client_id=iVGCK1tOhbZGS7zC8dPjg I am able to get the Yammer data once I log in to this page, but all of this is in the browser via the standard Yammer URLs (https://www.yammer.com/api/v1/messages/received.json). I have read through similar questions and tried the suggestions, but still can't get through this
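For the API side, one approach (a sketch; the token value is a placeholder you would obtain from Yammer's OAuth flow for the registered app) is to skip the browser login entirely and send an OAuth bearer token with httr:

```r
library(httr)

token <- "YOUR_ACCESS_TOKEN"  # placeholder: obtain via Yammer's OAuth flow

resp <- GET(
  "https://www.yammer.com/api/v1/messages/received.json",
  add_headers(Authorization = paste("Bearer", token))
)
messages <- content(resp, as = "parsed")
```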

Scraping a JavaScript object and converting to JSON within R/Rvest

Submitted by 我是研究僧i on 2019-12-08 11:23:53
Question: I am scraping the following website: https://www.banorte.com/wps/portal/ixe/Home/indicadores/tipo-de-cambio I am trying to get the table of currency exchange rates into an R data frame via the rvest package, but the table itself is built from a JavaScript variable within the HTML code. I located the relevant CSS selector and now I have this:

```r
library(rvest)
banorte <- "https://www.banorte.com/wps/portal/ixe/Home/indicadores/tipo-de-cambio/" %>%
  read_html() %>%
  html_nodes('#indicadores
```
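A common pattern for this situation (a sketch; the variable name and regex below are assumptions about the page's script, not verified against it): pull the `<script>` text with rvest, cut the JSON literal out with a regex, and parse it with jsonlite.

```r
library(rvest)
library(jsonlite)

page <- read_html("https://www.banorte.com/wps/portal/ixe/Home/indicadores/tipo-de-cambio")
scripts <- html_text(html_nodes(page, "script"))
# "datos" is a hypothetical variable name; inspect the page source to
# find the real variable that holds the exchange-rate array.
js <- scripts[grepl("datos", scripts)][1]
json_txt <- sub(".*datos\\s*=\\s*(\\[.*?\\]);.*", "\\1", js)
rates <- fromJSON(json_txt)
```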

rvest: how to avoid "Error in open.connection(x, "rb") : HTTP error 404" in R

Submitted by 这一生的挚爱 on 2019-12-08 03:26:19
Question: I'd like to take some information from a list of websites. I have a list of URLs, but some of them don't work/exist. The error is:

Error in open.connection(x, "rb") : HTTP error 404

```r
library(rvest)
url_web <- c("https://it.wikipedia.org/wiki/Roma",
             "https://it.wikipedia.org/wiki/Milano",
             "https://it.wikipedia.org/wiki/Napoli",
             "https://it.wikipedia.org/wiki/Torinoooo",  # for example, this one is an error
             "https://it.wikipedia.org/wiki/Palermo",
             "https://it.wikipedia.org/wiki/Venezia")
```
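One way to keep going past the dead URLs (a sketch, using `url_web` from the question): `purrr::possibly()` wraps `read_html()` so that a 404 returns `NULL` instead of raising the error.

```r
library(rvest)
library(purrr)

# possibly() turns errors into the "otherwise" value, so the bad
# Torinoooo URL yields NULL rather than stopping the whole map().
safe_read <- possibly(read_html, otherwise = NULL)
pages <- map(url_web, safe_read)
ok <- !map_lgl(pages, is.null)   # which URLs actually resolved
```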

With rvest, how to extract html contents from the object returned by submit_form()

Submitted by 妖精的绣舞 on 2019-12-08 02:43:47
Question: I am trying to download some traffic data from pems.dot.ca.gov, following this topic.

```r
rm(list = ls())
library(rvest)
library(xml2)
library(httr)

url <- "http://pems.dot.ca.gov/?report_form=1&dnode=tmgs&content=tmg_volumes&tab=tmg_vol_ts&export=&tmg_station_id=74250&s_time_id=1369094400&s_time_id_f=05%2F21%2F2013&e_time_id=1371772740&e_time_id_f=06%2F20%2F2013&tod=all&tod_from=0&tod_to=0&dow_5=on&dow_6=on&tmg_sub_id=all&q=obs_flow&gn=hour&html.x=34&html.y=8"
pgsession <- html_session(url)
pgform
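The usual continuation of this flow (a sketch; the field names passed to `set_values()` are assumptions, not the real PeMS form fields): `submit_form()` returns a new session, and that session can be queried with `html_nodes()`/`html_table()` directly, just like a parsed page.

```r
library(rvest)

pgsession <- html_session(url)            # url as defined in the question
pgform <- html_form(pgsession)[[1]]
filled <- set_values(pgform,
                     username = "user",   # assumed field names; print
                     password = "pass")   # html_form() output for the real ones
result <- submit_form(pgsession, filled)
# The returned session behaves like a parsed page:
tables <- result %>% html_nodes("table") %>% html_table(fill = TRUE)
```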

How to fetch headlines from Google News using rvest in R?

Submitted by 一曲冷凌霜 on 2019-12-08 02:14:58
Question: I want to fetch headlines from Google News using rvest in R. I have done this so far:

```r
library(rvest)
url <- read_html("https://www.google.com/search?hl=en&tbm=nws&authuser=0&q=american+president")
selector_name <- "r"
fnames <- html_nodes(x = url, css = selector_name) %>% html_text()
```

but the result is:

```
> fnames
character(0)
```

This is the inspect-element view of a headline:

```
<h3 class="r"><a href="/browse.php/PbtvpluS/QDvUJpC7/KoWCA9QE/VTTOFmVJ/bIp8sMa8/qKjgkcAu/Hgcr9lyg/4bibGCOO/nZ82ojLo/_2B602Vo/0sOSEbba
```
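A likely fix (a sketch, with the caveat that Google's markup changes often and may be served differently to non-browser clients): the selector `"r"` matches a nonexistent `<r>` tag; a class selector needs a leading dot, and the inspected markup puts the headline text in an anchor inside `<h3 class="r">`.

```r
library(rvest)

url <- read_html("https://www.google.com/search?hl=en&tbm=nws&q=american+president")
# "h3.r a" = anchors inside <h3 class="r">, matching the inspected markup
fnames <- url %>%
  html_nodes("h3.r a") %>%
  html_text()
```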

Web Scrape: Select Fields from Drop Downs, Extract Resulting Data

Submitted by 喜你入骨 on 2019-12-08 01:01:04
Question: Trying to do some web scraping in R and could use some help. I would like to extract the data in the table on this page: http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx But I would like to first select County from the left-most drop-down, then select Alameda County (CA) from the next drop-down, then scrape the data in the table. This is what I have so far, but I think I know why it's not working: rvest's form functions are suited to filling out a basic form, not selecting from drop-downs on
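When drop-downs fire JavaScript rather than a plain form POST, rvest alone cannot drive them. A common workaround (a sketch only — the endpoint and parameters below are placeholders you would discover in the browser's network tab, not a documented API) is to call the XHR endpoint the page itself uses:

```r
library(httr)
library(jsonlite)

# Placeholder endpoint/parameters: watch the network tab while choosing
# "County" and "Alameda County (CA)" to find the real request the page makes.
resp <- GET("http://droughtmonitor.unl.edu/SOME/ENDPOINT",
            query = list(area = "county", fips = "06001"))
dat <- fromJSON(content(resp, as = "text"))
```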

Web scraping the data behind every url from a list of urls

Submitted by 主宰稳场 on 2019-12-07 18:01:25
I am trying to gather a dataset from a site called ICObench. I've managed to extract the names of each ICO across the 91 pages using rvest and purrr, but I'm confused as to how I can extract the data behind each name in the list. All the names are clickable links. This is the code so far:

```r
url_base <- "https://icobench.com/icos?page=%d&filterBonus=&filterBounty=&filterTeam=&filterExpert=&filterSort=&filterCategory=all&filterRating=any&filterStatus=ended&filterCountry=any&filterRegistration=0&filterExcludeArea=none&filterPlatform=any&filterCurrency=any&filterTrading=any&s=&filterStartAfter=
```
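One way to follow the links (a sketch; the `"^/ico/"` pattern is an assumption about ICObench's profile paths, and the bare `<a>` selector should be narrowed to the name links): build the 91 page URLs with `sprintf()`, harvest the hrefs, then map `read_html()` over the resulting profile URLs.

```r
library(rvest)
library(purrr)

pages <- sprintf(url_base, 1:91)          # url_base from the question
links <- map(pages, ~ read_html(.x) %>%
               html_nodes("a") %>%        # narrow this to the ICO-name links
               html_attr("href")) %>%
  flatten_chr() %>%
  unique()
profile_urls <- paste0("https://icobench.com",
                       links[grepl("^/ico/", links)])  # assumed path pattern
profiles <- map(profile_urls, read_html)  # then html_nodes() each profile
```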

Downloading a file after login using an HTTPS URL

Submitted by 孤人 on 2019-12-07 07:05:36
Question: I am trying to download an Excel file which I have the link to, but I am required to log in to the page before I can download it. I have successfully gotten past the login page with rvest, RCurl, and httr, but I am having an extremely difficult time downloading the file after I have logged in.

```r
url <- "https://website.com/console/login.do"
download_url <- "https://website.com/file.xls"

session <- html_session(url)
form <- html_form(session)[[1]]
filled_form <- set_values(form, userid = user,
```
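A sketch of the usual continuation ("password" is an assumed field name; check `html_form()` output for the real one): submit the form, then fetch the file through the same session with `jump_to()`, which carries the login cookies, and write the raw response bytes to disk.

```r
library(rvest)

filled_form <- set_values(form, userid = "user", password = "pass")
session <- submit_form(session, filled_form)
download <- jump_to(session, download_url)   # reuses the session's cookies
writeBin(download$response$content, "file.xls")
```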