rvest

rvest: follow different links with same tag

点点圈 submitted on 2019-12-11 11:52:59
Question: I'm doing a little project in R that involves scraping some football data from a website. Here's the link to one of the years of data: http://www.sports-reference.com/cfb/years/2007-schedule.html. As you can see, there is a "Date" column with the dates hyperlinked; each hyperlink takes you to the stats from that particular game, which is the data I would like to scrape. Unfortunately, a lot of games take place on the same dates, which means their hyperlinks are the same. So if I scrape the…
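
A minimal sketch of one common approach, assuming the date links can be reached with a generic CSS selector on the schedule table (the selector and the "boxscores" URL pattern below are assumptions and may need adjusting): even though the visible dates repeat, each link's href points to a distinct game page, so collect the hrefs rather than the link text and loop over them.

library(rvest)

schedule_url <- "http://www.sports-reference.com/cfb/years/2007-schedule.html"
page <- read_html(schedule_url)

# Collect every link in the schedule tables; repeated dates still have distinct hrefs.
game_links <- page %>%
  html_nodes("table a") %>%
  html_attr("href")

# Keep only game links (assumed to contain "boxscores") and make them absolute.
game_links <- paste0("http://www.sports-reference.com",
                     grep("boxscores", game_links, value = TRUE))

# Each game page can then be read and scraped in turn.
game_pages <- lapply(game_links, read_html)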

rvest missing nodes --> NA

孤街浪徒 submitted on 2019-12-11 10:09:35
Question: I'm trying to search for nodes in an HTML document using rvest in R. In the code below, I would like to know how to return a NULL or NA when "s_BadgeTop*" is missing. It is only for academic purposes. <div style="margin-bottom:0.5em;"><div><div style="float:left;">Por </div><div style="float:left;"><a href="/gp/pdp/profile/XXX" ><span style = "font-weight: bold;">JOHN</span></a> (UK) - <a href="/gp/cdp/member-reviews/XXX">Ver todas las opiniones</a><br /><span class="cmtySprite s_BadgeTop1000 " >…
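
A minimal sketch of how missing nodes can be turned into NA, using a toy document rather than the real review HTML: html_node() (singular) keeps a placeholder for a non-matching node, which can be tested for and replaced with NA.

library(rvest)

has_badge <- read_html('<div><span class="cmtySprite s_BadgeTop1000"></span></div>')
no_badge  <- read_html('<div><span class="other"></span></div>')

badge_class <- function(doc) {
  node <- html_node(doc, xpath = "//span[contains(@class, 's_BadgeTop')]")
  if (inherits(node, "xml_missing")) NA_character_ else html_attr(node, "class")
}

badge_class(has_badge)  # "cmtySprite s_BadgeTop1000"
badge_class(no_badge)   # NA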

R: using rvest and purrr::map_df to build a data frame: how to deal with incomplete input [duplicate]

↘锁芯ラ submitted on 2019-12-11 08:45:07
Question: This question already has answers here: Scraping with rvest - complete with NAs when tag is not present (4 answers). Closed 7 months ago. I am web scraping pages with rvest and turning the collected data into a data frame using purrr::map_df. The problem I ran into is that not all webpages have content for every html_nodes selector that I specify, and map_df ignores such incomplete webpages. I would want map_df to include those webpages and write NA wherever an html_nodes call does not match any content.
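
A minimal sketch of the usual fix, with hypothetical selectors ".title" and ".price" standing in for the real ones: build the per-page scraper around html_node() (singular), which returns an empty match instead of dropping it, so html_text() yields NA and every page still contributes one row to map_df().

library(rvest)
library(purrr)

scrape_page <- function(url) {
  page <- read_html(url)
  tibble::tibble(
    title = page %>% html_node(".title") %>% html_text(),  # NA when ".title" is absent
    price = page %>% html_node(".price") %>% html_text()   # NA when ".price" is absent
  )
}

# urls   <- c("https://example.com/page1", "https://example.com/page2")
# result <- map_df(urls, scrape_page)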

Download csv file from webpage after submitting form from dropdown using rvest package in R

百般思念 submitted on 2019-12-11 07:23:11
Question: I am working on a web scraping project to download various CSV files from this webpage: https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link. I would like to be able to programmatically choose the various reported quarters from the drop-down list, hit submit (note that the URL for the page doesn't change for each quarter), and then "Download CSV" for each of the quarters. As a disclaimer, I am a novice with rvest, and below is my attempt at a solution. I first checked…
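
A minimal sketch of the standard rvest form pattern, assuming the quarter selector is an ordinary HTML form (on this site it may well be JavaScript-driven, in which case rvest alone cannot drive it and a headless browser such as RSelenium would be needed). The field name and value below are guesses to be replaced after inspecting the parsed form.

library(rvest)

url   <- "https://whalewisdom.com/filer/blue-harbour-group-lp"
sess  <- html_session(url)
forms <- html_form(sess)   # inspect this to find the quarter dropdown, if any

# form   <- set_values(forms[[1]], quarter = "2019-09-30")   # hypothetical field name/value
# result <- submit_form(sess, form)
# writeBin(result$response$content, "holdings.csv")          # save the returned CSV bytes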

PhantomJS with R

两盒软妹~` submitted on 2019-12-11 07:15:49
Question: I am trying to scrape data from a web page. Since the page has dynamic content, I used PhantomJS to handle it. But with the code I am using, I can only download the data already visible on the web page; I need to input a date range and then submit it to get all the data I want. Here is the code I used: library(xml2) library(rvest) connection <- "pr.js" writeLines(sprintf("var page=require('webpage').create(); var fs = require('fs'); page.open('%s',function(){ console.log(page.content);//page…
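
A minimal sketch of the general PhantomJS-plus-rvest pattern, assuming phantomjs is on the PATH and using a placeholder URL: render the page with PhantomJS, write the rendered HTML to disk, then parse it with rvest. Filling in the date range and submitting would additionally require page.evaluate() calls inside the PhantomJS script, which are not shown here.

library(rvest)

js <- "
var page = require('webpage').create();
var fs = require('fs');
page.open('http://example.com', function () {
  fs.write('rendered.html', page.content, 'w');  // save the rendered DOM
  phantom.exit();
});
"
writeLines(js, "render.js")
system("phantomjs render.js")

rendered <- read_html("rendered.html")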

scraping an interactive table in R with rvest

青春壹個敷衍的年華 submitted on 2019-12-11 05:18:12
Question: I'm trying to scrape the scrolling table from the following link: http://proximityone.com/cd114_2013_2014.htm. I'm using rvest but am having trouble finding the correct XPath for the table. My current code is as follows: url <- "http://proximityone.com/cd114_2013_2014.htm" table <- gis_data_html %>% html_node(xpath = '//span') %>% html_table() Currently I get the error "no applicable method for 'html_table' applied to an object of class "xml_missing"". Anyone know what I would need to change to…
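
A minimal diagnostic sketch: the xml_missing error means the //span XPath matched nothing, and gis_data_html is never created from url in the code above. A first step is to read the page and list the tables rvest can actually see; if the scrolling table is injected by JavaScript or lives in an iframe, it will not appear here and a headless browser (e.g. RSelenium) would be needed.

library(rvest)

url <- "http://proximityone.com/cd114_2013_2014.htm"
gis_data_html <- read_html(url)

# How many <table> elements does the static HTML actually contain?
gis_data_html %>% html_nodes("table") %>% length()

# Parse whatever static tables exist; fill = TRUE tolerates ragged rows.
tables <- gis_data_html %>% html_nodes("table") %>% html_table(fill = TRUE)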

Getting links while web scraping Google in R

旧时模样 submitted on 2019-12-11 04:28:43
Question: I am trying to get the links returned by a Google search, that is, all of these result links. I have done this kind of scraping before, but in this case I do not understand why it doesn't work, so I ran the following lines: library(rvest) url<-"https://www.google.es/search?q=Ediciones+Peña+sl+telefono" content_request<-read_html(url) content_request %>% html_nodes(".r") %>% html_attr("href") I have tried other nodes and obtain similar answers: content_request %>% html_nodes(".LC20lb") %>% html_attr("href")…
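
A minimal sketch of one likely explanation and workaround: the plain-HTML page Google serves to non-JavaScript clients typically does not use the ".r" or ".LC20lb" classes seen in the browser, and result hrefs are wrapped as "/url?q=...". (Note that scraping Google results may be against its terms of service.)

library(rvest)

url  <- "https://www.google.es/search?q=Ediciones+Peña+sl+telefono"
page <- read_html(url)

# Grab every href, then keep only the redirect-wrapped result links.
links <- page %>% html_nodes("a") %>% html_attr("href")
result_links <- grep("^/url\\?q=", links, value = TRUE)

# Strip the "/url?q=" wrapper and any trailing tracking parameters.
result_links <- sub("^/url\\?q=([^&]*).*", "\\1", result_links)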

Adding whitespace to text elements

余生长醉 submitted on 2019-12-11 03:54:49
Question: Is there a way to add whitespace to each element that contains text? For this example: movie <- read_html("http://www.imdb.com/title/tt1490017/") cast <- html_nodes(movie, "#titleCast span.itemprop") cast %>% html_structure() [[1]] <span.itemprop [itemprop]> {text} [[2]] <span.itemprop [itemprop]> {text} I would like to add a trailing whitespace to each text element before using html_text(). I have another use case where I want to use html_text() higher up in the document hierarchy. The…
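
A minimal sketch of one way to do this with xml2's text setter: append a space to every text node under the selected elements, so that a later html_text() call higher up in the tree keeps the words separated.

library(rvest)
library(xml2)

movie <- read_html("http://www.imdb.com/title/tt1490017/")
cast  <- html_nodes(movie, "#titleCast span.itemprop")

# Find the text nodes inside the cast elements and add a trailing space to each.
text_nodes <- xml_find_all(cast, ".//text()")
for (i in seq_along(text_nodes)) {
  xml_set_text(text_nodes[[i]], paste0(xml_text(text_nodes[[i]]), " "))
}

html_text(cast)

Newer versions of rvest also offer html_text2(), which inserts whitespace between elements automatically and may remove the need for this workaround.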

rvest: “unknown field names” when attempting to set form

﹥>﹥吖頭↗ submitted on 2019-12-11 03:32:56
Question: I'm attempting to work through a web form so I can scrape data. library(rvest) url <- "https://iemweb.biz.uiowa.edu/pricehistory/pricehistory_SelectContract.cfm?market_ID=214" pg.form <- html_form(html(url)) which returns: pg.form [[1]] <form> '<unnamed>' (POST PriceHistory_GetData.cfm) <input HIDDEN> 'Market_ID': 214 <select> 'Month' [1/12] <select> 'Year' [0/2] <input SUBMIT> '': Get Prices My mistake is to think that I need to set values for the Month and Year fields, but this is a…
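
A minimal sketch of one workaround, assuming the older rvest API shown in the question: if set_values() rejects the Month/Year select fields, their values can be assigned directly on the parsed form object before submitting (the values "8" and "2014" are placeholders; inspect the form to see which options it actually accepts).

library(rvest)

url       <- "https://iemweb.biz.uiowa.edu/pricehistory/pricehistory_SelectContract.cfm?market_ID=214"
pgsession <- html_session(url)
pg.form   <- html_form(pgsession)[[1]]

# Set the <select> fields by assigning their values directly (hypothetical values).
pg.form$fields$Month$value <- "8"
pg.form$fields$Year$value  <- "2014"

result <- submit_form(pgsession, pg.form)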

how to reuse a session to avoid repeated login when scraping with rvest?

不羁的心 submitted on 2019-12-11 02:08:40
Question: I developed some code to scrape traffic data based on this topic. I need to scrape many pages after logging in, but right now my code seems to log in to the site again for every URL. How can I reuse the session to avoid repeated logins so that, hopefully, the code runs faster? Here's the pseudo-code: generateURL <- function(siteID){return siteURL} scrapeContent <- function(siteURL, session, filled_form){return content} mainPageURL <- 'http://pems.dot.ca.gov/' pgsession <- html_session…
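
A minimal sketch of the usual fix, assuming the older rvest session API from the question (html_session/jump_to) and hypothetical login field names: log in once, then visit every data URL with jump_to() on that same session so the authentication cookies are reused instead of logging in again per URL.

library(rvest)

mainPageURL <- 'http://pems.dot.ca.gov/'
pgsession   <- html_session(mainPageURL)
pgform      <- html_form(pgsession)[[1]]

# Field names here are placeholders; match them to the real login form.
filled_form <- set_values(pgform, username = "user", password = "pass")
pgsession   <- submit_form(pgsession, filled_form)   # log in once

# for (id in siteIDs) {
#   page    <- jump_to(pgsession, generateURL(id))   # reuses the logged-in session
#   content <- page %>% html_nodes("table") %>% html_table()
# }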