rvest

rvest: follow different links with same tag

点点圈 submitted on 2019-12-11 11:52:59
Question: I'm doing a little project in R that involves scraping some football data from a website. Here's the link to one of the years of data: http://www.sports-reference.com/cfb/years/2007-schedule.html. As you can see, there is a "Date" column with the dates hyperlinked; each hyperlink takes you to the stats from that particular game, which is the data I would like to scrape. Unfortunately, a lot of games take place on the same dates, which means their hyperlinks are the same. So if I scrape the…
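
A minimal sketch of one common approach, assuming the date links can be reached with a generic CSS selector on the schedule table (the selector and the "boxscores" URL pattern below are assumptions and may need adjusting): even though the visible dates repeat, each link's href points to a distinct game page, so collect the hrefs rather than the link text and loop over them.

library(rvest)

schedule_url <- "http://www.sports-reference.com/cfb/years/2007-schedule.html"
page <- read_html(schedule_url)

# Collect every link in the schedule tables; repeated dates still have distinct hrefs.
game_links <- page %>%
  html_nodes("table a") %>%
  html_attr("href")

# Keep only game links (assumed to contain "boxscores") and make them absolute.
game_links <- paste0("http://www.sports-reference.com",
                     grep("boxscores", game_links, value = TRUE))

# Each game page can then be read and scraped in turn.
game_pages <- lapply(game_links, read_html)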

rvest missing nodes --> NA

孤街浪徒 submitted on 2019-12-11 10:09:35
Question: I'm trying to search for nodes in an HTML document using rvest in R. In the code below, I would like to know how to return a NULL or NA when "s_BadgeTop*" is missing. It is only for academic purposes. <div style="margin-bottom:0.5em;"><div><div style="float:left;">Por </div><div style="float:left;"><a href="/gp/pdp/profile/XXX" ><span style = "font-weight: bold;">JOHN</span></a> (UK) - <a href="/gp/cdp/member-reviews/XXX">Ver todas las opiniones</a><br /><span class="cmtySprite s_BadgeTop1000 " >…
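
A minimal sketch of how missing nodes can be turned into NA, using a toy document rather than the real review HTML: html_node() (singular) keeps a placeholder for a non-matching node, which can be tested for and replaced with NA.

library(rvest)

has_badge <- read_html('<div><span class="cmtySprite s_BadgeTop1000"></span></div>')
no_badge  <- read_html('<div><span class="other"></span></div>')

badge_class <- function(doc) {
  node <- html_node(doc, xpath = "//span[contains(@class, 's_BadgeTop')]")
  if (inherits(node, "xml_missing")) NA_character_ else html_attr(node, "class")
}

badge_class(has_badge)  # "cmtySprite s_BadgeTop1000"
badge_class(no_badge)   # NA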

R: using rvest and purrr::map_df to build a data frame: how to deal with incomplete input [duplicate]

↘锁芯ラ submitted on 2019-12-11 08:45:07
Question: This question already has answers here: Scraping with rvest - complete with NAs when tag is not present (4 answers). Closed 7 months ago. I am web scraping pages with rvest and turning the collected data into a data frame using purrr::map_df. The problem I ran into is that not all webpages have content for every html_nodes selector that I specify, and map_df ignores such incomplete webpages. I would want map_df to include those webpages and write NA wherever an html_nodes call does not match any content.
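
A minimal sketch of the usual fix, with hypothetical selectors ".title" and ".price" standing in for the real ones: build the per-page scraper around html_node() (singular), which returns an empty match instead of dropping it, so html_text() yields NA and every page still contributes one row to map_df().

library(rvest)
library(purrr)

scrape_page <- function(url) {
  page <- read_html(url)
  tibble::tibble(
    title = page %>% html_node(".title") %>% html_text(),  # NA when ".title" is absent
    price = page %>% html_node(".price") %>% html_text()   # NA when ".price" is absent
  )
}

# urls   <- c("https://example.com/page1", "https://example.com/page2")
# result <- map_df(urls, scrape_page)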

Download csv file from webpage after submitting form from dropdown using rvest package in R

百般思念 submitted on 2019-12-11 07:23:11
Question: I am working on a web scraping project to download various CSV files from this webpage: https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link. I would like to be able to programmatically choose the various reported quarters from the drop-down list, hit submit (note that the URL for the page doesn't change for each quarter), and then "Download CSV" for each of the quarters. As a disclaimer, I am a novice with rvest, and below is my attempt at a solution. I first checked…
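
A minimal sketch of the standard rvest form pattern, assuming the quarter selector is an ordinary HTML form (on this site it may well be JavaScript-driven, in which case rvest alone cannot drive it and a headless browser such as RSelenium would be needed). The field name and value below are guesses to be replaced after inspecting the parsed form.

library(rvest)

url   <- "https://whalewisdom.com/filer/blue-harbour-group-lp"
sess  <- html_session(url)
forms <- html_form(sess)   # inspect this to find the quarter dropdown, if any

# form   <- set_values(forms[[1]], quarter = "2019-09-30")   # hypothetical field name/value
# result <- submit_form(sess, form)
# writeBin(result$response$content, "holdings.csv")          # save the returned CSV bytes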

PhantomJS with R

两盒软妹~` submitted on 2019-12-11 07:15:49
Question: I am trying to scrape data from a web page. Since the page has dynamic content, I used PhantomJS to handle it. But with the code I am using, I can only download the data already visible on the web page; I need to input a date range and then submit it to get all the data I want. Here is the code I used: library(xml2) library(rvest) connection <- "pr.js" writeLines(sprintf("var page=require('webpage').create(); var fs = require('fs'); page.open('%s',function(){ console.log(page.content);//page…
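
A minimal sketch of the general PhantomJS-plus-rvest pattern, assuming phantomjs is on the PATH and using a placeholder URL: render the page with PhantomJS, write the rendered HTML to disk, then parse it with rvest. Filling in the date range and submitting would additionally require page.evaluate() calls inside the PhantomJS script, which are not shown here.

library(rvest)

js <- "
var page = require('webpage').create();
var fs = require('fs');
page.open('http://example.com', function () {
  fs.write('rendered.html', page.content, 'w');  // save the rendered DOM
  phantom.exit();
});
"
writeLines(js, "render.js")
system("phantomjs render.js")

rendered <- read_html("rendered.html")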

scraping an interactive table in R with rvest

青春壹個敷衍的年華 submitted on 2019-12-11 05:18:12
Question: I'm trying to scrape the scrolling table from the following link: http://proximityone.com/cd114_2013_2014.htm. I'm using rvest but am having trouble finding the correct XPath for the table. My current code is as follows: url <- "http://proximityone.com/cd114_2013_2014.htm" table <- gis_data_html %>% html_node(xpath = '//span') %>% html_table() Currently I get the error "no applicable method for 'html_table' applied to an object of class "xml_missing"". Anyone know what I would need to change to…
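
A minimal diagnostic sketch: the xml_missing error means the //span XPath matched nothing, and gis_data_html is never created from url in the code above. A first step is to read the page and list the tables rvest can actually see; if the scrolling table is injected by JavaScript or lives in an iframe, it will not appear here and a headless browser (e.g. RSelenium) would be needed.

library(rvest)

url <- "http://proximityone.com/cd114_2013_2014.htm"
gis_data_html <- read_html(url)

# How many <table> elements does the static HTML actually contain?
gis_data_html %>% html_nodes("table") %>% length()

# Parse whatever static tables exist; fill = TRUE tolerates ragged rows.
tables <- gis_data_html %>% html_nodes("table") %>% html_table(fill = TRUE)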

Getting links while web scraping Google in R

旧时模样 submitted on 2019-12-11 04:28:43
Question: I am trying to get the links returned by a Google search, that is, all of these result links. I have done this kind of scraping before, but in this case I do not understand why it doesn't work, so I ran the following lines: library(rvest) url<-"https://www.google.es/search?q=Ediciones+Peña+sl+telefono" content_request<-read_html(url) content_request %>% html_nodes(".r") %>% html_attr("href") I have tried other nodes and obtain similar answers: content_request %>% html_nodes(".LC20lb") %>% html_attr("href")…
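
A minimal sketch of one likely explanation and workaround: the plain-HTML page Google serves to non-JavaScript clients typically does not use the ".r" or ".LC20lb" classes seen in the browser, and result hrefs are wrapped as "/url?q=...". (Note that scraping Google results may be against its terms of service.)

library(rvest)

url  <- "https://www.google.es/search?q=Ediciones+Peña+sl+telefono"
page <- read_html(url)

# Grab every href, then keep only the redirect-wrapped result links.
links <- page %>% html_nodes("a") %>% html_attr("href")
result_links <- grep("^/url\\?q=", links, value = TRUE)

# Strip the "/url?q=" wrapper and any trailing tracking parameters.
result_links <- sub("^/url\\?q=([^&]*).*", "\\1", result_links)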

Adding whitespace to text elements

余生长醉 submitted on 2019-12-11 03:54:49
Question: Is there a way to add whitespace to each element that contains text? For this example: movie <- read_html("http://www.imdb.com/title/tt1490017/") cast <- html_nodes(movie, "#titleCast span.itemprop") cast %>% html_structure() [[1]] <span.itemprop [itemprop]> {text} [[2]] <span.itemprop [itemprop]> {text} I would like to add a trailing whitespace to each text element before using html_text(). I have another use case where I want to use html_text() higher up in the document hierarchy. The…
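
A minimal sketch of one way to do this with xml2's text setter: append a space to every text node under the selected elements, so that a later html_text() call higher up in the tree keeps the words separated.

library(rvest)
library(xml2)

movie <- read_html("http://www.imdb.com/title/tt1490017/")
cast  <- html_nodes(movie, "#titleCast span.itemprop")

# Find the text nodes inside the cast elements and add a trailing space to each.
text_nodes <- xml_find_all(cast, ".//text()")
for (i in seq_along(text_nodes)) {
  xml_set_text(text_nodes[[i]], paste0(xml_text(text_nodes[[i]]), " "))
}

html_text(cast)

Newer versions of rvest also offer html_text2(), which inserts whitespace between elements automatically and may remove the need for this workaround.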

rvest: “unknown field names” when attempting to set form

﹥>﹥吖頭↗ submitted on 2019-12-11 03:32:56
Question: I'm attempting to work through a web form so I can scrape data. library(rvest) url <- "https://iemweb.biz.uiowa.edu/pricehistory/pricehistory_SelectContract.cfm?market_ID=214" pg.form <- html_form(html(url)) which returns: pg.form [[1]] <form> '<unnamed>' (POST PriceHistory_GetData.cfm) <input HIDDEN> 'Market_ID': 214 <select> 'Month' [1/12] <select> 'Year' [0/2] <input SUBMIT> '': Get Prices My mistake is to think that I need to set values for the Month and Year fields, but this is a…
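
A minimal sketch of one workaround, assuming the older rvest API shown in the question: if set_values() rejects the Month/Year select fields, their values can be assigned directly on the parsed form object before submitting (the values "8" and "2014" are placeholders; inspect the form to see which options it actually accepts).

library(rvest)

url       <- "https://iemweb.biz.uiowa.edu/pricehistory/pricehistory_SelectContract.cfm?market_ID=214"
pgsession <- html_session(url)
pg.form   <- html_form(pgsession)[[1]]

# Set the <select> fields by assigning their values directly (hypothetical values).
pg.form$fields$Month$value <- "8"
pg.form$fields$Year$value  <- "2014"

result <- submit_form(pgsession, pg.form)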

how to reuse a session to avoid repeated login when scraping with rvest?

不羁的心 submitted on 2019-12-11 02:08:40
Question: I developed some code to scrape traffic data based on this topic. I need to scrape many pages after logging in, but right now my code seems to log in to the site again for every URL. How can I reuse the session to avoid repeated logins so that, hopefully, the code runs faster? Here's the pseudo-code: generateURL <- function(siteID){return siteURL} scrapeContent <- function(siteURL, session, filled_form){return content} mainPageURL <- 'http://pems.dot.ca.gov/' pgsession <- html_session…
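
A minimal sketch of the usual fix, assuming the older rvest session API from the question (html_session/jump_to) and hypothetical login field names: log in once, then visit every data URL with jump_to() on that same session so the authentication cookies are reused instead of logging in again per URL.

library(rvest)

mainPageURL <- 'http://pems.dot.ca.gov/'
pgsession   <- html_session(mainPageURL)
pgform      <- html_form(pgsession)[[1]]

# Field names here are placeholders; match them to the real login form.
filled_form <- set_values(pgform, username = "user", password = "pass")
pgsession   <- submit_form(pgsession, filled_form)   # log in once

# for (id in siteIDs) {
#   page    <- jump_to(pgsession, generateURL(id))   # reuses the logged-in session
#   content <- page %>% html_nodes("table") %>% html_table()
# }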