rvest

Rvest: Scrape multiple URLs

Submitted by 余生长醉 on 2019-11-27 22:40:53
Question: I am trying to scrape some IMDb data by looping through a list of URLs. Unfortunately my output isn't exactly what I hoped for, never mind storing it in a data frame. I get the URLs with

library(rvest)
topmovies <- read_html("http://www.imdb.com/chart/top")
links <- topmovies %>%
  html_nodes(".titleColumn") %>%
  html_nodes("a") %>%
  html_attr("href")
links_full <- paste("http://imdb.com", links, sep = "")
links_full_test <- links_full[1:10]

and then I could get the content with lapply(links_full_test, . %>% read_html()) …
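A minimal sketch of one way to complete the loop, assuming each film page carries its title in an h1 node (a selector not taken from the original post): fetch each URL, pull out the fields of interest, and row-bind the results into a data frame.

library(rvest)

scrape_film <- function(url) {
  Sys.sleep(1)  # be polite between requests
  pg <- read_html(url)
  data.frame(
    title = pg %>% html_node("h1") %>% html_text(trim = TRUE),
    url = url,
    stringsAsFactors = FALSE
  )
}

films <- do.call(rbind, lapply(links_full_test, scrape_film))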

R web scraping across multiple pages

Submitted by 孤人 on 2019-11-27 19:33:35
I am working on a web-scraping program to search for specific wines and return a list of local wines of that variety. The problem I am having is multi-page results. The code below is a basic example of what I am working with:

url2 <- "http://www.winemag.com/?s=washington+merlot&search_type=reviews"
htmlpage2 <- read_html(url2)
names2 <- html_nodes(htmlpage2, ".review-listing .title")
Wines2 <- html_text(names2)

For this specific search there are 39 pages of results. I know the URL changes to http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=2, but is there an easy way to …
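One common pattern (a sketch, not taken from the original post) is to build all of the paged URLs up front and loop over them, concatenating the results; the page count of 39 comes from the question.

library(rvest)

pages <- paste0("http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=", 1:39)

wines <- unlist(lapply(pages, function(u) {
  Sys.sleep(1)  # be polite between requests
  read_html(u) %>%
    html_nodes(".review-listing .title") %>%
    html_text()
}))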

Submit form with no submit button in rvest

Submitted by 被刻印的时光 ゝ on 2019-11-27 15:36:39
I'm trying to write a crawler to download some information, similar to this Stack Overflow post. The answer there is useful for creating the filled-in form, but I'm struggling to find a way to submit the form when a submit button is not part of the form. Here is an example:

session <- html_session("www.chase.com")
form <- html_form(session)[[3]]
filledform <- set_values(form, `user_name` = user_name, `usr_password` = usr_password)
session <- submit_form(session, filledform)

At this point, I receive this error:

Error in names(submits)[[1]] : subscript out of bounds

How can I make this form submit? …
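submit_form() raises that error because it looks for an input of type "submit" in the form and finds none. A workaround that has circulated for the rvest of that era (a sketch, not guaranteed against the current API; inspect str(form$fields) and mimic an existing field if the layout differs) is to splice a fake submit button into the parsed form before submitting:

# build a fake submit button matching the structure html_form() produces
fake_submit <- list(name = "submit", type = "submit", value = NULL,
                    checked = NULL, disabled = NULL,
                    readonly = NULL, required = FALSE)
attr(fake_submit, "class") <- "input"

filledform$fields[["submit"]] <- fake_submit
session <- submit_form(session, filledform)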

How to scrape tables inside a comment tag in html with R?

Submitted by 半城伤御伤魂 on 2019-11-27 14:55:16
I am trying to scrape from http://www.basketball-reference.com/teams/CHI/2015.html using rvest. I used SelectorGadget and found the tag to be #advanced for the table I want, but rvest wasn't picking it up. Looking at the page source, I saw that the tables are inside an HTML comment tag (<!--). What is the best way to get the tables from inside the comment tags? Thanks! Edit: I am trying to pull the 'Advanced' table: http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none OK, got it:

library(stringi)
library(knitr)
library(rvest)
any_version_html <- function(x){ XML: …
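A more compact route to the same result (a sketch, assuming the table's id really is "advanced" as SelectorGadget suggested): extract the comment nodes, re-parse their contents as HTML, and read the table from that second document.

library(rvest)

pg <- read_html("http://www.basketball-reference.com/teams/CHI/2015.html")

advanced <- pg %>%
  html_nodes(xpath = "//comment()") %>%  # every HTML comment on the page
  html_text() %>%                        # the comments' contents as text
  paste(collapse = "") %>%               # stitched into one document
  read_html() %>%                        # re-parsed as real HTML
  html_node("table#advanced") %>%
  html_table()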

loop across multiple urls in r with rvest [duplicate]

Submitted by 北战南征 on 2019-11-27 12:37:54
Question: This question already has an answer here: Harvest (rvest) multiple HTML pages from a list of urls (1 answer). Closed 3 years ago. I have a series of 9 URLs that I would like to scrape data from: http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp= …
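The pattern from the linked duplicate, sketched here under the assumption that the nine links (how they differ is cut off above) are collected into a character vector urls, and that the data of interest is the first table on each page:

library(rvest)

# urls <- c(...)  # the nine draft-finder links from the question

tables <- lapply(urls, function(u) {
  read_html(u) %>%
    html_node("table") %>%
    html_table(fill = TRUE)
})

draft <- do.call(rbind, tables)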

R - How to make a click on webpage using rvest or rcurl

Submitted by 自闭症网瘾萝莉.ら on 2019-11-27 11:57:51
I want to download data from this webpage. The data can be scraped easily with rvest; the code might look like this:

library(rvest)
library(pipeR)
url <- "http://www.tradingeconomics.com/"
css <- "#ctl00_ContentPlaceHolder1_defaultUC1_CurrencyMatrixAllCountries1_GridView1"
data <- url %>>% html() %>>% html_nodes(css) %>>% html_table()

But there is a problem with webpages like this one. There is a + button to show the data for all countries, but the default view shows only 50 countries, so the code above scrapes only those 50. The + button is implemented in JavaScript, so I want to …
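rvest cannot execute JavaScript, so the click needs a real browser. One option (not from the original post) is RSelenium: drive a browser, click the button, then hand the rendered source back to rvest. The selector for the + button below is a placeholder; inspect the page to find the real one.

library(RSelenium)
library(rvest)

rD <- rsDriver(browser = "firefox", verbose = FALSE)  # starts a local Selenium server
remDr <- rD$client
remDr$navigate("http://www.tradingeconomics.com/")

btn <- remDr$findElement(using = "css selector", "#expand-button")  # hypothetical selector
btn$clickElement()
Sys.sleep(2)  # give the expanded table time to render

page <- read_html(remDr$getPageSource()[[1]])
data <- page %>%
  html_nodes("#ctl00_ContentPlaceHolder1_defaultUC1_CurrencyMatrixAllCountries1_GridView1") %>%
  html_table()

remDr$close()
rD$server$stop()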

Scraping with rvest - complete with NAs when tag is not present

Submitted by 人走茶凉 on 2019-11-27 09:09:54
I want to parse this HTML and extract two elements from it: (a) the p tag with class "normal_encontrado", and (b) the div with class "price". Sometimes the p tag is not present for a product; when that happens, an NA should be added to the vector collecting the text from these nodes. The idea is to end up with two vectors of the same length and then join them into a data.frame. Any ideas? The HTML part:

<html>
<head></head>
<body>
<div class="product_price" id="product_price_186251">
<p class="normal_encontrado">
S/. 2,799.00
</p>
<div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class= …
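A sketch that keeps the two vectors aligned: select the per-product containers first, then call html_node() (singular) within each one. html_node() returns a missing node where there is no match, and html_text() turns that into NA, so the lengths stay equal. The html_string variable is assumed to hold the snippet above.

library(rvest)

doc <- read_html(html_string)

products <- doc %>% html_nodes("div.product_price")

normal <- products %>% html_node("p.normal_encontrado") %>% html_text(trim = TRUE)
price  <- products %>% html_node("div.price") %>% html_text(trim = TRUE)

df <- data.frame(normal, price, stringsAsFactors = FALSE)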

Inputting NA where there are missing values when scraping with rvest

Submitted by 不打扰是莪最后的温柔 on 2019-11-27 08:33:09
Question: I want to use rvest to scrape a page which has the titles and run times of talks at a recent conference, and then combine the values into a tibble.

library(tibble)
library(rvest)

url <- "https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference?sort=status&direction=desc&page=14"
page <- read_html(url)
title <- page %>% html_nodes("h3 a") %>% html_text()
length <- page %>% html_nodes(".tile .caption") %>% html_text()
df <- tibble(title, length)

If you look at the page …
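The usual fix when some talks lack a caption (a sketch, assuming each talk sits in a .tile container as the selectors above suggest): select the tiles first, then use html_node() within each tile so a missing caption becomes NA instead of silently shortening the vector.

library(tibble)
library(rvest)

tiles <- page %>% html_nodes(".tile")

df <- tibble(
  title  = tiles %>% html_node("h3 a") %>% html_text(trim = TRUE),
  length = tiles %>% html_node(".caption") %>% html_text(trim = TRUE)
)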

Equivalent of which in scraping?

Submitted by 拜拜、爱过 on 2019-11-27 08:23:33
Question: I'm trying to run some scraping where the action I take on a node is conditional on the contents of the node. This should be a minimal example:

XML <- '<td class="id-tag">
<span title="Really Long Text">Really L...</span>
</td>
<td class="id-tag">Short</td>'
page <- read_html(XML)

Basically, I want to extract html_attr(x, "title") if a <span> exists, and otherwise just get html_text(x). Code to do the first is:

page %>% html_nodes(xpath = '//td[@class="id-tag"]/span') %>% html_attr("title")
# [1] …
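One way to express the conditional per cell (a sketch): map over the <td> nodes and fall back to the cell text whenever html_attr() on the (possibly missing) span comes back NA.

library(rvest)

tds <- page %>% html_nodes(xpath = '//td[@class="id-tag"]')

vals <- vapply(tds, function(td) {
  title <- td %>% html_node("span") %>% html_attr("title")
  if (is.na(title)) html_text(td, trim = TRUE) else title
}, character(1))

vals
# expected: "Really Long Text" "Short"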

Using rvest or httr to log in to non-standard forms on a webpage

Submitted by 痴心易碎 on 2019-11-27 06:59:58
I am attempting to use rvest to spider a webpage that requires an email/password login on a form.

rm(list = ls())
library(rvest)

### Trying to sign into a form using email/password
url <- "http://www.perfectgame.org/"   ## page to spider
pgsession <- html_session(url)         ## create session
pgform <- html_form(pgsession)[[1]]    ## pull form from session
set_values(pgform, `ctl00$Header2$HeaderTop1$tbUsername` = "myemail@gmail.com")
set_values(pgform, `ctl00$Header2$HeaderTop1$tbPassword` = "mypassword")
submit_form(pgsession, pgform, submit = `ctl00$Header2$HeaderTop1$Button1`)

This gives me the following error: …
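Two problems are commonly pointed out in this snippet (the fix below is a sketch, not verified against the live site): set_values() returns a modified copy that must be reassigned, and submit_form()'s submit argument expects the button's name as a string.

library(rvest)

pgsession <- html_session("http://www.perfectgame.org/")
pgform <- html_form(pgsession)[[1]]

pgform <- set_values(pgform,
  `ctl00$Header2$HeaderTop1$tbUsername` = "myemail@gmail.com",
  `ctl00$Header2$HeaderTop1$tbPassword` = "mypassword")

pgsession <- submit_form(pgsession, pgform,
  submit = "ctl00$Header2$HeaderTop1$Button1")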