rvest

R rvest encoding errors with UTF-8

大兔子大兔子 submitted on 2019-12-06 05:18:50
Question: I'm trying to get this table from Wikipedia. The source of the file claims it's UTF-8:

```html
<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8"/>
<title>List of cities in Colombia - Wikipedia, the free encyclopedia</title>
...
```

However, when I try to get the table with rvest, it shows weird characters where there should be accented (standard Spanish) ones like á, é, etc. This is what I attempted:

```r
theurl <- "https://en.wikipedia.org/wiki/List_of_cities
```
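A common fix for mojibake like this is to pin the encoding at parse time: `xml2::read_html()`, which rvest uses, accepts an `encoding` argument that overrides detection. A minimal sketch, assuming `theurl` holds the full article URL:

```r
library(rvest)
# Force UTF-8 when parsing rather than relying on encoding detection;
# `theurl` is assumed to contain the full Wikipedia article URL.
page <- read_html(theurl, encoding = "UTF-8")
tables <- html_table(html_nodes(page, "table"), fill = TRUE)
```

Older rvest versions also shipped `guess_encoding()`/`repair_encoding()` helpers for diagnosing pages whose declared charset is wrong.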

Web scraping with R and rvest

青春壹個敷衍的年華 submitted on 2019-12-05 19:49:59
I am experimenting with rvest to learn web scraping with R. I am trying to replicate the Lego example for a couple of other sections of the page, using SelectorGadget to identify the selectors. I pulled the example from the RStudio tutorial. With the code below, 1 and 2 work, but 3 does not.

```r
library(rvest)
lego_movie <- html("http://www.imdb.com/title/tt1490017/")

# 1 - Get rating
lego_movie %>% html_node("strong span") %>% html_text() %>% as.numeric()

# 2 - Grab actor names
lego_movie %>% html_nodes("#titleCast .itemprop span") %>% html_text()

# 3 - Get Meta Score
lego_movie %>% html_node(".star-box-details a
```
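When a single selector silently fails like case 3, the usual cause is that it matches nothing in the fetched source (IMDb's markup changes often, so SelectorGadget results go stale). A sketch of checking the match before extracting, using the selector from the (truncated) question, which may no longer exist on the live page:

```r
library(rvest)
# read_html() supersedes the deprecated html()
page <- read_html("http://www.imdb.com/title/tt1490017/")
node <- html_node(page, ".star-box-details a")
if (inherits(node, "xml_missing")) {
  # html_node() returns an empty match object when the selector hits nothing
  message("Selector matched nothing - re-derive it with SelectorGadget")
} else {
  html_text(node, trim = TRUE)
}
```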

Downloading a file after login using a https URL

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-05 18:51:31
I am trying to download an Excel file, which I have the link to, but I am required to log in to the page before I can download the file. I have successfully passed the login page with rvest, RCurl, and httr, but I am having an extremely difficult time downloading the file after I have logged in.

```r
url <- "https://website.com/console/login.do"
download_url <- "https://website.com/file.xls"

session <- html_session(url)
form <- html_form(session)[[1]]
filled_form <- set_values(form, userid = user, password = pass)

## Save main page url
main_page <- submit_form(session, filled_form)
download.file
```
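The usual culprit here is that `download.file()` opens a fresh connection carrying none of the session's login cookies. A sketch of staying inside the authenticated session instead, reusing the objects above (`jump_to()` is the old-rvest name; current versions call it `session_jump_to()`):

```r
library(rvest)
library(httr)
# Navigate to the file inside the logged-in session so the auth cookies
# are sent, then write the raw response body to disk.
file_page <- jump_to(main_page, download_url)
writeBin(content(file_page$response, as = "raw"), "file.xls")
```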

R: scraping additional data after POST only works for first page

不打扰是莪最后的温柔 submitted on 2019-12-05 13:47:45
For a university research project I would like to scrape drug information offered by the Swiss government from: http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue= The page does offer a robots.txt file; however, its content is freely available to the public, and I assume that scraping this data is not prohibited. This is an update of this question, since I made some progress. What I achieved so far:

```r
# opens the first results page
# opens the first link as a table at the end of the page
library("rvest")
library("dplyr")
url <- "http://www
```
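ASP.NET pagers like this one typically work by POSTing hidden state fields back to the same URL, and requests for later pages fail unless the `__VIEWSTATE`/`__EVENTVALIDATION` tokens from the *previous* response are re-sent. A sketch of that pattern; the `__EVENTTARGET`/`__EVENTARGUMENT` values are assumptions and must be read out of the page's pager links:

```r
library(rvest)
library(httr)
url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
page1 <- read_html(url)
# Harvest the ASP.NET state tokens from the first response
viewstate  <- html_attr(html_node(page1, "#__VIEWSTATE"), "value")
validation <- html_attr(html_node(page1, "#__EVENTVALIDATION"), "value")
# Post them back, asking the server-side pager for page 2
resp <- POST(url, encode = "form", body = list(
  `__EVENTTARGET`     = "pager",    # placeholder: use the control id from the page
  `__EVENTARGUMENT`   = "Page$2",
  `__VIEWSTATE`       = viewstate,
  `__EVENTVALIDATION` = validation))
page2 <- read_html(content(resp, as = "text"))
```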

How to scrape a table with rvest and xpath?

你说的曾经没有我的故事 submitted on 2019-12-05 02:46:30
Using the following documentation, I have been trying to scrape a series of tables from marketwatch.com. Here is the one represented by the code below; the link and XPath are already included:

```r
url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
valuation <- url %>%
  html() %>%
  html_nodes(xpath='//*[@id="maincontent"]/div[2]/div[1]') %>%
  html_table()
valuation <- valuation[[1]]
```

I get the following error:

```
Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")
```

Thanks in advance. That website doesn't use an html table, so html_table() can't
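Since the figures sit in styled `<div>` elements rather than a real `<table>`, one workaround is to scrape the label and value nodes separately and assemble the data frame by hand. A sketch under that assumption; the class names are guesses to be verified against the live page's markup:

```r
library(rvest)
url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
page <- read_html(url)   # read_html() replaces the deprecated html()
# Pull label and value cells separately, then bind them side by side
labels <- page %>% html_nodes(".section .column .heading") %>% html_text(trim = TRUE)
values <- page %>% html_nodes(".section .column .data")    %>% html_text(trim = TRUE)
valuation <- data.frame(metric = labels, value = values, stringsAsFactors = FALSE)
```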

Scraping html table with span using rvest

↘锁芯ラ submitted on 2019-12-05 00:00:42
Question: I'm using rvest to extract the table on the following page: https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin The following code works:

```r
URL <- 'https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin'
table <- URL %>%
  read_html %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table(trim=TRUE)
```

but the columns of margins and president names have some strange values. The reason is that the source code has the
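Those stray values usually come from hidden `<span>` sort keys that Wikipedia embeds in sortable cells, which `html_table()` concatenates with the visible text. A sketch of one common fix, deleting the hidden spans from the parsed document first (the `display:none` selector is an assumption about how the spans are hidden):

```r
library(rvest)
library(xml2)
URL <- 'https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin'
doc <- read_html(URL)
# Drop hidden sort-key spans before the table is flattened to text
xml_remove(html_nodes(doc, "span[style*='display:none']"))
table <- doc %>% html_nodes("table") %>% .[[2]] %>% html_table(trim = TRUE)
```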

Missing elements when using `read_html` with `rvest` in R

二次信任 submitted on 2019-12-04 20:23:27
I'm trying to use the read_html function in the rvest package, but have come across a problem I am struggling with. For example, if I were trying to read in the bottom table that appears on this page, I would use the following code:

```r
library(rvest)
html_content <- read_html("https://projects.fivethirtyeight.com/2016-election-forecast/washington/#now")
```

By inspecting the HTML code in the browser, I can see that the content I would like is contained in a `<table>` tag (specifically, it is all contained within `<table class="t-calc">`). But when I try to extract this using:

```r
tables <- html_nodes(html
```
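`read_html()` only sees the raw source the server sends; anything the browser builds afterwards with JavaScript (as FiveThirtyEight's calculated tables are) never appears in it, which is why the browser inspector and rvest disagree. A quick check, counting matches in the raw document:

```r
library(rvest)
html_content <- read_html("https://projects.fivethirtyeight.com/2016-election-forecast/washington/#now")
length(html_nodes(html_content, "table.t-calc"))
# A count of 0 while the inspector shows the table means the node is
# JavaScript-generated; a rendering scraper (e.g. RSelenium) or the
# site's underlying JSON data feed is needed instead.
```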

R: Webscraping a list of URLs to get a DataFrame

青春壹個敷衍的年華 submitted on 2019-12-04 19:25:45
I can see the correct data, but cannot put it into a data frame (it appears as a list of elements). I think the problem is my understanding of the apply family of functions. Any hint is welcome. Here is a similar question, but I think it is better to post mine as it contains more details: Webscraping content across multiple pages using rvest package

```r
library(rvest)
library(lubridate)
library(dplyr)
urls <- list("http://simple.ripley.com.pe/tv-y-video/televisores/ver-todo-tv",
             "http://simple.ripley.com.pe/tv-y-video/televisores/ver-todo-tv?page=2&orderBy=seq",
             "http://simple.ripley.com.pe/tv-y-video
```
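The standard shape for this job is a per-page function that returns a data frame, with `lapply()` producing a list of data frames and `dplyr::bind_rows()` stacking them into one. A sketch of the pattern; `scrape_page()` and the `.product-title` selector are hypothetical stand-ins for whatever extraction the real code performs:

```r
library(rvest)
library(dplyr)
# One data frame per page, then row-bind the list into a single frame
scrape_page <- function(u) {
  page <- read_html(u)
  data.frame(
    title = html_text(html_nodes(page, ".product-title"), trim = TRUE),
    page_url = u,
    stringsAsFactors = FALSE
  )
}
result <- bind_rows(lapply(urls, scrape_page))
```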

Rvest scraping errors

て烟熏妆下的殇ゞ submitted on 2019-12-04 18:38:51
Here's the code I'm running:

```r
library(rvest)
rootUri <- "https://github.com/rails/rails/pull/"
PR <- as.list(c(100, 200, 300))
list <- paste0(rootUri, PR)
messages <- lapply(list, function(l) {
  html(l)
})
```

Up until this point it seems to work fine, but when I try to extract the text:

```r
html_text(messages)
```

I get:

```
Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) :
  Unknown input of class: list
```

Trying to extract a specific element:

```r
html_text(messages[1])
```

Can't do that either...

```
Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) :
  Unknown input of class: list
```

So I try a
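The error is about types: `html_text()` expects a single parsed document or node set, while `messages` is a list of documents, and `messages[1]` with single brackets is still a one-element list. A sketch of the two working forms:

```r
library(rvest)
rootUri <- "https://github.com/rails/rails/pull/"
urls <- paste0(rootUri, c(100, 200, 300))
messages <- lapply(urls, read_html)   # read_html() supersedes html()
html_text(messages[[1]])              # [[ ]] extracts the document itself
texts <- lapply(messages, html_text)  # or extract text from every page
```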

How to post within a rvest html_session?

萝らか妹 submitted on 2019-12-04 18:01:54
How can I POST "within" an html session? After I opened a session via

```r
a <- rvest::html_session(url)
```

I tried:

```r
library(httr)
POST(path,
     add_headers(setNames(as.character(headers(a)), names(headers(a)))),
     set_cookies(setNames(cookies(a)$value, cookies(a)$name)),
     body = list(...),
     encode = "json")
```

But this handles my request as if I were not logged in. Any suggestions? I am looking for something like POST(session, path, body, ...). OK, after some digging into it I solved it by using:

```r
x %>% rvest:::request_POST(url,
  config(referer = x$url),
  user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4)
```
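The trick relies on rvest's unexported `request_POST()`, which issues the POST through the session object so its cookie jar and connection handle are reused. A sketch of the full shape (the URL and body are placeholders; the triple colon means this is internal API and may change between rvest versions):

```r
library(rvest)
library(httr)
x <- html_session("https://example.com/login")
# ... submit the login form here so `x` carries the auth cookies ...
x <- rvest:::request_POST(x, "https://example.com/api/endpoint",
                          body = list(key = "value"),
                          encode = "json")
content(x$response)
```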