rvest

How to scrape a table with rvest and xpath?

Submitted by 爱⌒轻易说出口 on 2019-12-06 22:48:40
Question: Following the documentation, I have been trying to scrape a series of tables from marketwatch.com; here is the one targeted by the code below. The link and XPath are already included in the code:

    url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
    valuation <- url %>%
      html() %>%
      html_nodes(xpath='//*[@id="maincontent"]/div[2]/div[1]') %>%
      html_table()
    valuation <- valuation[[1]]

I get the following error: Warning message: 'html' is deprecated. Use 'read_html' instead. See
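The warning points at the immediate fix: html() was deprecated in favour of read_html(). A minimal sketch of the same scrape with the current function, assuming the XPath still matches the live page (MarketWatch has likely restructured since the question was asked):

    library(rvest)

    url <- "http://www.marketwatch.com/investing/stock/IRS/profile"

    # read_html() replaces the deprecated html(); the rest of the
    # pipeline is unchanged. html_table() returns a list, one element
    # per table found under the selected node.
    valuation <- url %>%
      read_html() %>%
      html_nodes(xpath = '//*[@id="maincontent"]/div[2]/div[1]') %>%
      html_table()

    valuation <- valuation[[1]]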

How do I close unused connections after read_html in R

Submitted by 六眼飞鱼酱① on 2019-12-06 20:24:22
Question: I am quite new to R and am trying to access some information on the internet, but am having problems with connections that don't seem to be closing. I would really appreciate it if someone here could give me some advice... Originally I wanted to use the WebChem package, which in theory delivers everything I want, but when some of the output data is missing from the webpage, WebChem doesn't return any data from that page. To get around this, I have taken most of the code from the package
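Not the package's own code, but a common workaround sketch: fetch pages with httr, which manages its own connection pool, and hand the response text to read_html(). The helper name fetch_page is hypothetical; showConnections() and closeAllConnections() from base R can be used to inspect and force-close anything left open.

    library(rvest)
    library(httr)

    # Hypothetical helper: download with httr, then parse the body text,
    # so read_html() never opens its own connection to the URL.
    fetch_page <- function(url) {
      resp <- GET(url)
      stop_for_status(resp)
      read_html(content(resp, as = "text", encoding = "UTF-8"))
    }

    page <- fetch_page("https://www.r-bloggers.com")

    # If stray connections still accumulate, inspect and close them:
    showConnections(all = TRUE)
    closeAllConnections()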

rvest vs RSelenium results for text extracting

Submitted by 梦想与她 on 2019-12-06 15:04:28
Question: So far I am using RSelenium to extract the text of a homepage, but I would like to switch to a faster solution like rvest.

    library(rvest)
    url <- 'https://www.r-bloggers.com'
    rvestResults <- read_html(url) %>%
      html_node('body') %>%
      html_text()

    library(RSelenium)
    remDr$navigate(url)
    rSelResults <- remDr$findElement(
      using = "xpath", value = "//body"
    )$getElementText()

Comparing the results shows that the rvest output includes some JavaScript code, while the RSelenium output is much "cleaner". I am aware of
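The extra text rvest returns is usually the contents of <script> (and <style>) tags, which a real browser executes rather than displays. A minimal sketch of one way to get closer to RSelenium's output, by dropping those nodes with xml2 before extracting the text:

    library(rvest)
    library(xml2)

    url <- "https://www.r-bloggers.com"
    page <- read_html(url)

    # Drop script and style nodes so their raw contents don't end up in
    # the extracted text, roughly matching what getElementText() returns.
    xml_remove(xml_find_all(page, "//script | //style"))

    rvestClean <- page %>%
      html_node("body") %>%
      html_text()

Note this only strips script source that is present in the static HTML; it won't recover content that JavaScript adds after the page loads, which only a real browser like RSelenium can render.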

How to convert an HTML R object to character?

Submitted by 随声附和 on 2019-12-06 13:36:58
Here's my reproducible example:

    library(rvest)
    page <- html("http://google.com")
    class(page)
    page

    > as.character(page)
    Error in as.vector(x, "character") :
      cannot coerce type 'externalptr' to vector of type 'character'

How can I convert page from an html class to a character vector so I can store it somewhere? The html functions like html_text or html_attr don't give me the whole source. I would like to store it so I can later re-load it with html(). Thanks.

Answer: To save directly to a text file:

    capture.output(page, file="file.html")

To store as a string:

    htmltxt <- paste(capture.output(page, file
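For what it's worth, newer versions of rvest are backed by xml2, where this conversion is built in. A minimal sketch, assuming a current xml2/rvest installation:

    library(rvest)
    library(xml2)

    page <- read_html("http://google.com")

    # xml2 can serialize the parsed document directly:
    htmltxt <- as.character(page)      # the full source as one string
    write_html(page, "page.html")      # or write it straight to disk

    # Reload later with read_html() instead of the deprecated html():
    page2 <- read_html("page.html")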

Rvest scraping errors

Submitted by 只谈情不闲聊 on 2019-12-06 13:28:02
Question: Here's the code I'm running:

    library(rvest)
    rootUri <- "https://github.com/rails/rails/pull/"
    PR <- as.list(c(100, 200, 300))
    list <- paste0(rootUri, PR)
    messages <- lapply(list, function(l) { html(l) })

Up until this point it seems to work fine, but when I try to extract the text with html_text(messages) I get:

    Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) :
      Unknown input of class: list

Trying to extract a specific element with html_text(messages[1]) doesn't work either... Error in
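Two small changes fix this: use read_html() instead of the deprecated html(), and remember that messages is a plain list, so html_text() has to be applied element-wise. A single element is taken with [[ ]], not [ ]. A minimal sketch:

    library(rvest)

    rootUri <- "https://github.com/rails/rails/pull/"
    PR <- c(100, 200, 300)
    urls <- paste0(rootUri, PR)

    # Each element of messages is one parsed document.
    messages <- lapply(urls, read_html)

    # html_text() works on a single document, so map it over the list:
    texts <- lapply(messages, html_text)

    # [[1]] returns the document itself; [1] returns a one-element list,
    # which is what triggered the "Unknown input of class: list" error.
    html_text(messages[[1]])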

rvest table scraping including links

Submitted by 折月煮酒 on 2019-12-06 11:53:31
I would like to scrape some table data from Wikipedia. Some of the table columns include links to other articles that I'd like to preserve. I've tried this approach, which didn't preserve the URLs. Looking at the html_table() function documentation, I didn't find any option for including them. Is there another package or way to do this?

    library("rvest")
    url <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
    simp <- url %>%
      html() %>%
      html_nodes(xpath='//*[@id="mw-content-text"]/table[3]') %>%
      html_table()
    simp <- simp[[1]]

Answer: Try this:

    library(XML)
    library(httr)
    url <- "http://en.wikipedia
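One rvest-only alternative, sketched under the assumption that the table XPath from the question still matches the live page: skip html_table() for the link column and pull the anchors' href attributes directly with html_attr().

    library(rvest)

    url <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
    table_node <- url %>%
      read_html() %>%
      html_node(xpath = '//*[@id="mw-content-text"]/table[3]')

    # html_table() flattens every cell to text, so collect the anchors
    # separately and pair each link's text with its target URL.
    anchors <- html_nodes(table_node, "a")
    links <- data.frame(
      text = html_text(anchors),
      href = html_attr(anchors, "href"),
      stringsAsFactors = FALSE
    )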

web scrape with rvest

Submitted by 自闭症网瘾萝莉.ら on 2019-12-06 09:31:48
I'm trying to grab a table of data using read_html from the R package rvest. I've tried the code below:

    library(rvest)
    raw <- read_html("https://demanda.ree.es/movil/peninsula/demanda/tablas/2016-01-02/2")

I don't believe this pulled the data from the table, since 'raw' is a list of 2: 'node:<externalptr>' and 'doc:<externalptr>'. I've tried grabbing the XPath too:

    html_nodes(raw, xpath = '//*[(@id = "tabla_generacion")]//*[contains(concat( " ", @class, " " ), concat( " ", "ng-scope", " " ))]')

Any advice on what to try next? Thanks.

Answer: This website is using Angular to make a call to get
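Two things are going on here: the two external pointers are simply how read_html() stores a parsed document (that part is normal), and the table itself is filled in by Angular after the page loads, so the static HTML contains no data. A sketch of the usual approach, with a placeholder endpoint; the real request the Angular app makes has to be found in the browser's network tab:

    library(httr)
    library(jsonlite)

    # Placeholder URL, not the site's real data endpoint; find the actual
    # request via the browser developer tools (Network tab).
    endpoint <- "https://demanda.ree.es/path/to/data-endpoint"

    resp <- GET(endpoint)
    stop_for_status(resp)
    data <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))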

rvest: extract tables with url's instead of text

Submitted by 岁酱吖の on 2019-12-06 09:23:10
Question: The tables I would like to scrape have URLs in them. If I run the code, I get only the column with the description of each URL. How can I get a table whose column (in my case the second column) contains the URLs instead of their descriptions, or the full HTML code of each anchor? I need this to extract two index codes from the URLs in the second column of the table. The links that I would like to scrape look like: https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType
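A sketch of one way to do this; the table XPath and the column position are assumptions based on the question, since its full code isn't shown. The idea is to iterate over the rows yourself and read the href attribute of the anchor in the second cell:

    library(rvest)

    page <- read_html(url)  # url as built in the question

    # One row per <tr>; adjust the XPath to the actual table on the page.
    rows <- html_nodes(page, xpath = "//table//tr")

    hrefs <- sapply(rows, function(row) {
      a <- html_node(row, xpath = "td[2]/a")
      if (inherits(a, "xml_missing")) NA_character_ else html_attr(a, "href")
    })

    # The two index codes can then be parsed out of each URL's query string.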

“Error: not compatible with STRSXP” on submit_form with rvest

Submitted by 北慕城南 on 2019-12-06 08:24:09
I've searched around Stack Overflow and GitHub but haven't seen a solution to this one.

    session <- read_html("http://www.whitepages.com")
    form1 <- html_form(session)[[1]]
    form2 <- set_values(form1, who = "john smith")
    submit_form(session, form2)

After the submit_form line, I get the following:

    Submitting with '<unnamed>'
    Error: not compatible with STRSXP

I've pieced together that this error usually comes from mismatched types (strings and numeric, for example), but I can't tell where that might be happening. Any help would be greatly appreciated!

Answer: I just had this problem myself, and I found that the
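The truncated answer aside, two things commonly trip this up: submit_form() expects a session created with html_session(), not a bare read_html() document, and the form's unnamed submit button (the Submitting with '<unnamed>' line) can trigger the STRSXP error. A sketch with both addressed; the submit button name here is hypothetical and has to be read from the form itself:

    library(rvest)

    # html_session() keeps the connection state submit_form() needs.
    session <- html_session("http://www.whitepages.com")
    form <- html_form(session)[[1]]
    form <- set_values(form, who = "john smith")

    # Inspect form$fields to find the real name of the submit button,
    # then pass it explicitly; "search" below is a hypothetical name.
    result <- submit_form(session, form, submit = "search")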

Creating a table by web-scraping using a loop

Submitted by 好久不见. on 2019-12-06 05:54:39
Question: I'm attempting to web-scrape tax-rates.org to get the average tax percentage for each county in Texas. I have a list of 255 counties in a CSV file which I import as "TX_counties"; it's a single-column table. I have to create the URL for each county as a string, so I set d1 to the first cell using [i,1], then concatenate it into a URL string, perform the scrape, then add 1 to i, which moves on to the second cell for the next county name, and the process continues. The problem is I can't figure
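A sketch of the loop described above, with the URL pattern and the CSS selector for the tax figure both marked as assumptions, since the question doesn't show them:

    library(rvest)

    TX_counties <- read.csv("TX_counties.csv", stringsAsFactors = FALSE)

    # Pre-allocate one slot per county so results accumulate in a vector.
    results <- character(nrow(TX_counties))

    for (i in seq_len(nrow(TX_counties))) {
      d1 <- TX_counties[i, 1]

      # Assumed URL pattern; verify against a real county page on the site.
      url <- paste0("http://www.tax-rates.org/texas/",
                    tolower(gsub(" ", "_", d1)), "_county_property_tax")

      page <- read_html(url)

      # Hypothetical selector; replace with the node that holds the rate.
      results[i] <- page %>% html_node("div.tax_rate") %>% html_text()
    }

    tax_table <- data.frame(county = TX_counties[, 1], rate = results,
                            stringsAsFactors = FALSE)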