rvest

rvest: how to select a specific CSS node by id

别说谁变了你拦得住时间么 submitted on 2019-12-03 06:26:29
I'm trying to use the rvest package to scrape data from a web page. In simplified form, the HTML looks like this:

```html
<div class="style">
  <input id="a" value="123">
  <input id="b">
</div>
```

I want to get the value 123 from the first input. I tried the following R code:

```r
library(rvest)
url <- "xxx"
output <- html_nodes(url, ".style input")
```

This returns a list of input tags:

```
[[1]]
<input id="a" value="123">

[[2]]
<input id="b">
```

Next I tried using html_node to reference the first input tag by id:

```r
html_node(output, "#a")
```

Here it returned a list of NULLs instead of the input tag I want:

```
[[1]]
NULL

[[2
```
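The NULLs appear because html_node() searches the descendants of each node in the list, and the input elements have no children. A minimal sketch (using an inline copy of the question's HTML, since the real URL is not shown) that selects the input by id directly from the parsed document and reads its value attribute:

```r
library(rvest)

# Inline copy of the question's HTML; a real page would be loaded with read_html(url)
page <- read_html('<div class="style"><input id="a" value="123"><input id="b"></div>')

# Select by id from the document, then read the attribute
page %>% html_node("#a") %>% html_attr("value")
#> [1] "123"
```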

R Change IP Address programmatically

感情迁移 submitted on 2019-12-02 21:30:26
Currently I change the user agent by passing different strings to the html_session() method. Is there also a way to change your IP address on a timer when scraping a website?

You can use a proxy (which changes your IP) via use_proxy as follows:

```r
html_session("your-url", use_proxy("proxy-ip", port))
```

For more details see ?httr::use_proxy. To check whether it is working you can do the following:

```r
require(httr)
content(GET("https://ifconfig.co/json"), "parsed")
content(GET("https://ifconfig.co/json", use_proxy("138.201.63.123", 31288)), "parsed")
```

The first call returns your own IP. The second call should return the IP of the proxy.
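To change the IP on a timer, one hedged sketch is to cycle through a list of proxies between requests; the proxy addresses and the urls vector below are placeholders, not real endpoints:

```r
library(httr)

urls <- c("https://example.com/page1", "https://example.com/page2")  # placeholder targets
proxies <- list(                      # placeholder proxies; substitute working ones
  use_proxy("10.0.0.1", 8080),
  use_proxy("10.0.0.2", 3128)
)

for (i in seq_along(urls)) {
  # Rotate through the proxy list, one proxy per request
  resp <- GET(urls[i], proxies[[(i - 1) %% length(proxies) + 1]])
  Sys.sleep(10)  # pause between requests, acting as the timer
}
```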

scrape multiple linked HTML tables in R and rvest

偶尔善良 submitted on 2019-12-02 21:04:49
This article http://www.ajnr.org/content/30/7/1402.full contains four links to HTML tables which I would like to scrape with rvest. With the CSS selector "#T1 a" it is possible to get to the first table like this:

```r
library("rvest")
html_session("http://www.ajnr.org/content/30/7/1402.full") %>%
  follow_link(css = "#T1 a") %>%
  html_table() %>%
  View()
```

The CSS selector ".table-inline li:nth-child(1) a" makes it possible to select all four HTML nodes containing the tags linking to the four tables:

```r
library("rvest")
html("http://www.ajnr.org/content/30/7/1402.full") %>%
  html_nodes(css = ".table
```
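Assuming the four table links use the ids #T1 through #T4 (consistent with the #T1 selector above), a hedged sketch that follows each link in turn and collects the four tables in a list:

```r
library(rvest)

session <- html_session("http://www.ajnr.org/content/30/7/1402.full")

tables <- lapply(1:4, function(i) {
  session %>%
    follow_link(css = paste0("#T", i, " a")) %>%  # follow the i-th table link
    html_table() %>%
    .[[1]]                                        # keep the first table on the linked page
})
```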

R: Using rvest package instead of XML package to get links from URL

谁说胖子不能爱 submitted on 2019-12-02 21:01:01
I use the XML package to get the links from this URL.

```r
# Parse HTML URL
v1WebParse <- htmlParse(v1URL)

# Read links and get the quotes of the companies from the href
t1Links <- data.frame(xpathSApply(v1WebParse, '//a', xmlGetAttr, 'href'))
```

While this method is very efficient, I've used rvest and it seems faster at parsing a web page than XML. I tried html_nodes and html_attrs but I can't get it to work.

Despite my comment, here's how you can do it with rvest. Note that we need to read in the page with htmlParse first since the site has the content-type set to text/plain for that file and that tosses
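A hedged sketch of the rvest equivalent of the XML code above; v1URL is assumed to hold the page address, as in the question, and the pattern applies directly when the server returns ordinary HTML:

```r
library(rvest)

v1URL <- "https://example.com"   # placeholder for the question's URL
page  <- read_html(v1URL)

# Same result as xpathSApply(v1WebParse, '//a', xmlGetAttr, 'href')
t1Links <- page %>% html_nodes("a") %>% html_attr("href")
```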

Empty nodes when scraping links with rvest in R

老子叫甜甜 submitted on 2019-12-02 13:23:18
My goal is to get links to all challenges on Kaggle together with their titles. I am using the rvest library for it but I do not seem to get far. The nodes are empty once I am a few divs in. I am trying to do it for the first challenge first and should be able to transfer that to every entry afterwards. The XPath of the first entry is:

```
/html/body/div[1]/div[2]/div/div/div[2]/div/div/div[2]/div[2]/div/div/div[2]/div/div/div[1]/a
```

My idea was to get the link via html_attr( , "href") once I am in the right tag. My idea is:

```r
library(rvest)
url = "https://www.kaggle.com/competitions"
kaggle_html = read
```
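Empty nodes deep in the tree are usually a sign that the page builds its content with JavaScript, which rvest does not execute; read_html() only sees the static HTML the server sends. A hedged sketch of a quick check; if the anchors printed here lack the competition entries, a headless browser such as RSelenium is needed instead:

```r
library(rvest)

page <- read_html("https://www.kaggle.com/competitions")

# Inspect what rvest actually received; JavaScript-rendered entries
# will be absent from this static document
anchors <- page %>% html_nodes("a") %>% html_attr("href")
head(anchors)
```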

Css selector issue with rvest and NHL statistics

那年仲夏 submitted on 2019-12-02 10:45:17
I want to scrape data from hockey-reference.com, specifically from this link: https://www.hockey-reference.com/leagues/NHL_1991.html

I want the 4th table, called "Team Statistics", and I also want to drop the first and last rows (but that can wait for another time). Initially I want to get the scrape working with the 1991 link, but eventually I want to scrape every link from 1991 to 2017.

```r
library(tidyverse)
library(rvest)
stat_urls <- "https://www.hockey-reference.com/leagues/NHL_1991.html"
```

Right now I have the 1991 link only, for simplicity. I cannot seem to find the correct CSS selection
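One hedged sketch of a likely cause: on hockey-reference pages the later tables (including "Team Statistics") are typically embedded inside HTML comments, so CSS selectors cannot reach them until the comment text is re-parsed as HTML:

```r
library(rvest)

page <- read_html("https://www.hockey-reference.com/leagues/NHL_1991.html")

# Pull the comment nodes and re-parse their text; tables hidden inside
# comments become visible to html_table() this way
comment_text <- page %>% html_nodes(xpath = "//comment()") %>% html_text()

hidden_tables <- unlist(
  lapply(comment_text, function(x) {
    tryCatch(read_html(x) %>% html_table(fill = TRUE), error = function(e) NULL)
  }),
  recursive = FALSE
)
```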

encoding error with read_html

旧时模样 submitted on 2019-12-02 08:14:06
I am trying to web scrape a page. I thought of using the rvest package. However, I'm stuck at the first step, which is using read_html to read the content. Here is my code:

```r
library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
obra_caridade <- read_html(url, encoding = "ISO-8895-1")
```

And I got the following error:

```
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
  Input is not proper UTF-8, indicate encoding !
Bytes: 0xE3 0x6F 0x20 0x65 [9]
```

I tried what similar questions had as answers, but it did not solve my issue:

```r
obra
```
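One likely cause, sketched below: "ISO-8895-1" is not a valid encoding name (note the 8895), so the hint is effectively ignored and the parser falls back to treating the bytes as UTF-8. The Latin-1 charset used by many Brazilian government sites is spelled "ISO-8859-1":

```r
library(rvest)

url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"

# "ISO-8859-1" (Latin-1), not "ISO-8895-1" -- with the typo, libxml2
# assumes UTF-8 and raises the "Input is not proper UTF-8" error
obra_caridade <- read_html(url, encoding = "ISO-8859-1")
```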

Scrape values from HTML select/option tags in R

让人想犯罪 __ submitted on 2019-12-02 06:33:08
Question: I'm trying (fairly unsuccessfully) to scrape some data from a website (www.majidata.co.ke) using R. I've managed to scrape the HTML and parse it, but am now a little unsure how to extract the bits I actually need. Using the XML library I scrape my data with this code:

```r
majidata_get <- GET("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
majidata_html <- htmlTreeParse(content(majidata_get, as="text"))
```

This leaves me with a (large) XMLDocumentContent. There is a drop-down list on the webpage
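A hedged rvest sketch for pulling the drop-down entries, assuming the list is a standard select element with option tags (the selector is a guess, since the page markup is not shown in full):

```r
library(rvest)

page <- read_html("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")

# Each <option> carries a value attribute and a visible label
opts <- page %>% html_nodes("select option")

data.frame(
  value = opts %>% html_attr("value"),
  label = opts %>% html_text(),
  stringsAsFactors = FALSE
)
```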

rvest: Return NAs for empty nodes given multiple listings

二次信任 submitted on 2019-12-02 06:07:44
I am fairly new to R (and to using it for web scraping in particular), so any help is greatly appreciated. I am currently trying to mine a webpage that contains multiple ticket listings and lists additional details for some of them (like a ticket having an impaired view or being for children only). I want to extract this data, leaving blank spaces or NAs for the ticket listings that do not contain these details.

Since the original website requires the use of RSelenium, I have tried to replicate the HTML in a simpler form. If any information is missing, please let me know and I will try to
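The usual fix, sketched below with made-up markup (the class names are placeholders, since the real HTML is not shown): select the listing containers with html_nodes(), then call the singular html_node() on each container. Unlike html_nodes(), html_node() returns a missing node where the selector does not match, which html_text() turns into NA, keeping every listing aligned:

```r
library(rvest)

# Placeholder HTML standing in for the real ticket page
html <- '
  <div class="listing"><span class="price">10</span><span class="detail">Impaired view</span></div>
  <div class="listing"><span class="price">12</span></div>'

listings <- read_html(html) %>% html_nodes(".listing")

data.frame(
  price  = listings %>% html_node(".price")  %>% html_text(),
  detail = listings %>% html_node(".detail") %>% html_text(),  # NA where absent
  stringsAsFactors = FALSE
)
```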