rvest

Using rvest to scrape data that is not in a table

Submitted by 柔情痞子 on 2020-08-26 10:17:07
Question: I'm trying to scrape some data from a website. I thought I could use rvest, but I'm having a lot of trouble getting data that is not in a table. I don't know if it's possible, or whether I'm using the wrong package. I am trying to get the website, name, and address from the following HTML:

<div class="info clearfix">
  <i class="sprite icon title"></i>
  <p class="title">
    <a target="_blank" href="https://test.com/regions/Tennis_Court.html"> Tennis Court</a>
  </p>
  <p class="location"> 123 Page St,
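One way to approach this with rvest is to target each field with its own CSS selector. A minimal sketch, assuming the fragment above is representative (the selectors div.info, p.title a, and p.location come straight from it; the closing tags are added only so the snippet parses):

library(rvest)

# The fragment from the question, completed with closing tags so it parses;
# in practice you would call read_html() on the page URL instead.
page <- read_html('
<div class="info clearfix">
  <i class="sprite icon title"></i>
  <p class="title">
    <a target="_blank" href="https://test.com/regions/Tennis_Court.html"> Tennis Court</a>
  </p>
  <p class="location"> 123 Page St,</p>
</div>')

info <- html_nodes(page, "div.info")

# One row per listing: link target, link text, and location text.
data.frame(
  website = html_attr(html_node(info, "p.title a"), "href"),
  name    = trimws(html_text(html_node(info, "p.title a"))),
  address = trimws(html_text(html_node(info, "p.location"))),
  stringsAsFactors = FALSE
)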

How to make “html_node” work for this website?

Submitted by 走远了吗. on 2020-08-10 03:38:12
Question: I have an issue web scraping this website. If I try the "conventional" way, it works fine, as in the code below:

base_url <- "https://www.ecb.europa.eu"
year_urls1 <- paste0(base_url, "/press/pressconf/", 2000:2008, "/html/index_include.en.html")

scrape_page <- function(url) {
  Sys.sleep(runif(1))
  html_attr(html_nodes(read_html(url), ".doc-title a"), name = "href")
}

all_pages1 <- lapply(year_urls1, scrape_page)
all_pages1 <- paste0(base_url, unlist(all_pages1))

But now let's assume for x
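The question is cut off, but a common next step with the URLs collected above is to fetch each press-conference page and pull its text. A minimal sketch, assuming all_pages1 from the snippet; selecting every <p> is a deliberate over-approximation, and the real pages likely need a narrower selector:

library(rvest)

# Crude assumption, not confirmed against the site: grab every <p> on a
# press-conference page and collapse it into one string.
scrape_text <- function(url) {
  Sys.sleep(runif(1))  # polite random pause, as in the question's code
  page <- read_html(url)
  paste(html_text(html_nodes(page, "p")), collapse = "\n")
}

texts <- vapply(all_pages1[1:3], scrape_text, character(1))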

Scraping and extracting XML sitemap elements using R and Rvest

Submitted by 六月ゝ 毕业季﹏ on 2020-06-28 05:57:10
Question: I need to extract a large number of XML sitemap elements from multiple XML files using rvest. I have been able to extract html_nodes from web pages using XPaths, but XML files are new to me, and I can't find a Stack Overflow question that shows how to parse an XML file by its address rather than by parsing a large text chunk of XML. Here is an example of what I have used for HTML:

library(dplyr)
library(rvest)

webpage <- "https://www.example.co.uk/"
data <- webpage %>%
  read_html() %>%
  html_nodes("any given
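For the XML side, rvest's companion package xml2 can read a sitemap straight from its address, which is the file-level parsing the question asks about. A minimal sketch, assuming a standard sitemap layout (the URL is a placeholder): stripping the default sitemap namespace first lets plain XPath expressions like //url/loc match.

library(xml2)

# Placeholder sitemap address; substitute the real file.
sitemap <- read_xml("https://www.example.co.uk/sitemap.xml")

# Sitemaps declare a default namespace; dropping it lets //url/loc match.
xml_ns_strip(sitemap)

urls    <- xml_text(xml_find_all(sitemap, "//url/loc"))
lastmod <- xml_text(xml_find_all(sitemap, "//url/lastmod"))

head(urls)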

Scrape Data through RVest

Submitted by 百般思念 on 2020-06-27 05:26:42
Question: I am looking to get the article names by category from https://www.inquirer.net/article-index?d=2020-6-13. I've attempted to read the article names by doing:

library('rvest')

year <- 2020
month <- 06
day <- 13
url <- paste('http://www.inquirer.net/article-index?d=', year, '-', month, '-', day, sep = "")
pg <- read_html(url)
test <- pg %>% html_nodes("#index-wrap") %>% html_text()

This returns only 1 string of all article names and it's very messy. I ultimately would like to have a dataframe that
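A sketch of one way to get something tidier than a single string, borrowing the assumption from the follow-up question below that each article link inside #index-wrap carries rel="bookmark"; mapping titles to their categories would additionally require walking the page's heading structure, which is not shown here:

library(rvest)
library(tibble)

url <- "https://www.inquirer.net/article-index?d=2020-6-13"

div   <- read_html(url) %>% html_node("#index-wrap")
links <- html_nodes(div, xpath = './/a[@rel = "bookmark"]')

# One row per article link: visible text and target URL.
articles <- tibble(
  title = trimws(html_text(links)),
  url   = html_attr(links, "href")
)
articles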

Rvest reading separated article data

Submitted by 折月煮酒 on 2020-06-26 14:15:10
Question: I am looking to scrape article data from inquirer.net. This is a follow-up question to "Scrape Data through RVest". Here is the code that works, based on the answer:

library(rvest)
#> Loading required package: xml2
library(tibble)

year <- 2020
month <- 06
day <- 13
url <- paste0('http://www.inquirer.net/article-index?d=', year, '-', month, '-', day)

div <- read_html(url) %>% html_node(xpath = '//*[@id ="index-wrap"]')
links <- html_nodes(div, xpath = '//a[@rel = "bookmark"]')
post_date <- html
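The snippet breaks off at post_date <- html. A hedged guess at how the pieces might be assembled into a data frame; the post date here is simply rebuilt from the query parameters rather than scraped, since the original selector is not visible:

library(rvest)
library(tibble)

year <- 2020
month <- 06
day <- 13
url <- paste0('http://www.inquirer.net/article-index?d=', year, '-', month, '-', day)

div   <- read_html(url) %>% html_node(xpath = '//*[@id ="index-wrap"]')
links <- html_nodes(div, xpath = '//a[@rel = "bookmark"]')

# Hypothetical continuation: rebuild the post date from the query
# parameters instead of extracting it from the page.
post_date <- as.Date(sprintf("%d-%02d-%02d", year, month, day))

tibble(
  date  = post_date,
  title = trimws(html_text(links)),
  link  = html_attr(links, "href")
)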