rvest | 易学教程

Rvest reading separated article data

Question: I am looking to scrape article data from inquirer.net. This is a follow-up to the question "Scrape Data through RVest". Here is the code that works, based on the answer:

    library(rvest)
    #> Loading required package: xml2
    library(tibble)
    year <- 2020
    month <- 06
    day <- 13
    url <- paste0('http://www.inquirer.net/article-index?d=', year, '-', month, '-', day)
    div <- read_html(url) %>% html_node(xpath = '//*[@id ="index-wrap"]')
    links <- html_nodes(div, xpath = '//a[@rel = "bookmark"]')
    post_date <- html

How do I find html_node on search form?

Question: I have a list of names (first name, last name, and date of birth) that I need to look up on the Fulton County, Georgia (USA) jail website to determine whether each person is in jail or has been released. The website is http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400 The site requires you to enter a last name and a first name, then it gives you a list of results. I have found some Stack Overflow posts that have given me some direction, but I'm still struggling to figure this out. I'm using

Scraping HTML Text from a <dl> Tag

Question: I have a description list that I downloaded from a website with an agenda, and I am trying to turn it into a data.frame, without success. The description list has the following structure:

    <dl>
      <dt> (which contains a <p = "day"> for the day)
      <dd> (which contains a <p = "hour"> for the hour and a <p = "event"> for the event)

I managed to extract this data with the following code:

    library(rvest)
    url <- read_html("www.mypage.com")
    day <- data.frame(day = html_text(html_nodes(url, '.day')))
    hour <- data.frame

How to extract text from several “div class” elements (html) using R?

Question: My goal is to extract information from this HTML page to create a database: https://drive.google.com/folderview?id=0B0aGd85uKFDyOS1XTTc2QnNjRmc&usp=sharing One of the variables is the price of the apartments. I've found that some listings have a div class="row_price" element that contains the price (example A), but others lack this element and therefore the price (example B). Hence I would like R to read the observations without a price as NA; otherwise it will mix up the database by giving
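
One way to get NA for listings that lack a price is to select one parent node per listing first, and only then look up the price inside each parent with html_node() (singular), which returns exactly one result per parent. The sketch below runs on made-up markup: the "listing" and "row_title" class names and the sample values are assumptions for illustration, and only "row_price" comes from the question.

```r
library(rvest)

# Hypothetical markup: only "row_price" is taken from the question;
# "listing", "row_title" and the values are assumptions for illustration.
page <- read_html('
  <div class="listing"><div class="row_title">A</div><div class="row_price">100</div></div>
  <div class="listing"><div class="row_title">B</div></div>
')

# One node per apartment.
listings <- html_nodes(page, ".listing")

# html_node() yields one result per listing; a listing without a match
# produces a missing node, which html_text() turns into NA.
title <- html_text(html_node(listings, ".row_title"))
price <- html_text(html_node(listings, ".row_price"))

apartments <- data.frame(title = title, price = price, stringsAsFactors = FALSE)
print(apartments)
```

Because both columns are derived from the same set of parent nodes, they always have the same length and stay aligned row by row.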

Using rvest with drake: external pointer is not valid error

Question: When I first run the code below, everything is OK. But when I change something in the html_file %>% ... command, for example by commenting out tolower(), I get the following error:

    Error: target title failed.
    diagnose(title)$error$message:
      external pointer is not valid
    diagnose(title)$error$calls:
      1. └─html_file %>% html_nodes("h2") %>% html_text()

Code:

    library(rvest)
    library(drake)
    some_string <- '
    <div class="main">
      <h2>A</h2>
      <div class="route">X</div>
    </div>
    '
    html_file <- read_html(some_string)
    title
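
A likely cause of this error is that a parsed xml2 document is an external pointer to memory owned by the current R session, so it does not survive being saved and reloaded, which is what a cache does between runs. A minimal sketch of one workaround, assuming the document can be stored as plain text and re-parsed when needed (drake itself is not used here, and the temporary file only stands in for a cache):

```r
library(xml2)

doc <- read_html('<div class="main"><h2>A</h2><div class="route">X</div></div>')

# Convert the parsed document to an ordinary character string,
# which can be serialized safely.
doc_text <- as.character(doc)

# Stand-in for a cache round trip (assumption: a temp file instead of drake).
cache_file <- tempfile(fileext = ".rds")
saveRDS(doc_text, cache_file)
restored_text <- readRDS(cache_file)

# Re-parse right before extracting, so the pointer is always fresh.
restored <- read_html(restored_text)
heading <- xml_text(xml_find_first(restored, "//h2"))
print(heading)
```

An alternative with the same effect is to call read_html() and the extraction inside a single step, so that no parsed document is ever stored between steps.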

How do I find all nodes without children (starting from a non-root node!) in XPath/R?

Question: I know how to find all nodes that don't have a child node:

    library(rvest)
    library(magrittr)
    library(httr)  # provides GET() and content()
    doc <- "https://www.r-bloggers.com/" %>% GET %>% content
    leafes <- doc %>% html_nodes(xpath = "//*[not(descendant::*)]")
    length(leafes)

Now I try the same starting from nodes that are not the root node:

    doc <- "https://www.r-bloggers.com/" %>% GET %>% content
    tags <- doc %>% html_nodes(xpath = "/html/body/div/div/div/div/h2/a")
    nonRootNodeWithChildr <- tags %>% html_nodes(xpath = "..") %>% html_nodes(xpath = "..
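
In XPath, an expression that begins with // is evaluated from the document root, whichever node it is applied to; prefixing it with a dot (.//) restricts the search to descendants of the current node. A small sketch on made-up markup (the ids and tags are assumptions for illustration):

```r
library(xml2)

# Hypothetical document with two sibling subtrees.
doc <- read_html('
  <div id="a"><p>one</p><span><b>two</b></span></div>
  <div id="b"><i>three</i></div>
')

# Start from a non-root node.
start <- xml_find_first(doc, "//div[@id='a']")

# ".//" searches only below `start`; not(*) keeps elements with no child elements.
leaves_below_start <- xml_find_all(start, ".//*[not(*)]")

# "//" ignores the starting node and searches the whole document.
leaves_in_document <- xml_find_all(start, "//*[not(*)]")

print(xml_name(leaves_below_start))
print(length(leaves_in_document))
```

The same distinction applies to html_nodes(node, xpath = ...) in rvest, since it passes the expression to xml2.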

Web scraping rvest problems basketball players

Question: I'm having trouble reading the data from the URL https://www.basketball-reference.com/leagues/NBA_2020_totals.html#totals_stats::pts. Here's the code:

    library(rvest)
    url <- "https://www.basketball-reference.com/leagues/NBA_2020_totals.html#totals_stats::pts"
    pagina <- read_html(url, as.data.frame=T, stringsAsFactors = TRUE, encoding = "utf-8")
    pagina %>% html_nodes("table") %>% .[[1]] %>% html_table(fill=T) -> x

This reads the table, but I don't know why it pastes a few rows like this: Rk

r rvest error: “Error in doc_namespaces(doc) : external pointer is not valid”

Question: My question is similar to this one, but the latter did not receive an answer I can work with. I am scraping thousands of URLs with xml2::read_html. This works fine. But when I try to parse the resulting HTML documents using purrr::map_df and html_nodes, I get the following error: Error in doc_namespaces(doc) : external pointer is not valid For some reason, I am unable to reproduce the error using examples. The example below is not good, because it works totally fine. But if someone could