Question
I am trying to piece together how rvest is used. I thought I had it, but every result I get back is null.
I am using @RonakShah's example (Loop with rvest) as my base and thought I'd try to expand it to collect the name, telephone number and opening hours for each day instead:
site = "https://concreteplayground.com/auckland/bars/archie-brothers-cirque-electriq"
get_phone <- function(url) {
  webpage <- site %>% read_html()
  name <- webpage %>% html_nodes('p.name') %>% html_text() %>% trimws()
  telephone <- webpage %>% html_nodes('p.telephone') %>% html_text() %>% trimws()
  monday <- webpage %>% html_nodes('p.day a') %>% html_text() %>% trimws()
  tuesday <- webpage %>% html_nodes('p.day a') %>% html_text() %>% trimws()
  wednesday <- webpage %>% html_nodes('p.day a') %>% html_text() %>% trimws()
  thursday <- webpage %>% html_nodes('p.day a') %>% html_text() %>% trimws()
  friday <- webpage %>% html_nodes('p.day a') %>% html_text() %>% trimws()
  saturday <- webpage %>% html_nodes('p.day a') %>% html_text() %>% trimws()
  sunday <- webpage %>% html_nodes('p.day a') %>% html_text() %>% trimws()
  data.frame(telephone, monday, tuesday, wednesday, thursday, friday, saturday, sunday)
}
get_phone(site)
But I can't get any of these to work, even individually. I can't even get it to read in a day, or an incorrect phone number. Could someone help point out why?
Answer 1:
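Right-click on the webpage, select Inspect, and check the HTML of the page. Find the element that you want to extract and use a CSS selector that matches it. The selectors in the question (such as p.telephone) apparently do not match any element on this page, which would explain why nothing comes back.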
library(rvest)
site <- "https://concreteplayground.com/auckland/bars/archie-brothers-cirque-electriq"
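get_phone <- function(url) {
  # Read the page from the url argument
  webpage <- read_html(url)
  # The phone number sits in a <span> carrying itemprop="telephone"
  phone <- webpage %>% html_nodes('span[itemprop="telephone"]') %>% html_text()
  # The opening hours are JSON in the data-times attribute of div.open-hours
  opening_hours <- webpage %>%
    html_nodes('div.open-hours') %>%
    html_attr('data-times') %>%
    jsonlite::fromJSON()
  list(phone_number = phone, opening_hours = opening_hours)
}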
get_phone(site)
#$phone_number
#[1] "+64 800 888 386"
#
#$opening_hours
#  weekday time_from time_to
#1       1     12:00   00:00
#2       2     12:00   00:00
#3       3     12:00   00:00
#4       4     12:00   00:00
#5       5     12:00   00:00
#6       6     10:00   00:00
#7       0     10:00   00:00
The opening hours are stored as JSON in the data-times attribute of the div.open-hours element, which is helpful: we don't have to scrape each day individually and bind the results together.
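If you want day names rather than weekday numbers, something along these lines might work. This is only a sketch: times_json is a made-up sample shaped like the output above (not the page's real attribute value), and the mapping 0 = Sunday through 6 = Saturday is an assumption based on how the numbers appear in that output.

```r
library(jsonlite)

# Hypothetical sample of a data-times value, shaped like the output above
times_json <- '[
  {"weekday": 1, "time_from": "12:00", "time_to": "00:00"},
  {"weekday": 6, "time_from": "10:00", "time_to": "00:00"},
  {"weekday": 0, "time_from": "10:00", "time_to": "00:00"}
]'

# An array of JSON objects is simplified to a data frame
hours <- fromJSON(times_json)

# Assumption: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# R vectors are 1-indexed, so shift the weekday number by one
hours$day <- day_names[as.integer(hours$weekday) + 1]

hours[, c("day", "time_from", "time_to")]
```

With the real page you would replace times_json with the string returned by html_attr('data-times').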
Source: https://stackoverflow.com/questions/63007915/rvest-returning-null-values