Question
I want to scrape some "pro" Billboard data, access to which requires a premium Billboard account. I already have one, but obviously I need to log in to the Billboard website through R in order to be able to scrape this data.
I have no issues doing this with regular login pages (for instance, Stack Overflow):
##### Stackoverflow login #####
# Packages installation and loading ---------------------------------------
if (!require("pacman")) install.packages("pacman")
pacman::p_load(rvest, dplyr, tidyr)
# Address of the login webpage
login_test <- "https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f"
# Create a web session with the desired login address
pgsession_test <- html_session(login_test)
pgform_test <- html_form(pgsession_test)[[2]] # in this case the login form is the 2nd form on the page
filled_form_test <- set_values(pgform_test, email = "myemail", password = "mypassword")
filled_form_test$fields[[5]]$type <- "button"
submit_form(pgsession_test, filled_form_test)
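For reference, here is a minimal sketch of the same flow with the newer rvest API (assuming rvest >= 1.0, where html_session(), set_values() and submit_form() were superseded by session(), html_form_set() and session_submit()):
library(rvest)
login_test <- "https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f"
sess_test <- session(login_test)                   # open a session on the login page
form_test <- html_form(sess_test)[[2]]             # the login form (2nd form on the page)
form_test <- html_form_set(form_test, email = "myemail", password = "mypassword")
sess_test <- session_submit(sess_test, form_test)  # submit and keep the authenticated session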
The issue with the Billboard website is that the login is done through a clickable "sign up" button in the header of the page, which triggers a pop-up window where the user can type in an email address and password and log in.
So far, I've tried to guess where the login form is in the HTML output of the Billboard page, as it is not obvious, but I don't think it actually appears there, and I suspect scraping the HTML of pop-up windows might require a specific process.
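Since the pop-up form seems to be injected by JavaScript rather than present in the static HTML, one possible alternative (only a sketch; the endpoint URL and field names below are placeholders that would have to be confirmed from the request the pop-up actually sends, e.g. in the browser's network tab) would be to send the login request directly with httr and reuse the resulting cookies:
library(httr)
# Placeholder endpoint and field names -- NOT the real Billboard login API
login_url <- "https://www.billboard.com/some/login/endpoint"
resp <- POST(login_url,
             body = list(email = "myemailaddress", password = "mypassword"),
             encode = "form")  # or "json", depending on what the site expects
status_code(resp)  # 200 would suggest the credentials were accepted
# A successful response sets session cookies that later GET requests can reuse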
Here is what I've done so far:
##### Billboard scraping #####
# Packages installation and loading ---------------------------------------
if (!require("pacman")) install.packages("pacman")
pacman::p_load(rvest, dplyr, tidyr)
# Session setup (required to scrape restricted-access web pages) -----------
login <- "https://www.billboard.com/myaccount"
# Create a web session with the desired login address
pgsession <- html_session(login)
pgform <- html_form(pgsession)[[2]]
pgform$fields[[2]]$value <- 'myemailaddress'
pgform$fields[[1]]$value <- 'mypassword'
filled_form <- set_values(pgform)
# Manually build a submit button and add it to the form
fake_submit_button <- list(name = NULL,
                           type = "submit",
                           value = NULL,
                           checked = NULL,
                           disabled = NULL,
                           readonly = NULL,
                           required = FALSE)
attr(fake_submit_button, "class") <- "input"
filled_form[["fields"]][["submit"]] <- fake_submit_button
# filled_form$fields[[3]]$type <- "button"
submit_form(pgsession, filled_form)
The returned error is:
Warning message:
In request_GET(session, url = url, query = request$values, ...) :
Not Found (HTTP 404).
I understand this as simply the result of not using the right login form, which I also suspect is not available in my HTML output (given by pgform <- html_form(pgsession)[[2]] in the code above).
Note that I've also tried with pgform<-html_form(pgsession)[[1]].
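To double-check whether any login form is present in the static HTML at all, the forms rvest sees can be listed together with their field names (a small diagnostic sketch, reusing the pgsession created above):
forms <- html_form(pgsession)                 # every <form> found in the static page
length(forms)                                 # how many forms there are
lapply(forms, function(f) names(f$fields))    # field names of each form, to spot email/password inputs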
Thank you in advance for your help.
Source: https://stackoverflow.com/questions/61030629/login-to-a-website-billboard-com-for-scraping-purposes-using-r-when-the-login