Login to a website (billboard.com) for scraping purposes using R, when the login is done through a pop-up window

Submitted by 限于喜欢 on 2020-04-11 06:45:20

Question


I want to scrape some "pro" Billboard data, access to which requires a premium Billboard account. I already have one, but obviously I need to log in to the billboard.com website through R in order to be able to scrape this data.
I have no issues doing this with regular login pages (for instance, Stack Overflow):

##### Stackoverflow login #####
# Packages installation and loading ---------------------------------------

# pacman installs (if needed) and loads the packages in one call
if (!require("pacman")) install.packages("pacman")
pacman::p_load(rvest, dplyr, tidyr)


#Address of the login webpage
login_test<-"https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f"

# create a web session with the desired login address
pgsession_test <- html_session(login_test)
pgform_test <- html_form(pgsession_test)[[2]]  # in this case the login form is the 2nd form on the page
filled_form_test <- set_values(pgform_test, email = "myemail", password = "mypassword")

# the submit field must look like a button for submit_form() to accept it
filled_form_test$fields[[5]]$type <- "button"

submit_form(pgsession_test, filled_form_test)

The issue with the Billboard website is that the login is done through a clickable "sign up" button in the page header, which triggers a pop-up window where the user can type in an email address and password to log in.
So far, I've tried to find the login form in the HTML output of the Billboard page, as it is not obvious; I don't think it actually appears there at all, and I suspect that scraping the HTML of pop-up windows requires a specific process.
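To illustrate what I mean, here is a minimal, self-contained example (using rvest's `minimal_html()` helper, not the real Billboard page): when a form is only built client-side by JavaScript, `html_form()` finds nothing in the static source.

```r
library(rvest)

# A toy page mimicking a JS-driven login popup: the static HTML contains
# only the button and the script, so there is no <form> to fill in.
page <- minimal_html('
  <button id="signin">Sign in</button>
  <script>/* the popup form is built client-side */</script>')

html_form(page)  # empty list - rvest only sees the static source
```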
Here is what I've done so far:

##### Billboard scraping #####
# Packages installation and loading ---------------------------------------

# pacman installs (if needed) and loads the packages in one call
if (!require("pacman")) install.packages("pacman")
pacman::p_load(rvest, dplyr, tidyr)



# Session setup (required to scrape restricted-access web pages) ----------
login <- "https://www.billboard.com/myaccount"

# create a web session with the desired login address
pgsession <- html_session(login)
pgform <- html_form(pgsession)[[2]]
pgform$fields[[2]]$value <- 'myemailaddress'
pgform$fields[[1]]$value <- 'mypassword'

filled_form <- set_values(pgform)

fake_submit_button <- list(name = NULL,
                           type = "submit",
                           value = NULL,
                           checked = NULL,
                           disabled = NULL,
                           readonly = NULL,
                           required = FALSE)
attr(fake_submit_button, "class") <- "input"
filled_form[["fields"]][["submit"]] <- fake_submit_button

# filled_form$fields[[3]]$type <- "button"
submit_form(pgsession, filled_form)

The returned error is:
Warning message: In request_GET(session, url = url, query = request$values, ...) : Not Found (HTTP 404).
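One thing I can at least check before submitting is where `submit_form()` will actually POST to: the parsed form object stores its action URL, and a 404 usually means that target does not exist. A minimal sketch on a toy form (in the older rvest API this is `form$url`; in rvest ≥ 1.0 it is `form$action`):

```r
library(rvest)

# Toy form showing where the POST target is stored on the parsed form object
page <- minimal_html('
  <form action="/login" method="post">
    <input type="email" name="email">
    <input type="password" name="password">
  </form>')

form <- html_form(page)[[1]]
target <- if (!is.null(form$action)) form$action else form$url
target  # the POST target; if this is wrong on the real page, submit_form() 404s
```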

I understand this simply as the result of not using the right login form, which I also suspect is not available in my HTML output (given by pgform <- html_form(pgsession)[[2]] in the code above).
Note that I've also tried with pgform <- html_form(pgsession)[[1]].
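Since the form is likely injected by JavaScript, one alternative I'm considering is to replicate the POST request the popup sends, after finding the endpoint and field names in the browser's DevTools network tab. A sketch with httr2 (the URL and field names below are placeholders I made up, not the real Billboard endpoint):

```r
library(httr2)

# Placeholder endpoint and field names - replace with the values observed
# in the browser's network tab when logging in manually.
req <- request("https://www.billboard.com/some-login-endpoint") |>
  req_method("POST") |>
  req_body_form(email = "myemailaddress", password = "mypassword")

# resp <- req_perform(req)  # sends the request; persist the session
#                           # cookie across requests with req_cookie_preserve()
req
```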
Thank you in advance for your help.

Source: https://stackoverflow.com/questions/61030629/login-to-a-website-billboard-com-for-scraping-purposes-using-r-when-the-login
