Automating the login to the UK Data Service website in R with RCurl or httr

爱一瞬间的悲伤 2020-12-25 14:19

I am in the process of writing a collection of freely-downloadable R scripts for http://asdfree.com/ to help people analyze the complex sample survey data hosted by the UK d

2 Answers
  • 2020-12-25 14:39

    The form submits the variables action and origin, not combobox. Give action the value "selection" and origin the value of the relevant entry in combobox:

    y <- GET( z$url, query = list( action="selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp") )
    > y$url
    [1] "https://shib.data-archive.ac.uk:443/idp/Authn/UserPassword"
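
    To see what that query argument amounts to, here is a minimal base R sketch that assembles the same query string by hand. build_query is a hypothetical helper written for this illustration, not part of httr; httr does its own URL encoding when you pass query = list(...), so you would not normally need this.

    ```r
    # Hypothetical helper: URL-encode each value and join the pairs as name=value&name=value
    build_query <- function(params) {
      encoded <- vapply(
        params,
        function(v) utils::URLencode(as.character(v), reserved = TRUE),
        character(1)
      )
      paste(names(params), encoded, sep = "=", collapse = "&")
    }

    # The two fields the form actually submits (values taken from the answer above)
    query_string <- build_query(list(
      action = "selection",
      origin = "https://shib.data-archive.ac.uk/shibboleth-idp"
    ))

    print(query_string)
    # "action=selection&origin=https%3A%2F%2Fshib.data-archive.ac.uk%2Fshibboleth-idp"
    ```

    Note that combobox does not appear anywhere in the string; only action and origin are sent.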
    

    Edit

    It looks as though the handle pool isn't keeping your session alive correctly, so you need to pass the handles explicitly rather than letting httr pick them automatically. For the POST command you also need to set multipart=FALSE: HTML forms are URL-encoded by default, whereas httr's POST defaults to multipart encoding because it is mainly designed for uploading files. So:

    y <- GET( handle=z$handle, query = list( action="selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp") )
    POST(body=values,multipart=FALSE,handle=y$handle)
    Response [https://www.esds.ac.uk/]
      Status: 200
      Content-type: text/html
    
    ...snipped...    
    
    
                    <title>
    
                            Introduction to ESDS
    
                    </title>
    
                    <meta name="description" content="Introduction to the ESDS, home page" />
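
    The two steps above can be wrapped into one function that reuses a single handle, so the session cookies set by the GET are sent with the POST. This is only a sketch under assumptions: login_sketch and its argument names are hypothetical, values stands for the list of login form fields from the question, and encode = "form" is what newer httr versions use in place of multipart=FALSE. The function is defined but not called here, since calling it needs network access and real credentials.

    ```r
    # Hypothetical wrapper; httr must be installed before this is called.
    login_sketch <- function(wayf_url, origin, values) {
      # One explicit handle for the whole exchange, instead of relying on the handle pool
      h <- httr::handle(wayf_url)

      # Step 1: pick the organisation; only action and origin are submitted
      y <- httr::GET(
        url = wayf_url,
        handle = h,
        query = list(action = "selection", origin = origin)
      )

      # Step 2: post the credentials as a URL-encoded form to the page we were sent to,
      # on the same handle so the cookies from step 1 are included
      httr::POST(
        url = y$url,
        handle = h,
        body = values,
        encode = "form"
      )
    }
    ```

    If your httr version predates the encode argument, keep multipart=FALSE as in the answer's own call.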
    
  • 2020-12-25 14:39

    I think one way to get past the "enter your organization" page goes like this:

    library(tidyverse)
    library(rvest)
    library(httr)     # handle_reset(), cookies() and set_cookies() come from httr
    library(stringr)
    
    org <- "your_organization"
    user <- "your_username"
    password <- "your_password"
    
    signin <- "http://esds.ac.uk/newRegistration/newLogin.asp"
    handle_reset(signin)
    
    # get to org page and enter org
    p0 <- html_session(signin) %>% 
        follow_link("Login")
    org_link <- html_nodes(p0, "option") %>% 
        str_subset(org) %>% 
        str_match('(?<=\\")[^"]*') %>%
        as.character()
    
    f0 <- html_form(p0) %>%
        first() %>%
        set_values(origin = org_link)
    fake_submit_button <- list(name = "submit-btn",
                               type = "submit",
                               value = "Continue",
                               checked = NULL,
                               disabled = NULL,
                               readonly = NULL,
                               required = FALSE)
    attr(fake_submit_button, "class") <- "btn-enabled"
    f0[["fields"]][["submit"]] <- fake_submit_button
    
    c0 <- cookies(p0)$value
    names(c0) <- cookies(p0)$name
    p1 <- submit_form(session = p0, form = f0, config = set_cookies(.cookies = c0))
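
    To make the org_link step easier to follow, here is a small base R sketch of the same idea: keep the option whose text mentions the organization, then take whatever sits between the first pair of double quotes. The option strings and URLs below are made up for illustration, and grep(..., fixed = TRUE) matches the name literally, whereas str_subset treats it as a regular expression.

    ```r
    # Hypothetical <option> nodes, as they would look after being converted to text
    option_nodes <- c(
      '<option value="https://idp.example.ac.uk/shibboleth">Example University</option>',
      '<option value="https://login.another.example/idp">Another College</option>'
    )
    org <- "Example University"

    # Keep only the option that mentions the organization (literal match)
    hit <- grep(org, option_nodes, fixed = TRUE, value = TRUE)

    # Lookbehind for a double quote, then everything up to the next double quote
    org_link <- regmatches(hit, regexpr('(?<=")[^"]*', hit, perl = TRUE))

    print(org_link)
    # "https://idp.example.ac.uk/shibboleth"
    ```

    If the organization name matches more than one option, hit has several elements and org_link will too, so it is worth checking length(org_link) == 1 before setting the form value.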
    

    Unfortunately, that doesn't solve the whole problem; (2) is harder than it looks. I've posted more of what I think is a solution under another question, "R: use rvest (or httr) to log in to a site requiring cookies". Hopefully someone will help us get the rest of the way.
