How to log in and then download a file from ASPX web pages with R

生来不讨喜 2020-11-30 04:39

I'm trying to automate the download of the Panel Study of Income Dynamics files available on this web page using R. Clicking on any of those files takes the user

1 answer
  • 2020-11-30 05:05

    Besides storing the cookie after authentication (see my comment above), there was another problem with your solution: the ASP.NET site puts a __VIEWSTATE value in a hidden form field of the page, and that value has to be sent back with your requests. If you check, your example never actually logged in (the result of the POST command only holds info about how to log in, just check it out).
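To make the hidden-field idea concrete, here is a minimal base R sketch of pulling such a value out of the page source. The helper name `extract_hidden_field` and the sample HTML are made up for illustration, and the pattern assumes the `id` attribute comes before `value` inside the tag, which may not hold for every page.

```r
# Hypothetical helper (not part of RCurl): return the value of a hidden
# form field such as __VIEWSTATE, or NA if the field is not found.
# Assumes the field name contains no regex metacharacters and that
# id="..." appears before value="..." in the same tag.
extract_hidden_field <- function(html, field) {
  pattern <- sprintf('id="%s"[^>]*value="([^"]*)"', field)
  m <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Made-up page fragment, only to show the call
sample_html <- '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123+/=" />'
viewstate <- extract_hidden_field(sample_html, "__VIEWSTATE")
```

Returning NA instead of the whole page when nothing matches makes a failed extraction easy to detect before the value is posted back.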

    An outline of a possible solution:

    1. Load the RCurl package:

      > library(RCurl)
      
    2. Set some handy curl options:

      > curl <- getCurlHandle()
      > curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = curl)
      
    3. Load the page for the first time to capture VIEWSTATE:

      > html <- getURL('http://simba.isr.umich.edu/u/Login.aspx', curl = curl)
      
    4. Extract VIEWSTATE with a regular expression or any other tool:

      > viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
      
    5. Set the form parameters: your username, your password and the VIEWSTATE:

      > params <- list(
          'ctl00$ContentPlaceHolder3$Login1$UserName'    = '<USERNAME>',
          'ctl00$ContentPlaceHolder3$Login1$Password'    = '<PASSWORD>',
          'ctl00$ContentPlaceHolder3$Login1$LoginButton' = 'Log In',
          '__VIEWSTATE'                                  = viewstate
          )
      
    6. Log in at last:

      > html <- postForm('http://simba.isr.umich.edu/u/Login.aspx', .params = params, curl = curl)
      

      Congrats, now you are logged in and curl holds the cookie verifying that!

    7. Verify that you are logged in:

      > grepl('Logout', html)
      [1] TRUE
      
    8. So you can go ahead and download any file - just be sure to pass curl = curl in all your queries.
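The last step can be sketched roughly as follows, assuming `curl` is the handle from the steps above that already holds the login cookie. The wrapper name `download_with_session`, the example URL and the file name are placeholders, not something taken from the site.

```r
library(RCurl)

# Hypothetical wrapper: fetch a file as raw bytes through an existing
# curl handle (so the session cookie is sent) and write it to disk.
download_with_session <- function(url, destfile, curl) {
  bin <- getBinaryURL(url, curl = curl)  # raw vector with the response body
  writeBin(bin, destfile)                # write the bytes unchanged
  invisible(destfile)
}

# Usage (placeholder URL and file name; replace with the real download link):
# download_with_session('http://simba.isr.umich.edu/<PATH-TO-FILE>',
#                       'psid_file.zip', curl = curl)
```

Reading the body with `getBinaryURL` rather than `getURL` avoids treating archive contents as text.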
