Scraping from a website that requires a login?

旧时难觅i 2020-12-17 04:30

Can this be done, and if so, how? I want to scrape data from xbox.com, but the pages I need only appear after a successful login.

5 answers
  • 2020-12-17 05:08

    There are several ways to log in automatically, some more complicated than others. xbox.com probably uses the Windows Live API, so you'll have to look into the documentation for that.

  • 2020-12-17 05:10

    It can be done in theory, provided you have a web-fetching class that supports cookies. It looks like PHP's HTTP_Request2 from PEAR can send cookies if you provide the cookie information as part of the request. All you should need to do is:

    • Send a login request
    • Extract the cookie data from the HTTP headers of the response to the above request
    • Set this cookie data on subsequent requests
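
    As a rough illustration of the second and third steps, here is a minimal sketch in plain PHP that pulls the name=value pairs out of `Set-Cookie` response headers and builds the `Cookie` header for later requests. The helper names (`extractCookies`, `buildCookieHeader`) and the sample headers are made up for this example and are not part of HTTP_Request2; cookie attributes such as expiry, domain and path are deliberately ignored.

```php
<?php
// Hypothetical helper: collect name => value pairs from raw response header lines.
function extractCookies(array $headerLines): array
{
    $cookies = [];
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, so match without regard to case.
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue;
        }
        // Keep only the leading "name=value"; drop attributes such as Path or HttpOnly.
        $pair = trim(explode(';', substr($line, strlen('Set-Cookie:')), 2)[0]);
        $parts = explode('=', $pair, 2);
        if (count($parts) === 2 && $parts[0] !== '') {
            $cookies[$parts[0]] = $parts[1];
        }
    }
    return $cookies;
}

// Hypothetical helper: build the Cookie request header from the collected pairs.
function buildCookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

// Made-up headers standing in for a real login response.
$loginResponseHeaders = [
    'HTTP/1.1 302 Found',
    'Set-Cookie: session_id=abc123; Path=/; HttpOnly',
    'Set-Cookie: lang=en; Path=/',
    'Location: /dashboard',
];

$cookies = extractCookies($loginResponseHeaders);
echo buildCookieHeader($cookies), "\n"; // prints "Cookie: session_id=abc123; lang=en"
```

    In practice a cookie-aware HTTP client does this bookkeeping for you; the sketch only shows what "extract" and "set" amount to at the header level.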

    Note that many sites use anti-scraping techniques of varying sophistication, which may make this more difficult. It may also be illegal, immoral or contrary to the site's user agreement.

  • 2020-12-17 05:17

    Most login forms set a cookie, so you should use an HTTP class like Zend_Http that can store cookies for further requests. It's presumably as simple as:

    require_once 'Zend/Http/Client.php'; // Zend Framework 1 must be on the include path

    $client = new Zend_Http_Client();
    $client->setCookieJar();   // this is the crucial part for "logging in"
    
    // make login request
    $client->setUri("http://xbox.com/login");
    $client->setParameterPost("login", "hackz0r");
    $result = $client->request('POST');
    
    // go scraping
    ...
    
  • 2020-12-17 05:17

    The PHP library PGBrowser can do this fairly easily. Below is a demo snippet taken from the companion blog. I believe it won't work with the Xbox website because Microsoft now uses SSO, but it is still applicable to other websites with content behind login forms.

    require 'pgbrowser.php';
    
    $b = new PGBrowser();
    $b->useCache = true;
    
    $page = $b->get('http://yoursite.com/login'); // Retrieve login web page
    $form = $page->forms(1); // Retrieve form
    
    // Note the form field names have to be specified
    $form->set('username', "your_username_or_email");
    $form->set('password', "your_password");
    $page = $form->submit(); // Submit form
    
    echo $page->html; // This shows the web page normally displayed after successful login, e.g. dashboard
    
  • 2020-12-17 05:33

    You will have to go through the required login transaction by sending POST data with your cURL requests. That said, it is a bad idea to scrape data from behind a login: the site didn't make that information public for a reason, and doing so might constitute copyright infringement.
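
    A minimal sketch of that login transaction with PHP's cURL extension might look like the following. The function names, URLs, form field names and cookie-jar path are all placeholders chosen for this example; a real site will have its own field names and may require extra hidden fields or tokens.

```php
<?php
// Hypothetical helper: URL-encode the login form fields for the POST body.
function buildLoginBody(array $fields): string
{
    return http_build_query($fields);
}

// Hypothetical helper: POST the login form, then fetch a protected page with
// the same handle so the session cookies are sent along. Requires ext-curl.
// Returns the page body as a string, or false on failure.
function fetchAfterLogin(string $loginUrl, array $fields, string $protectedUrl, string $cookieJar)
{
    $ch = curl_init($loginUrl);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => buildLoginBody($fields),
        CURLOPT_COOKIEJAR      => $cookieJar,  // cookies are written here when the handle is freed
        CURLOPT_COOKIEFILE     => $cookieJar,  // enables cURL's cookie engine for this handle
        CURLOPT_FOLLOWLOCATION => true,        // login forms usually answer with a redirect
        CURLOPT_RETURNTRANSFER => true,
    ]);
    if (curl_exec($ch) === false) {
        return false;
    }

    // Switch the same handle back to GET and request the protected page.
    curl_setopt_array($ch, [
        CURLOPT_URL     => $protectedUrl,
        CURLOPT_HTTPGET => true,
    ]);
    return curl_exec($ch);
}

// Example call (placeholder values, not run here):
// $html = fetchAfterLogin(
//     'https://example.com/login',
//     ['username' => 'your_username', 'password' => 'your_password'],
//     'https://example.com/account',
//     sys_get_temp_dir() . '/cookies.txt'
// );
```

    Reusing one handle keeps the session cookies in memory between the two requests; the jar file only matters if a later script run needs them again.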
