Can this be done, and if so, how? I want to scrape data from xbox.com, but the pages I need only appear after a successful login.
There are several ways to log in automatically, some more complicated than others. xbox.com probably uses the Windows Live API, so you'll have to look into the documentation for that.
It can be done in theory, provided you have a web fetching class that supports cookies. It looks like PEAR's HTTP_Request2 package for PHP can send cookies if you provide the cookie information as part of the request. All you should need to do would be:
Note that many sites use anti-scraping techniques of varying sophistication, which may make this more difficult. It may also be illegal, immoral or contrary to the site's user agreement.
Most login forms will set a cookie, so you should use an HTTP class like Zend_Http that can store cookies for further requests. It's presumably as simple as:
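require_once 'Zend/Http/Client.php'; // Zend Framework 1
$client = new Zend_Http_Client();
$client->setCookieJar(); // this is the crucial part for "logging in"
// make login request (the field names are placeholders; use the ones from the real form)
$client->setUri("http://xbox.com/login");
$client->setParameterPost("login", "hackz0r");
$client->setParameterPost("password", "your_password");
$result = $client->request('POST');
// go scraping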
...
The PHP library PGBrowser can get this done pretty easily. Below is a demo code snippet taken from the companion blog. I believe this won't work with the Xbox website because Microsoft now uses SSO, but it is still applicable to other websites with content behind login forms.
require 'pgbrowser.php';
$b = new PGBrowser();
$b->useCache = true;
$page = $b->get('http://yoursite.com/login'); // Retrieve login web page
$form = $page->forms(1); // Retrieve form
// Note the form field names have to be specified
$form->set('username', "your_username_or_email");
$form->set('password', "your_password");
$page = $form->submit(); // Submit form
echo $page->html; // This shows the web page normally displayed after successful login, e.g. dashboard
You will have to go through the required login transaction by sending POST data with your cURL requests. That said, it is a bad idea to scrape data from behind a login: the site kept that information out of public view for a reason, and scraping it might constitute copyright infringement.
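For what it's worth, here is a minimal sketch of that cURL approach using PHP's curl extension. The function names, the example.com URLs and the username/password field names are all made up for illustration; a real site will have its own form fields, possibly hidden tokens, and (as with Xbox) may use SSO that a plain form POST cannot handle.

```php
<?php
// Hypothetical helper: builds the cURL options for a form login POST.
// $fields maps form field names to values; take the names from the real form.
function buildLoginOptions(string $loginUrl, array $fields): array
{
    return [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields), // URL-encoded form body
        CURLOPT_COOKIEFILE     => '',   // empty string turns on in-memory cookie handling
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // login forms usually redirect on success
    ];
}

// Hypothetical helper: logs in, then fetches a page with the same handle
// so that the session cookies set by the login response are sent along.
function fetchBehindLogin(string $loginUrl, array $fields, string $targetUrl): string
{
    $ch = curl_init();
    curl_setopt_array($ch, buildLoginOptions($loginUrl, $fields));
    if (curl_exec($ch) === false) {
        throw new RuntimeException('Login request failed: ' . curl_error($ch));
    }

    // Switch the handle back to GET and request the protected page.
    curl_setopt_array($ch, [
        CURLOPT_URL     => $targetUrl,
        CURLOPT_HTTPGET => true,
    ]);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('Page request failed: ' . curl_error($ch));
    }
    return $html;
}
```

You would call it along the lines of fetchBehindLogin('https://example.com/login', ['username' => '...', 'password' => '...'], 'https://example.com/dashboard'). Reusing one handle keeps the cookies in memory; if you need them across separate script runs, CURLOPT_COOKIEJAR with a file path is the usual alternative.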