There are public datasets availbale on Amazon:
http://aws.amazon.com/publicdatasets/
I would suggest to consider running demo cluster there - and thus to save downloading.
There is also good dataset of the crowled web from Common Crawl, which is also available on amazon s3. http://commoncrawl.org/