How to 'Grab' content from another website

A friend has asked me this, and I couldn't answer.

He asked: I am making this site where you can archive your site...

It works like this: you enter your site, say something.com, and our site grabs the content on that website (images and so on) and uploads it to our site. People can then view an exact copy of the site at oursite.com/something.com, even if the server hosting something.com is down.

How could he do this (in PHP, perhaps), and what would some of the requirements be?

It sounds like you need to create a web crawler. Web crawlers can be written in any language, although I would recommend C++ (using cURL), Java (using URLConnection), or Python (with urllib2) for that. You could probably also hack something together quickly with the curl or wget commands and Bash, although that is probably not the best long-term solution. Also, don't forget that whenever you crawl someone's website you should download, parse, and respect the "robots.txt" file if it is present.
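As a rough illustration of the robots.txt advice, here is a minimal sketch in Python 3 using the standard library's urllib.robotparser. The user agent name "ArchiveBot" and the helper names are made up for this example; a real crawler would first download the file from the target site's /robots.txt.

```python
from urllib.robotparser import RobotFileParser


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical rules, for illustration only.
    rules = "User-agent: *\nDisallow: /private/\n"
    parser = build_robots_parser(rules)
    print(is_allowed(parser, "ArchiveBot", "http://something.com/index.html"))
    print(is_allowed(parser, "ArchiveBot", "http://something.com/private/a.html"))
```

The crawler would call is_allowed before every request and skip any URL for which it returns False.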

Use wget, either the Linux version or the Windows version from the GnuWin32 package.
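If wget is installed, one possible way to drive it from a script is sketched below. The helper names and the destination directory are assumptions for this example; the flags themselves are standard GNU wget options for mirroring a site along with its images and stylesheets.

```python
import subprocess


def build_wget_command(url: str, dest_dir: str) -> list:
    """Build a wget argument list that mirrors url into dest_dir."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--page-requisites",   # also fetch images, CSS, and scripts
        "--convert-links",     # rewrite links so the copy works locally
        "--adjust-extension",  # save HTML pages with an .html suffix
        "--no-parent",         # never climb above the starting directory
        "--directory-prefix=" + dest_dir,
        url,
    ]


def mirror_site(url: str, dest_dir: str) -> None:
    """Run wget; raises CalledProcessError if wget exits with an error."""
    subprocess.run(build_wget_command(url, dest_dir), check=True)


if __name__ == "__main__":
    # Only prints the command; call mirror_site() to actually download.
    print(" ".join(build_wget_command("http://something.com/", "archive")))
```

By default wget honours robots.txt during recursive downloads, so this approach also covers that requirement.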

Fetch the HTML using curl.
Now change all the image, CSS, and JavaScript URLs to absolute URLs if they are relative. (This is a bit unethical.) You can also fetch all these assets and host them from your own site.
Respect the "robots.txt" of every site.
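The URL step above can be sketched in Python 3 with only the standard library. The class and function names are made up for this example, and it looks only at img, script, and link tags; a fuller version would also need to handle URLs inside CSS files and inline styles.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect absolute URLs of images, scripts, and linked stylesheets."""

    # Tag name -> attribute that holds the asset URL.
    ASSET_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        wanted = self.ASSET_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # urljoin leaves absolute URLs alone and resolves relative ones.
                self.assets.append(urljoin(self.base_url, value))


def collect_assets(html: str, base_url: str) -> list:
    """Return the absolute URL of every asset referenced in html."""
    collector = AssetCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.assets


if __name__ == "__main__":
    page = '<link href="/css/site.css"><img src="../img/logo.png">'
    for url in collect_assets(page, "http://something.com/blog/post.html"):
        print(url)
```

Each collected URL can then be downloaded and stored alongside the archived page, with the page's references rewritten to point at the local copies.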

Source: https://stackoverflow.com/questions/3382134/how-to-grab-content-from-another-website

Tags

web-crawler