How to 'Grab' content from another website

Submitted by 痴心易碎 on 2019-12-02 17:01:01

Question


A friend has asked me this, and I couldn't answer.

He asked: I am making this site where you can archive your site...

It works like this: you enter your site, say something.com, and then our site grabs the content on that website (images and all that) and uploads it to our site. People can then view an exact copy of the site at oursite.com/something.com, even if the server hosting something.com is down.

How could he do this? (PHP?) And what would the requirements be?


Answer 1:


It sounds like you need to create a web crawler. Web crawlers can be written in almost any language, although I would recommend C++ (with cURL), Java (with URLConnection), or Python (with urllib2). You could probably also hack something together quickly with the curl or wget commands and bash, although that is probably not the best long-term solution. Also, whenever you crawl someone's website, don't forget to download, parse, and respect their "robots.txt" file if one is present.
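For illustration, here is a minimal sketch of such a crawler in Python 3 (where urllib2 has become urllib.request). It fetches robots.txt first, follows links breadth-first, and stays on a single host; the starting URL, page limit, and class names are illustrative placeholders, not part of the original answer:

```
# Minimal single-site crawler sketch (Python 3). Error handling kept small.
from urllib import request, robotparser
from urllib.parse import urljoin, urlparse
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href/src attribute values so the crawl can follow them."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def crawl(start_url, max_pages=50):
    root = urlparse(start_url)
    # Fetch and respect robots.txt before downloading anything else.
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{root.scheme}://{root.netloc}/robots.txt")
    rp.read()

    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen or not rp.can_fetch("*", url):
            continue
        seen.add(url)
        try:
            with request.urlopen(url) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == root.netloc:  # stay on one site
                queue.append(absolute)
    return pages

if __name__ == "__main__":
    archived = crawl("https://example.com/")
    print(f"archived {len(archived)} pages")
```

A real archiver would also save the downloaded assets to disk and add politeness delays between requests, but the shape of the loop is the same.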




Answer 2:


Use wget: either the Linux version or the Windows build from the GnuWin32 package.
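For example, a typical mirroring invocation looks like this (the domain is a placeholder; the flags are standard wget options):

```
wget --mirror --convert-links --page-requisites --no-parent \
     --wait=1 https://something.com/
```

--mirror turns on recursion with timestamping, --convert-links rewrites links so the saved copy can be browsed offline, --page-requisites downloads the images, stylesheets, and scripts each page needs, and --wait=1 pauses between requests to avoid hammering the server.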




Answer 3:


  1. Fetch the HTML using curl.
  2. Rewrite any relative URLs for images, CSS, and JavaScript to absolute URLs (which amounts to hotlinking and is a bit unethical); alternatively, fetch those assets and host them from your own site. See the sketch after this list.
  3. Respect the "robots.txt" of every site you crawl.
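As a rough sketch of step 2, here is how relative URLs could be rewritten to absolute ones in Python using urljoin. The regex approach is deliberately simplistic and the sample HTML is made up; a real implementation should use a proper HTML parser:

```
# Rewrite relative src/href attribute values against the page's own URL
# so the saved copy still renders. Regexes are fragile on real-world HTML.
import re
from urllib.parse import urljoin

def absolutize(html, page_url):
    """Make every src/href attribute value absolute relative to page_url."""
    def repl(match):
        attr, quote, value = match.group(1), match.group(2), match.group(3)
        return f"{attr}={quote}{urljoin(page_url, value)}{quote}"
    return re.sub(r'\b(src|href)=(["\'])(.*?)\2', repl, html)

html = '<img src="/logo.png"><a href="about.html">About</a>'
print(absolutize(html, "https://something.com/index.html"))
# <img src="https://something.com/logo.png"><a href="https://something.com/about.html">About</a>
```

Note that urljoin leaves already-absolute URLs untouched, so the function is safe to run over a whole page.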


Source: https://stackoverflow.com/questions/3382134/how-to-grab-content-from-another-website
