How to 'Grab' content from another website

Submitted by …衆ロ難τιáo~ on 2019-12-02 09:50:13

It sounds like you need to write a web crawler. Web crawlers can be written in any language, although I would recommend C++ (with cURL's libcurl), Java (with URLConnection), or Python (with urllib2). You could probably also hack something together quickly with the curl or wget commands and Bash, although that is probably not the best long-term solution. Also, whenever you crawl someone's website, don't forget to download, parse, and respect its "robots.txt" file if one is present.
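As a rough illustration of the "check robots.txt, then download" idea, here is a minimal sketch using only Python's standard library. It assumes Python 3, where the old urllib2 functionality lives in urllib.request; the user agent string and the helper names `parse_robots` and `fetch_if_allowed` are made up for this example.

```python
from urllib import robotparser
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Hypothetical crawler name; pick one that identifies your own bot.
USER_AGENT = "example-crawler"


def parse_robots(robots_text):
    """Build a robots.txt parser from text that was already downloaded."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def fetch_if_allowed(url, user_agent=USER_AGENT):
    """Download url only if the site's robots.txt permits it.

    Returns the response body as bytes, or None when fetching is disallowed.
    """
    robots_url = urljoin(url, "/robots.txt")
    parser = robotparser.RobotFileParser(robots_url)
    parser.read()  # downloads and parses robots.txt over the network
    if not parser.can_fetch(user_agent, url):
        return None
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return response.read()
```

Calling `fetch_if_allowed("https://example.com/")` would return the page bytes or `None`. A real crawler would also want rate limiting and error handling, which are left out here.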

Use wget, either the Linux version or the Windows build from the GnuWin32 package.

  1. Fetch the HTML using curl.
  2. Rewrite the URLs of all images, CSS, and JavaScript to absolute URLs wherever they are relative. (Linking straight to the other site's assets like this is a bit unethical.) Alternatively, fetch all of these assets and host them on your own site.
  3. Respect the "robots.txt" of every site you fetch from.
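Step 2 above could be sketched roughly as follows. This is only an illustration: it uses a regular expression rather than a real HTML parser, so it will miss unquoted attributes and URLs inside CSS, and the function name `absolutize` and the example URLs are invented for this sketch.

```python
import re
from urllib.parse import urljoin

# Matches quoted src="..." or href="..." attributes (single or double quotes).
_URL_ATTR = re.compile(r'(\b(?:src|href)\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)


def absolutize(html, base_url):
    """Rewrite relative src/href values in html so they are absolute.

    base_url is the URL the page was fetched from. URLs that are
    already absolute are left unchanged by urljoin.
    """
    def replace(match):
        prefix, quote, value = match.group(1), match.group(2), match.group(3)
        return prefix + quote + urljoin(base_url, value) + quote

    return _URL_ATTR.sub(replace, html)


if __name__ == "__main__":
    page = '<img src="img/a.png"><link href="/css/s.css">'
    print(absolutize(page, "https://example.com/docs/page.html"))
```

If you would rather host the assets yourself, the same absolute URLs tell you what to download; you would then rewrite the attributes to point at your own copies instead.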