How to mirror only a section of a website?

爷,独闯天下 提交于 2019-11-28 13:34:56

问题


I cannot get wget to mirror a section of a website (a folder path below root) - it only seems to work from the website homepage.

I've tried many options - here is one example

wget -rkp -l3 -np  http://somewebsite/subpath/down/here/

While I only want to mirror the content links below that URL - I also need to download all the page assets which are not in that path.

It seems to work fine for the homepage (/) but I can't get it going for any sub folders.


回答1:


Use the --mirror (-m) and --no-parent (-np) options, plus a few of cool ones, like in this example:

wget --mirror --page-requisites --adjust-extension --no-parent --convert-links
     --directory-prefix=sousers http://stackoverflow.com/users



回答2:


I usually use:

wget -m -np -p $url



回答3:


I use pavuk to accomplish mirrors, as it seemed much better for this purpose just from the beginning. You can use something like this:

/usr/bin/pavuk -enable_js -fnrules F '*.php?*' '%o.php' -tr_str_str '?' '_questionmark_' \
               -norobots -dont_limit_inlines -dont_leave_dir \
               http://www.example.com/some_directory/ >OUT 2>ERR



回答4:


Check out archivebox.io, it's an open-source, self-hosted tool that creates a local, static, browsable HTML clone of websites (it saves HTML, JS, media files, PDFs, screenshot, static assets and more).

By default, it only archives the URL you specify, but we're adding a --depth=n flag soon that will let you recursively archive links from the given URL.



来源:https://stackoverflow.com/questions/6145641/how-to-mirror-only-a-section-of-a-website

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!