Download all files of a particular type from a website using wget stops at the starting URL

Submitted by 血红的双手 on 2019-12-21 01:47:29

Question


The following did not work.

wget -r -A .pdf home_page_url

It stops with the following message:

....
Removing site.com/index.html.tmp since it should be rejected.
FINISHED

I don't know why it stops at the starting URL and does not follow the links in it to search for the given file type.

Is there any other way to recursively download all PDF files from a website?


Answer 1:


It may be blocked by a robots.txt. Try adding -e robots=off.

Other possible problems are cookie-based authentication or agent rejection for wget. See these examples.

EDIT: The dot in ".pdf" is wrong according to sunsite.univie.ac.at
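Putting both suggestions together, a corrected sketch of the original command could look like this (home_page_url is a placeholder for the real starting URL; the command is echoed rather than run here, since it needs a reachable site):

```shell
# Corrected sketch of the failing invocation:
#   -r              recurse into links
#   -A pdf          the accepted suffix list is given WITHOUT the leading dot
#   -e robots=off   stop wget from honouring robots.txt, which otherwise
#                   halts recursion on many sites
cmd='wget -r -A pdf -e robots=off home_page_url'
echo "$cmd"
```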




Answer 2:


The following command works for me; it will download the PDFs and images of a site:

wget -A pdf,jpg,png -m -p -E -k -K -np http://site/path/



Answer 3:


This is most likely because the links in the HTML don't end with /.

Wget will not follow this one, as it thinks it's a file (and it doesn't match your filter):

<a href="link">page</a>

But it will follow this:

<a href="link/">page</a>

You can use the --debug option to see whether this is the actual problem.
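As a lighter-weight check than --debug, you can inspect the page's HTML directly for links missing the trailing slash. This sketch uses a here-document standing in for the site's real index.html and greps out the href values that wget would treat as files:

```shell
# Sample of the two link forms from above; in practice you would feed
# the site's actual index.html instead of this here-document.
cat <<'EOF' > /tmp/index.html
<a href="link">page</a>
<a href="link/">page</a>
EOF

# List hrefs that do NOT end in a slash -- these are the ones wget treats
# as files and drops when they fail the -A filter.
grep -o 'href="[^"]*"' /tmp/index.html | grep -v '/"$'
```

This prints only href="link", the form that wget rejects instead of recursing into.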

I don't know of any good solution for this. In my opinion, it is a bug.



Source: https://stackoverflow.com/questions/18274586/download-all-files-of-a-particular-type-from-a-website-using-wget-stops-in-the-s
