Why does wget only download the index.html for some websites?

陌清茗 2020-12-12 13:29

I'm trying to use the wget command:

wget -p http://www.example.com 

to fetch all the files on the main page. For some websites it works but i

8 answers
  •  青春惊慌失措
    2020-12-12 14:03

    I had the same problem downloading files from the CFSv2 model. I solved it by combining the answers above and adding the parameter --no-check-certificate:

    wget -nH --cut-dirs=2 -p -e robots=off --random-wait -c -r -l 1 -A "flxf*.grb2" -U Mozilla --no-check-certificate https://nomads.ncdc.noaa.gov/modeldata/cfsv2_forecast_6-hourly_9mon_flxf/2018/201801/20180101/2018010100/

    Here is a brief explanation of each parameter used. For more detail, see the GNU wget 1.2 Manual.

    • -nH equivalent to --no-host-directories: Disable generation of host-prefixed directories. In this case, it avoids creating the directory ./nomads.ncdc.noaa.gov/

    • --cut-dirs=number: Ignore number directory components. In this case, --cut-dirs=2 avoids creating the directories ./modeldata/cfsv2_forecast_6-hourly_9mon_flxf/

    • -p equivalent to --page-requisites: This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.

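    • -e robots=off: Execute robots=off as if it were part of .wgetrc, so that Wget ignores the exclusion rules in robots.txt and does not download the robots.txt file.

    • --random-wait: Causes the time between requests to vary between 0.5 and 1.5 * wait seconds, where wait was specified using the --wait option.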

    • -c equivalent to --continue: continue getting a partially-downloaded file.

    • -r equivalent to --recursive: Turn on recursive retrieving. The default maximum depth is 5

    • -l equivalent to --level=depth: Specify the maximum recursion depth. Here, -l 1 follows links only one level down from the starting URL.

    • -A equivalent to --accept acclist: Specify a comma-separated list of file name suffixes or patterns to accept.

    • -U equivalent to --user-agent=agent-string: The HTTP protocol allows clients to identify themselves using a User-Agent header field. This enables distinguishing the WWW software, usually for statistical purposes or for tracing of protocol violations. Wget normally identifies as ‘Wget/version’, version being the current version number of Wget.

    • --no-check-certificate: Don't check the server certificate against the available certificate authorities.
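
To make the effect of -nH and --cut-dirs more concrete, here is a small shell sketch that approximates the local path Wget would pick for a recursively downloaded URL. The function name wget_local_path and the file name flxf_example.grb2 are made up for illustration. It assumes the URL has a scheme and at least one path component, and it ignores details that real Wget handles, such as query strings and character escaping.

```shell
#!/bin/sh
# Illustrative sketch only: mimic how -nH and --cut-dirs=N shape the
# local path during a recursive download. Not part of Wget itself.
wget_local_path() {
    url=$1        # full URL, e.g. https://host/a/b/file
    no_host=$2    # 1 to mimic -nH, 0 to keep the host directory
    cut_dirs=$3   # the number given to --cut-dirs

    rest=${url#*://}    # drop the scheme (https://)
    host=${rest%%/*}    # host name, up to the first slash
    path=${rest#*/}     # directories plus file name

    # Drop up to N leading directory components, never the file name.
    i=0
    while [ "$i" -lt "$cut_dirs" ]; do
        case $path in
            */*) path=${path#*/} ;;
            *)   break ;;
        esac
        i=$((i + 1))
    done

    if [ "$no_host" -eq 1 ]; then
        printf '%s\n' "$path"
    else
        printf '%s\n' "$host/$path"
    fi
}

# flxf_example.grb2 is a hypothetical file name.
url="https://nomads.ncdc.noaa.gov/modeldata/cfsv2_forecast_6-hourly_9mon_flxf/2018/201801/20180101/2018010100/flxf_example.grb2"

wget_local_path "$url" 0 0   # host directory and full path kept
wget_local_path "$url" 1 2   # prints 2018/201801/20180101/2018010100/flxf_example.grb2
```

With neither option, the file lands under nomads.ncdc.noaa.gov/modeldata/...; with -nH --cut-dirs=2, as in the command above, only the date directories remain in front of the file name.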
