Question
I would like to download some freely available PDFs (copies of old newspapers) from this website of the Austrian National Library with wget, using the bash script below:
for year in {14..57}; do
    for month in `seq -w 1 12`; do # -w for leading zero
        for day in `seq -w 1 31`; do
            wget -A pdf -nc -E -nd --no-check-certificate --content-disposition http://anno.onb.ac.at/pdfs/ONB_lzg_18$year$month$day.pdf
        done
    done
done
Aside from some newspaper issues not being available, I cannot download any issues, even those that exist. For example, I get errors such as this one for the existing issue of June 30, 1814:
http://anno.onb.ac.at/pdfs/ONB_lzg_18140630.pdf
Auflösen des Hostnamens anno.onb.ac.at (anno.onb.ac.at)... 193.170.112.230
Verbindungsaufbau zu anno.onb.ac.at (anno.onb.ac.at)|193.170.112.230|:80 ... verbunden.
HTTP-Anforderung gesendet, auf Antwort wird gewartet ... 404 Not Found
FEHLER 404: Not Found.
However, if you download the corresponding PDF manually (here, see upper-right corner), you have to press "OK" in a pop-up acknowledgement. Once I have done this, I can download that issue via wget without a problem.
How can I tell wget to confirm the acknowledgement (the prompt that appears when you try to download a PDF; see screenshot below) from the command line? Is there a wget option for that?
Answer 1:
There are two issues in your code.
- The lzg newspaper is not available for all dates.
- The PDFs are not always generated and cached at the URL you used. You need to request the other URL (anno_pdf.pl) first to make sure the PDF is generated.
The updated code below should work:
#!/bin/bash
for year in {14..57}; do
    # Collect the dates that actually have an issue from the year overview page (requires gawk).
    DATES=$(curl -sS "http://anno.onb.ac.at/cgi-content/anno?aid=lzg&datum=18$year&zoom=33" | gawk 'match($0, /datum=([^&]+)/, ary) {print ary[1]}' | xargs echo)
    for date in $DATES
    do
        echo "Downloading for $date"
        # Request the PDF page first so that the server generates and caches the PDF.
        curl "http://anno.onb.ac.at/cgi-content/anno_pdf.pl?aid=lzg&datum=$date" -H 'Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' -H 'DNT: 1' -H "Referer: http://anno.onb.ac.at/cgi-content/anno?aid=lzg&datum=$date" -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: en-US,en;q=0.9' --compressed
        # Fetch the generated PDF from the cache URL.
        wget -A pdf -nc -E -nd --no-check-certificate --content-disposition "http://anno.onb.ac.at/pdfs/ONB_lzg_$date.pdf"
    done
done
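The date-extraction step in the script above relies on gawk's three-argument match(), which is a gawk extension and only captures the first datum= value on each line. If gawk is not installed, something along the following lines might serve as a substitute. This is only a sketch: the function name extract_dates is made up for illustration, it assumes issue dates always appear as 8-digit datum= values in the overview page, and the sample HTML is invented rather than real server output.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original answer): read HTML on stdin and
# print every distinct 8-digit datum= value, one per line, in sorted order.
# Assumption: issue dates look like datum=YYYYMMDD; shorter values such as
# datum=1814 (a whole year) are deliberately ignored.
extract_dates() {
    grep -oE 'datum=[0-9]{8}' | sed 's/^datum=//' | sort -u
}

# Made-up HTML fragment for demonstration only.
sample='<a href="anno?aid=lzg&datum=18140630">30</a> <a href="anno?aid=lzg&datum=18140702">2</a>
<a href="anno?aid=lzg&datum=18140630&zoom=33">30</a>'

printf '%s\n' "$sample" | extract_dates
# prints:
# 18140630
# 18140702
```

In the script, the DATES assignment could then pipe the curl output through extract_dates instead of gawk and xargs; because the values contain no whitespace other than newlines, the unquoted for date in $DATES loop would still iterate over them one by one.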
Source: https://stackoverflow.com/questions/50100023/popups-block-bulk-download-of-pdfs-from-website-with-wget