How do I scrape full-sized images from a website?


Question


I am trying to obtain clinical images of psoriasis patients from these two websites for research purposes:

http://www.dermis.net/dermisroot/en/31346/diagnose.htm

http://dermatlas.med.jhmi.edu/derm/

For the first site, I tried just saving the page with Firefox, but it only saved the thumbnails and not the full-sized images. I was able to access the full-sized images using a Firefox add-on called "DownThemAll", but it saved each image as part of a new HTML page, and I do not know of any way to extract just the images.

I also tried logging in to one of my university's Linux machines and using wget to mirror the websites, but I could not get it to work and am still not sure why.

Consequently, I am wondering whether it would be easy to write a short script (or whatever method is easiest) to (a) obtain the full-sized images linked to on the first website, and (b) obtain all full-sized images on the second site with "psoriasis" in the filename.

I have been programming for a couple of years, but have zero experience with web development and would appreciate any advice on how to go about doing this.


Answer 1:


Why not use wget to recursively download images from the domain? Here is an example:

wget -r -P /save/location -A jpeg,jpg,bmp,gif,png http://www.domain.com

The wget manual is here: http://www.gnu.org/software/wget/manual/wget.html
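
If the full-sized images are served from a different host than the HTML pages, which is a common reason a plain recursive mirror only picks up thumbnails, a few extra flags help. The following is only a sketch pointed at the first site; the depth limit, domain restriction, and output directory are assumptions to adjust as needed:

# -r recurses through linked pages; -l 3 limits the recursion depth
# -nd flattens the remote directory tree into a single local folder
# -H allows following links onto other hosts; -D keeps that within dermis.net
# -A keeps only image files and discards the HTML once it has been parsed
# -P chooses where the results are saved
wget -r -l 3 -nd -H -D dermis.net \
     -A jpeg,jpg,bmp,gif,png \
     -P ./dermis-images \
     http://www.dermis.net/dermisroot/en/31346/diagnose.htm

For the second site, note that -A also accepts shell-style wildcard patterns rather than just suffixes, so a filter along the lines of -A "*soriasis*" (a hypothetical pattern, not checked against that site's actual file names) would restrict the download to files whose names contain "psoriasis" or "Psoriasis".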




Answer 2:


Try HTTrack Website Copier - it will download all of the images on a site. You can also try http://htmlparser.sourceforge.net/; it can capture a site along with its resources if you use org.htmlparser.parserapplications.SiteCapturer.
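
If a GUI is not convenient, HTTrack also has a command-line front end. A minimal sketch for the second site (the output directory and the jpg-only filter are assumptions, not something the answer specifies):

# Mirror the site, keeping files that match the +*.jpg scan rule; -O sets the
# output directory and -v prints progress.
httrack "http://dermatlas.med.jhmi.edu/derm/" -O ./dermatlas "+*.jpg" -v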



Source: https://stackoverflow.com/questions/9593873/how-do-i-scrape-full-sized-images-from-a-website
