How do I scrape full-sized images from a website?


Question


I am trying to obtain clinical images of psoriasis patients from these two websites for research purposes:

http://www.dermis.net/dermisroot/en/31346/diagnose.htm

http://dermatlas.med.jhmi.edu/derm/

For the first site, I tried just saving the page with Firefox, but it only saved the thumbnails and not the full-sized images. I was able to access the full-sized images using a Firefox add-on called "DownThemAll", but it saved each image as part of a new HTML page, and I do not know of any way to extract just the images.

I also tried logging in to one of my university's Linux machines and using wget to mirror the websites, but I could not get it to work and am still not sure why.

Consequently, I am wondering whether it would be easy to write a short script (or whatever method is easiest) to (a) obtain the full-sized images linked to on the first website, and (b) obtain all full-sized images on the second site with "psoriasis" in the filename.

I have been programming for a couple of years, but have zero experience with web development and would appreciate any advice on how to go about doing this.


Answer 1:


Why not use wget to recursively download images from the domain? Here is an example:

wget -r -P /save/location -A jpeg,jpg,bmp,gif,png http://www.domain.com

The wget manual is here: http://www.gnu.org/software/wget/manual/wget.html
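
If the full-sized images are served from a different host than the HTML pages, which is a common reason a plain recursive mirror only picks up thumbnails, a few extra flags help. The following is only a sketch pointed at the first site; the depth limit, domain restriction, and output directory are assumptions to adjust as needed:

# -r recurses through linked pages; -l 3 limits the recursion depth
# -nd flattens the remote directory tree into a single local folder
# -H allows following links onto other hosts; -D keeps that within dermis.net
# -A keeps only image files and discards the HTML once it has been parsed
# -P chooses where the results are saved
wget -r -l 3 -nd -H -D dermis.net \
     -A jpeg,jpg,bmp,gif,png \
     -P ./dermis-images \
     http://www.dermis.net/dermisroot/en/31346/diagnose.htm

For the second site, note that -A also accepts shell-style wildcard patterns rather than just suffixes, so a filter along the lines of -A "*soriasis*" (a hypothetical pattern, not checked against that site's actual file names) would restrict the download to files whose names contain "psoriasis" or "Psoriasis".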




Answer 2:


Try HTTrack Website Copier - it will download all of the images on a site. You can also try http://htmlparser.sourceforge.net/; it can capture a site along with its resources if you use org.htmlparser.parserapplications.SiteCapturer.
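
If a GUI is not convenient, HTTrack also has a command-line front end. A minimal sketch for the second site (the output directory and the jpg-only filter are assumptions, not something the answer specifies):

# Mirror the site, keeping files that match the +*.jpg scan rule; -O sets the
# output directory and -v prints progress.
httrack "http://dermatlas.med.jhmi.edu/derm/" -O ./dermatlas "+*.jpg" -v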



Source: https://stackoverflow.com/questions/9593873/how-do-i-scrape-full-sized-images-from-a-website
