Need help scraping images from a slideshow with bs4 & python

眉间皱痕 submitted on 2021-01-29 10:36:48

Question


I'm trying to scrape listing information from Craigslist, but I can't seem to get the images since they are in a slideshow.

import requests
from bs4 import BeautifulSoup as soup

url = "https://newyork.craigslist.org/search/sss"
r = requests.get(url)
souped = soup(r.content, 'lxml')

Since the images aren't even in the HTML returned by the request, do I need to load the page dynamically somehow? If so, can I keep it in Python only? I don't want any other dependencies. Thanks in advance; I'm pretty new to this, so any help is appreciated.


Answer 1:


Look for the a tags with the classes result-image gallery. Each of those tags has a data-ids attribute, which holds part of the names of the image files.

<a href="https://newyork.craigslist.org/mnh/fuo/d/new-york-city-3-piece-shaped-ikea-couch/6812749499.html" class="result-image gallery" data-ids="1:00707_iRUU5VKwkWi,1:00H0H_6AIBqK2iQDU">
           ....
</a>
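
As a rough sketch of that lookup, the snippet below runs BeautifulSoup over the sample tag shown above instead of a live request, so it says nothing about what Craigslist serves today. The CSS selector and the variable names (html, parsed, gallery_links, data_ids) are illustrative choices, and html.parser is used here only to avoid needing lxml.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the tag shown above; its body is elided there,
# so the anchor is left empty here.
html = '''
<a href="https://newyork.craigslist.org/mnh/fuo/d/new-york-city-3-piece-shaped-ikea-couch/6812749499.html"
   class="result-image gallery"
   data-ids="1:00707_iRUU5VKwkWi,1:00H0H_6AIBqK2iQDU"></a>
'''

parsed = BeautifulSoup(html, "html.parser")

# Select <a> tags that carry both classes and have a data-ids attribute.
gallery_links = parsed.select("a.result-image.gallery[data-ids]")

# Collect the raw attribute value of each matching tag.
data_ids = [a["data-ids"] for a in gallery_links]
print(data_ids)
```

With the question's code, the same select call could be applied to the souped object, assuming the live page still uses this markup.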

Now, if you want to get the URLs, first read that attribute and parse out the partial image names (in this example, 00707_iRUU5VKwkWi and 00H0H_6AIBqK2iQDU).

You can then build the URLs from the host, the partial name, the size suffix (_300x300), and the extension:

https://images.craigslist.org/00707_iRUU5VKwkWi_300x300.jpg
https://images.craigslist.org/00H0H_6AIBqK2iQDU_300x300.jpg
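
The parsing and URL-building steps could be sketched as a small pure-Python function. The name build_image_urls and its size parameter are hypothetical; only the _300x300 suffix and the .jpg extension come from the example above, and the code assumes every entry in data-ids has the form prefix:name, as in the sample.

```python
def build_image_urls(data_ids, size="300x300"):
    """Turn a data-ids value into image URLs.

    Assumes entries are comma-separated and shaped like "1:00707_iRUU5VKwkWi".
    Only the "300x300" size is confirmed by the example; other sizes are untested.
    """
    urls = []
    for item in data_ids.split(","):
        # Keep the part after the first colon (the partial image name).
        image_id = item.split(":", 1)[-1].strip()
        if image_id:  # skip empty entries, e.g. from an empty attribute
            urls.append(f"https://images.craigslist.org/{image_id}_{size}.jpg")
    return urls


urls = build_image_urls("1:00707_iRUU5VKwkWi,1:00H0H_6AIBqK2iQDU")
for url in urls:
    print(url)
```

Passing the data-ids value of each result-image gallery tag to this function should give the image URLs for that listing, as long as the site keeps this naming scheme.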


Source: https://stackoverflow.com/questions/54554056/need-help-scraping-images-from-a-slideshow-with-bs4-python
