Python + Selenium firefox webdriver - pulling out images out of a website

Submitted by 自古美人都是妖i on 2020-01-24 15:54:06

Question


I am trying to pull images from a webpage using Python 2.7, Selenium (with Firefox), and Beautiful Soup.

The page loads dynamically, so I'm using Selenium to scrape it. Everything looks fine in the browser, but once all the images have loaded and I look at the HTML, I can't see the links to the images. Any ideas what could be going on here?

The site is https://flipp.com/flyers?postal_code=97035 , and from there, navigate to https://flipp.com/weekly_ad/1550082-big-5-sporting-goods-weekly-ad to see the first weekly ad (my working code is below).

Stranger still, I can see that the images ARE loading in the inspector window, but I still can't see them in the HTML. Any idea what's going on here, and how to grab the updated HTML after the images load?

The only images I am able to pull from the HTML (by appending .jpg) are the ones for the popup windows that appear when you hover over the canvas.

What I am actually trying to get are the images that make up the pages/canvas themselves. I can see them come through (using the traffic option in Firefox), but for some reason they do not appear in the HTML. Any idea what's going on here?

Working code:

#import packages
from time import gmtime, strftime,sleep, time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy, ProxyType
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
#scraping packages
from bs4 import BeautifulSoup


USAPROXY = "177.84.23.122:3128"
def launch_webdriver(PROXY):
    PROXY = PROXY
    PROXY_HOST = PROXY.rpartition(':')[0]
    PROXY_PORT = PROXY.rpartition(':')[2]
    fp = webdriver.FirefoxProfile()
    # Direct = 0, Manual = 1, PAC = 2, AUTODETECT = 4, SYSTEM = 5
    fp.set_preference("network.proxy.type", 1)
    fp.set_preference("network.proxy.http",PROXY_HOST)
    fp.set_preference("network.proxy.http_port",int(PROXY_PORT))
    fp.set_preference("network.proxy.ssl",PROXY_HOST)
    fp.set_preference("network.proxy.ssl_port",int(PROXY_PORT))
    fp.set_preference("general.useragent.override","whater_useragent")    
    fp.update_preferences()
    return webdriver.Firefox(firefox_profile=fp)




def test():
    driver = launch_webdriver(USAPROXY)
    driver.set_page_load_timeout(11)
    driver.get("https://flipp.com/flyers?postal_code=97035")
    sleep(15)
    driver.get("https://flipp.com/weekly_ad/1550082-big-5-sporting-goods-weekly-ad")
    sleep(5)
    my_html = driver.page_source
    soup = BeautifulSoup(my_html,'lxml')
    tags = soup.findAll('img')  # finds only 3 <img> tags; there should be hundreds
    for tag in tags:
        print(tag)
    print(soup.prettify())
#execute script
test()

Answer 1:


I made a small change to your code, replacing soup = BeautifulSoup(my_html,'lxml') with soup = BeautifulSoup(my_html,'html.parser'), as follows:

  • Code :

    driver.set_page_load_timeout(11)
    driver.get("https://flipp.com/flyers?postal_code=97035")
    sleep(15)
    driver.get("https://flipp.com/weekly_ad/1550082-big-5-sporting-goods-weekly-ad")
    sleep(5)
    my_html = driver.page_source
    soup = BeautifulSoup(my_html,'html.parser')
    tags=soup.findAll('img')
    for tag in tags:print (tag)
    
  • Output :

    <img alt="" src="/94815ec0/images/page-favourites.svg"/>
    <img alt="" src="/94815ec0/images/page-flyers.svg"/>
    <img alt="" src="/94815ec0/images/page-coupons.svg"/>
    <img alt="" src="/94815ec0/images/profile.png"/>
    <img alt="" src="/94815ec0/images/signin-google-en.png"/>
    <img alt="" src="/94815ec0/images/signin-facebook-en.png"/>
    <img alt="" class="sl-icon" src="/94815ec0/images/sl/list-icon.svg"/>
    <img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/2143/1399408035/large" is="flipp-lazy-image" src="" style='background-image: url("https://images.wishabi.net/merchants/2143/1399408035/large");'/>
    <img src="/94815ec0/images/location.svg"/>
    <img alt="" class="flyer-thumbnail" cover="true" fit="" href="https://f.wishabi.net/flyers/1568365/web_premium/1519664612.jpg" is="flipp-lazy-image" src="" style='background-image: url("https://f.wishabi.net/flyers/1568365/web_premium/1519664612.jpg");'/>
    <img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/1417562816/1417562816/large" is="flipp-lazy-image" src="" style='background-image: url("https://images.wishabi.net/merchants/1417562816/1417562816/large");'/>
    <img alt="" class="flyer-thumbnail" cover="true" fit="" href="https://f.wishabi.net/flyers/1570217/web_premium/1519767026.jpg" is="flipp-lazy-image" src="" style='background-image: url("https://f.wishabi.net/flyers/1570217/web_premium/1519767026.jpg");'/>
    <img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/2217/1399408048/large" is="flipp-lazy-image" src="" style='background-image: url("https://images.wishabi.net/merchants/2217/1399408048/large");'/>
    <img alt="" class="flyer-thumbnail" cover="true" fit="" href="https://f.wishabi.net/flyers/1548763/web_premium/1519408077.jpg" is="flipp-lazy-image" src="" style='background-image: url("https://f.wishabi.net/flyers/1548763/web_premium/1519408077.jpg");'/>
    <img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/2392/1412008375/large" is="flipp-lazy-image" src="" style='background-image: url("https://images.wishabi.net/merchants/2392/1412008375/large");'/>
    <img alt="" class="flyer-thumbnail" cover="true" fit="" href="https://f.wishabi.net/flyers/1558209/web_premium/1519940192.jpg" is="flipp-lazy-image" src="" style='background-image: url("https://f.wishabi.net/flyers/1558209/web_premium/1519940192.jpg");'/>
    <img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/2175/1399558010/large" is="flipp-lazy-image" src="" style='background-image: url("https://images.wishabi.net/merchants/2175/1399558010/large");'/>
    <img alt="" class="flyer-thumbnail" cover="true" fit="" href="https://f.wishabi.net/flyers/1553653/web_premium/1519086192.jpg" is="flipp-lazy-image" src="" style='background-image: url("https://f.wishabi.net/flyers/1553653/web_premium/1519086192.jpg");'/>
    <img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/1415661435/1415661435/large" is="flipp-lazy-image" src="" style='background-image: url("https://images.wishabi.net/merchants/1415661435/1415661435/large");'/>
    <img alt="" src="/94815ec0/images/email_notices.png"/>
    <img alt="Flipp logo"/>
    <img alt="image of a sad ice cream" class="sad-cream"/>
    <img alt="Google Chrome Logo" class="browser-img chrome"/>
    <img alt="Mozilla Firefox Logo" class="browser-img ff"/>
    <img alt="Microsoft Edge Logo" class="browser-img edge"/>
    <img alt="Apple Safari Logo" class="browser-img safari"/>
    <img alt="" height="0" id="batBeacon0.08041384361820791" src="https://bat.bing.com/action/0?ti=5463843&amp;Ver=2&amp;mid=e698c347-3982-6279-c6a5-5e5b764b55dd&amp;evt=pageLoad&amp;sid=ab7428c0-1&amp;lt=1647&amp;pi=0&amp;lg=en-US&amp;sw=1366&amp;sh=768&amp;sc=24&amp;tl=Big%205%20Sporting%20Goods%20Weekly%20Ad%20for%20Lake%20Oswego%20this%20week%20(Feb%2025,%202018%20-%20Mar%203,%202018)%20-%20Flipp&amp;kw=flyers,%20coupons,%20shopping%20list,%20deals,%20circulaires,%20coupons,%20liste%20d%E2%80%99achats,%20offres&amp;p=https%3A%2F%2Fflipp.com%2Fweekly_ad%2F1550082-big-5-sporting-goods-weekly-ad&amp;r=&amp;msclkid=N&amp;rn=558478" style="width:0px; height:0px; display:none; visibility:hidden;" width="0"/>
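In the output above, the lazy-loaded tags (is="flipp-lazy-image") leave src empty and carry the real URL in href and in the style attribute. As a rough sketch of how those URLs might be collected, the snippet below uses only the standard library's html.parser. It assumes Python 3, and the names ImageURLCollector and extract_image_urls are made up for this illustration. Note that it only recovers the thumbnails and logos visible in this output, not the canvas page images the question asks about.

```python
import re
from html.parser import HTMLParser


class ImageURLCollector(HTMLParser):
    """Collect candidate image URLs from <img> tags.

    Checks src first, then href, then a url(...) inside style,
    because the lazy-loaded tags leave src empty.
    """

    _css_url = re.compile(r'url\(\s*["\']?([^"\')]+)["\']?\s*\)')

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img .../> tags are also routed here by HTMLParser.
        if tag != "img":
            return
        attr = dict(attrs)
        url = attr.get("src") or attr.get("href")
        if not url:
            match = self._css_url.search(attr.get("style") or "")
            url = match.group(1) if match else None
        if url and url not in self.urls:
            self.urls.append(url)


def extract_image_urls(html):
    """Return the unique image URLs found in an HTML string, in document order."""
    parser = ImageURLCollector()
    parser.feed(html)
    return parser.urls
```

With Selenium this could be called as extract_image_urls(driver.page_source). Relative paths such as /94815ec0/images/profile.png would still need to be joined with the site's base URL before downloading.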
    



Answer 2:


You don't see the updated HTML in my_html = driver.page_source because page_source grabs the HTML before your page has finished loading dynamically. Try one of these instead to get the HTML after the page has loaded:

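my_html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
# or (note: find_element_by_tag_name was removed in Selenium 4;
# there, use driver.find_element(By.TAG_NAME, 'html') instead)
my_html = driver.find_element_by_tag_name('html').get_attribute('innerHTML')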
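Whichever call is used, the HTML is only as complete as the page is at that moment, and the fixed sleep() calls in the question are a guess at how long loading takes. One possible alternative, sketched below under the assumption of Python 3, is to poll some cheap measure of the page until it stops changing. The helper name wait_until_stable and its parameters are hypothetical and not part of Selenium.

```python
import time


def wait_until_stable(probe, timeout=30.0, poll=0.5, settle=3):
    """Poll probe() until it returns the same value `settle` times in a row.

    Returns the last value seen. If `timeout` seconds pass first,
    the most recent value is returned anyway.
    """
    deadline = time.time() + timeout
    last = probe()
    repeats = 1
    while repeats < settle and time.time() < deadline:
        time.sleep(poll)
        current = probe()
        if current == last:
            repeats += 1
        else:
            # The value changed, so start counting again.
            last, repeats = current, 1
    return last


# Possible use with a Selenium driver (not executed here):
#   wait_until_stable(lambda: driver.execute_script("return document.images.length"))
#   my_html = driver.execute_script("return document.documentElement.innerHTML")
```

The image count is just one candidate probe. A page that keeps adding tracking pixels may never settle, which is why the timeout returns the latest value instead of raising.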

EDIT:

Okay, I think I found what you are looking for: a way to access the network resources through the performance data that the browser logs. Call this function with the driver once it has loaded the page you want, and it should return the images in the format you're looking for:

def getNetworkImages(driver):
    ImageList = []
    # Ask the browser for its resource timing entries (one per fetched resource).
    Resources = driver.execute_script("return window.performance.getEntriesByType('resource');")
    for resource in Resources:
        # Keep only entries reported with initiatorType 'img'.
        if resource['initiatorType'] == 'img':
            ImageList.append(resource['name'])
    for image in ImageList:
        print(image)
    return ImageList

Note: This was tested with Chrome 64 and Chromedriver 2.35.
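If the page fetches some of its images through script, those entries would be reported with an initiatorType such as 'xmlhttprequest' or 'fetch' and the check above would skip them. Whether flipp.com does this is an assumption, not something confirmed in this thread. The variant below is a minimal sketch (Python 3) that also accepts entries whose URL path ends in a common image extension and drops duplicates. The names IMAGE_EXTENSIONS, filter_image_entries and get_network_image_urls are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical list of extensions treated as images; adjust as needed.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg")


def filter_image_entries(entries):
    """Return the unique URLs of resource entries that look like images."""
    urls = []
    for entry in entries:
        url = entry.get("name", "")
        # Compare against the path only, so query strings do not matter.
        path = urlparse(url).path.lower()
        is_image = (entry.get("initiatorType") == "img"
                    or path.endswith(IMAGE_EXTENSIONS))
        if is_image and url not in urls:
            urls.append(url)
    return urls


def get_network_image_urls(driver):
    """Read the browser's resource timing entries and keep the image URLs."""
    entries = driver.execute_script(
        "return window.performance.getEntriesByType('resource');"
    )
    return filter_image_entries(entries)
```

Keeping the filter separate from the driver call means it can be tried on a saved list of entries without launching a browser.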



Source: https://stackoverflow.com/questions/49079184/python-selenium-firefox-webdriver-pulling-out-images-out-of-a-website
