Question
I am trying to extract images from a webpage using Python 2.7, Selenium (with Firefox), and Beautiful Soup.
The page loads dynamically, so I'm using Selenium for screen scraping. Everything looks fine in the browser, but once all the images have loaded and I look at the HTML, I can't see the links to the images. Any ideas what could be going on here?
The site is https://flipp.com/flyers?postal_code=97035 ; from there, navigate to https://flipp.com/weekly_ad/1550082-big-5-sporting-goods-weekly-ad to see the first weekly ad (my working code is below).
Stranger still, I can see that the images ARE loading in the inspector window, but I still can't see them in the HTML. Any idea what's going on here, and how to grab the updated HTML after the images load?
Here is the set of images I am able to pull from the HTML (by appending jpg). These are just for the popup windows that appear when you hover over the canvas.
What I am actually trying to get are the images that make up the pages/canvas themselves. I can see them come through (using the traffic option in Firefox), but for some reason they do not appear in the HTML. Any idea what's going on here?
Working code:
# import packages
from time import gmtime, strftime, sleep, time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy, ProxyType
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
# scraping packages
from bs4 import BeautifulSoup

USAPROXY = "177.84.23.122:3128"

def launch_webdriver(PROXY):
    # split "host:port" into its two parts
    PROXY_HOST = PROXY.rpartition(':')[0]
    PROXY_PORT = PROXY.rpartition(':')[2]
    fp = webdriver.FirefoxProfile()
    # Direct = 0, Manual = 1, PAC = 2, AUTODETECT = 4, SYSTEM = 5
    fp.set_preference("network.proxy.type", 1)
    fp.set_preference("network.proxy.http", PROXY_HOST)
    fp.set_preference("network.proxy.http_port", int(PROXY_PORT))
    fp.set_preference("network.proxy.ssl", PROXY_HOST)
    fp.set_preference("network.proxy.ssl_port", int(PROXY_PORT))
    fp.set_preference("general.useragent.override", "whater_useragent")
    fp.update_preferences()
    return webdriver.Firefox(firefox_profile=fp)

def test():
    driver = launch_webdriver(USAPROXY)
    driver.set_page_load_timeout(11)
    driver.get("https://flipp.com/flyers?postal_code=97035")
    sleep(15)
    driver.get("https://flipp.com/weekly_ad/1550082-big-5-sporting-goods-weekly-ad")
    sleep(5)
    my_html = driver.page_source
    soup = BeautifulSoup(my_html, 'lxml')
    tags = soup.findAll('img')  # prints only 3 imgs, there should be 100s
    for tag in tags:
        print(tag)
    print(soup.prettify())

# execute script
test()
Answer 1:
I made a small change to your code, replacing soup = BeautifulSoup(my_html,'lxml') with soup = BeautifulSoup(my_html,'html.parser'), as follows:
Code:
driver.set_page_load_timeout(11)
driver.get("https://flipp.com/flyers?postal_code=97035")
sleep(15)
driver.get("https://flipp.com/weekly_ad/1550082-big-5-sporting-goods-weekly-ad")
sleep(5)
my_html = driver.page_source
soup = BeautifulSoup(my_html, 'html.parser')
tags = soup.findAll('img')
for tag in tags:
    print(tag)
Output:
<img alt="" src="/94815ec0/images/page-favourites.svg"/>
<img alt="" src="/94815ec0/images/page-flyers.svg"/>
<img alt="" src="/94815ec0/images/page-coupons.svg"/>
<img alt="" src="/94815ec0/images/profile.png"/>
<img alt="" src="/94815ec0/images/signin-google-en.png"/>
<img alt="" src="/94815ec0/images/signin-facebook-en.png"/>
<img alt="" class="sl-icon" src="/94815ec0/images/sl/list-icon.svg"/>
<img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/2143/1399408035/large" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://images.wishabi.net/merchants/2143/1399408035/large");'/>
<img src="/94815ec0/images/location.svg"/>
<img alt="" class="flyer-thumbnail" cover="true" fit="" href="https://f.wishabi.net/flyers/1568365/web_premium/1519664612.jpg" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://f.wishabi.net/flyers/1568365/web_premium/1519664612.jpg");'/>
<img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/1417562816/1417562816/large" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://images.wishabi.net/merchants/1417562816/1417562816/large");'/>
<img alt="" class="flyer-thumbnail" cover="true" fit="" href="https://f.wishabi.net/flyers/1570217/web_premium/1519767026.jpg" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://f.wishabi.net/flyers/1570217/web_premium/1519767026.jpg");'/>
<img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/2217/1399408048/large" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://images.wishabi.net/merchants/2217/1399408048/large");'/>
<img alt="" class="flyer-thumbnail" cover="true" fit="" href="https://f.wishabi.net/flyers/1548763/web_premium/1519408077.jpg" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://f.wishabi.net/flyers/1548763/web_premium/1519408077.jpg");'/>
<img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/2392/1412008375/large" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://images.wishabi.net/merchants/2392/1412008375/large");'/>
<img alt="" class="flyer-thumbnail" cover="true" fit="" href="https://f.wishabi.net/flyers/1558209/web_premium/1519940192.jpg" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://f.wishabi.net/flyers/1558209/web_premium/1519940192.jpg");'/>
<img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/2175/1399558010/large" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://images.wishabi.net/merchants/2175/1399558010/large");'/>
<img alt="" class="flyer-thumbnail" cover="true" fit="" href="https://f.wishabi.net/flyers/1553653/web_premium/1519086192.jpg" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://f.wishabi.net/flyers/1553653/web_premium/1519086192.jpg");'/>
<img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/1415661435/1415661435/large" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://images.wishabi.net/merchants/1415661435/1415661435/large");'/>
<img alt="" src="/94815ec0/images/email_notices.png"/>
<img alt="Flipp logo"/>
<img alt="image of a sad ice cream" class="sad-cream"/>
<img alt="Google Chrome Logo" class="browser-img chrome"/>
<img alt="Mozilla Firefox Logo" class="browser-img ff"/>
<img alt="Microsoft Edge Logo" class="browser-img edge"/>
<img alt="Apple Safari Logo" class="browser-img safari"/>
<img alt="" height="0" id="batBeacon0.08041384361820791" src="https://bat.bing.com/action/0?ti=5463843&Ver=2&mid=e698c347-3982-6279-c6a5-5e5b764b55dd&evt=pageLoad&sid=ab7428c0-1&lt=1647&pi=0&lg=en-US&sw=1366&sh=768&sc=24&tl=Big%205%20Sporting%20Goods%20Weekly%20Ad%20for%20Lake%20Oswego%20this%20week%20(Feb%2025,%202018%20-%20Mar%203,%202018)%20-%20Flipp&kw=flyers,%20coupons,%20shopping%20list,%20deals,%20circulaires,%20coupons,%20liste%20d%E2%80%99achats,%20offres&p=https%3A%2F%2Fflipp.com%2Fweekly_ad%2F1550082-big-5-sporting-goods-weekly-ad&r=&msclkid=N&rn=558478" style="width:0px; height:0px; display:none; visibility:hidden;" width="0"/>
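In the output above, the lazy-loaded images (is="flipp-lazy-image") carry only a placeholder data: URI in src, while the real URL sits in href. A minimal sketch of pulling out the usable URLs follows. It is written for Python 3 and uses only the standard library so it runs without extra packages; the names LazyImageCollector and extract_image_urls are hypothetical, and the href-over-src preference is an assumption based on this output only.

```python
from html.parser import HTMLParser


class LazyImageCollector(HTMLParser):
    """Collect one URL per <img>, preferring the lazy-load href attribute."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Assumption: lazy images keep the real URL in `href`, and `src`
        # holds only a placeholder encoded as a data: URI.
        url = attributes.get("href") or attributes.get("src")
        if url and not url.startswith("data:"):
            self.urls.append(url)


def extract_image_urls(html):
    """Return the image URLs found in `html`, in document order."""
    collector = LazyImageCollector()
    collector.feed(html)
    return collector.urls
```

With Selenium this could be called as extract_image_urls(driver.page_source). It only sees images that exist as img elements in the DOM, so it would not recover anything the page draws onto a canvas.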
Answer 2:
The reason you don't see the updated HTML in my_html = driver.page_source is that page_source grabs the HTML before your page has finished loading dynamically. Try this instead to get the HTML after the page has loaded:
my_html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
# or
my_html = driver.find_element_by_tag_name('html').get_attribute('innerHTML')
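If the concern is timing, one option is to replace the fixed sleep() calls with a polling wait that reads the DOM only once enough images are present. The sketch below is an illustration, not something from the original answer: the function name wait_for_rendered_html and the min_images threshold are hypothetical, and the only driver method it relies on is execute_script. It is written for Python 3.

```python
import time


def wait_for_rendered_html(driver, min_images=10, timeout=30.0, poll=0.5):
    """Poll until the DOM holds at least `min_images` <img> elements,
    then return the serialized DOM. Raises RuntimeError on timeout."""
    deadline = time.time() + timeout
    while True:
        # document.images is the live collection of <img> elements.
        count = driver.execute_script("return document.images.length;")
        if count >= min_images:
            return driver.execute_script(
                "return document.documentElement.outerHTML;"
            )
        if time.time() >= deadline:
            raise RuntimeError(
                "only %d <img> elements after %.1f s" % (count, timeout)
            )
        time.sleep(poll)
```

This only helps for images that eventually appear as img elements. Images that are fetched by script and never inserted into the DOM will not be counted, however long the wait.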
EDIT:
Okay, I think I came up with what you are looking for. I found a way to access the network resources and get the performance data that the browser is logging. Call this function, passing it the driver once it has loaded the page you want, and it should return the images in the format you're looking for:
def getNetworkImages(driver):
    ImageList = []
    # resource-timing entries logged by the browser for the current page
    Resources = driver.execute_script("return window.performance.getEntriesByType('resource');")
    for resource in Resources:
        if resource['initiatorType'] == 'img':
            ImageList.append(resource['name'])
    for image in ImageList:
        print(image)
    return ImageList
Note: This was tested with Chrome 64 and Chromedriver 2.35.
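The function above keeps only entries whose initiatorType is 'img'. If some page images are requested another way (for example by script), a variant that also matches on file extension and removes duplicates might catch more of them. This is a sketch under assumptions: filter_image_urls and IMAGE_EXTENSIONS are hypothetical names, the extension list is a guess, and the code targets Python 3.

```python
from urllib.parse import urlparse

# Assumed set of extensions; adjust to what the site actually serves.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg")


def filter_image_urls(resources, extensions=IMAGE_EXTENSIONS):
    """Return unique image URLs from resource-timing entries, in order."""
    seen = set()
    urls = []
    for entry in resources:
        url = entry.get("name", "")
        # Compare against the path only, so query strings do not interfere.
        path = urlparse(url).path.lower()
        is_image = entry.get("initiatorType") == "img" or path.endswith(extensions)
        if is_image and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

It takes the same list that getNetworkImages reads, so it can be called as filter_image_urls(driver.execute_script("return window.performance.getEntriesByType('resource');")).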
Source: https://stackoverflow.com/questions/49079184/python-selenium-firefox-webdriver-pulling-out-images-out-of-a-website