BeautifulSoup returns incomplete HTML

梦想的初衷 submitted on 2020-01-23 17:08:39

Question


I am reading a book about Python right now. There is a small project for homework: "Write a program that goes to a photo-sharing site like Flickr or Imgur, searches for a category of photos, and then downloads all the resulting images." It is suggested to use only webbrowser, requests and bs4 libraries.

I cannot do it for Flickr. I found that the parser cannot go inside the element (div class="interaction-view"). Using "Inspect element" in Chrome I can see that there are a few "div" elements and an "a" element inside it. However, when I use the bs4 library, it cannot see them.

My code looks like this:

#!/usr/bin/env python3
# To download photos from Flickr

import requests, bs4

search_name = "spam"
website_name = requests.get('https://www.flickr.com/search/?text='
                       + search_name)
website_name.raise_for_status()
parse_obj = bs4.BeautifulSoup(website_name.text, "html.parser")
elements = parse_obj.select('body #content main .main.search-photos-results \
                .view.photo-list-view.requiredToShowOnServer \
                .view.photo-list-photo-view.requiredToShowOnServer.awake \
                .interaction-view')
print(elements)

It only prints:

[<div class="interaction-view"></div>, <div class="interaction-view"></div>...]

There are no nested elements, and I do not understand why. Thank you!


Answer 1:


The issue is that the content of <div class="interaction-view"></div> on Flickr is only loaded via JavaScript. You can check this by viewing the page source, where you'll find <div class="interaction-view"></div> with no content inside the div tag. requests downloads exactly that static source, so BeautifulSoup has nothing nested to parse.
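
The same check can be done in code. This is a minimal sketch: the HTML string is a made-up stand-in for the static source that requests receives, not Flickr's actual markup, and the variable names are only illustrative.

```python
import bs4

# Hypothetical stand-in for the static HTML returned by requests;
# the real page is much larger, but the div is empty in the same way.
static_html = """
<div class="view photo-list-photo-view">
  <div class="interaction-view"></div>
</div>
"""

soup = bs4.BeautifulSoup(static_html, "html.parser")
divs = soup.select(".interaction-view")

# An empty div has no child nodes, so bs4 has nothing to descend into.
empty = [len(div.contents) == 0 for div in divs]
print(empty)
```

If every entry is True for the real page as well, the nested elements seen in "Inspect element" were added by the browser after the page loaded.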

You need to execute the JavaScript somehow. BeautifulSoup only parses markup and cannot do this, so one solution is to use selenium. Run pip install selenium and install geckodriver for Firefox (on OSX: brew install geckodriver). Then change your code to load the page with selenium:

#!/usr/bin/env python3

import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

search_name = "spam"
url = 'https://www.flickr.com/search/?text=%s' % search_name

# Load the page in a real browser so its JavaScript runs
browser = webdriver.Firefox()
browser.get(url)
delay = 3  # seconds to wait before giving up
# presence_of_element_located takes a (By, value) locator tuple, not an element
WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, '...')))

soup = bs4.BeautifulSoup(browser.page_source, "html.parser")


elements = soup.select('body #content main .main.search-photos-results \
                .view.photo-list-view.requiredToShowOnServer \
                .view.photo-list-photo-view.requiredToShowOnServer.awake \
                .interaction-view')
print(elements)

The WebDriverWait part is needed so that selenium holds off until a certain element has loaded, and the page is only parsed afterwards. You need to change ... to an id you know will be present. See this answer to check how it can be done with classes.



Source: https://stackoverflow.com/questions/41706274/beautifulsoup-returns-incomplete-html
