Getting HTML source when some HTML is generated by JavaScript

Anonymous (unverified), submitted 2019-12-03 10:24:21

Question:

I am attempting to get the source code of a webpage, including HTML that is generated by JavaScript. My current code is as follows:

    from selenium import webdriver
    from bs4 import BeautifulSoup

    case_url = "http://na.leagueoflegends.com/tribunal/en/case/5555631/#nogo"
    try:
        browser = webdriver.Firefox()
        browser.get(case_url)
        url = browser.page_source
        print url
        browser.close()
    except:
        ...

    soup = BeautifulSoup(url)
    ...extraction code that finds the right tags, but they are empty...

When I print the source stored in url, I get the usual HTML, but the generated HTML is missing. How do I get the same HTML that I see when I press F12, but programmatically?

Answer 1:

You don't really need BeautifulSoup to parse the HTML in this case; Selenium itself is pretty powerful when it comes to Locating Elements.

Here's how you can parse the contents of each tab/game one by one:

    # Written for Python 2 and Selenium 2/3. Selenium 4 replaces the
    # find_element_by_* methods with find_element(By.<KIND>, value).
    from selenium import webdriver

    case_url = "http://na.leagueoflegends.com/tribunal/en/case/5555631/#nogo"
    browser = webdriver.Firefox()
    browser.get(case_url)

    # Each game has its own tab; click through them one at a time.
    game_tabs = browser.find_elements_by_xpath('//a[contains(@id, "tab-")]')
    for index, tab in enumerate(game_tabs, start=1):
        tab.click()
        game = browser.find_element_by_id('game%d' % index)
        game_type = game.find_element_by_id('stat-type-fill').text
        game_length = game.find_element_by_id('stat-length-fill').text
        game_outcome = game.find_element_by_id('stat-outcome-fill').text

        # Collect the non-empty chat messages for each side.
        game_chat = game.find_element_by_class_name('chat-log')
        enemy_chat = [msg.text for msg in game_chat.find_elements_by_class_name('enemy') if msg.text]
        ally_chat = [msg.text for msg in game_chat.find_elements_by_class_name('ally') if msg.text]

        print game_type, game_length, game_outcome
        print "Enemy chat: ", enemy_chat
        print "Ally chat: ", ally_chat
        print "------"

This prints:

    Classic 34:48 Loss
    Enemy chat:  [u'Akali [All] [00:01:38] lol', ... ]
    Ally chat:  [u'Gangplank [All] [00:00:12] anyone remember the april fools lee sin spotlight? lol', ... ]
    ------
    Dominion 19:22 Loss
    Enemy chat:  [u'Evelynn [All] [00:00:10] Our GP has a Ti-83', ... ]
    Ally chat:  [u'Miss Fortune [All] [00:00:18] arr ye wodden computer needs to walk the plank!', ... ]


Answer 2:

Further to alexce's answer above, your underlying issue was that you were extracting the HTML before the JavaScript had generated it. Selenium returns control as soon as the browser has loaded the page; it does not wait for HTML that JavaScript generates after the load.

When you call "find_elements", Selenium waits for matching elements to appear, but only up to the driver's implicit wait timeout. That timeout defaults to 0 (no waiting), so set it with "implicitly_wait" after creating the driver.

If you read "page_source" after the "find_elements" call has found the generated elements, you will see the full HTML.
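To make the ordering concrete, here is a minimal sketch of "wait first, then read the source". It is written for Python 3, unlike the Python 2 snippets above. The helper names wait_for_elements and rendered_source are hypothetical, not part of Selenium; the only things they rely on are the driver's find_elements(by, value) method and its page_source attribute. Selenium also ships its own WebDriverWait class, which is likely the better choice in real code; the loop below only spells out the idea.

```python
import time


def wait_for_elements(browser, by, value, timeout=10.0, poll=0.5):
    """Poll browser.find_elements(by, value) until it returns a match.

    Hypothetical helper: raises TimeoutError if nothing appears in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        found = browser.find_elements(by, value)
        if found:  # an empty list means the script has not added them yet
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "no element matched %s=%r within %.1f s" % (by, value, timeout)
            )
        time.sleep(poll)


def rendered_source(browser, by, value, timeout=10.0, poll=0.5):
    """Return page_source only once the generated elements exist."""
    wait_for_elements(browser, by, value, timeout, poll)
    return browser.page_source
```

With a real driver, something like rendered_source(browser, "xpath", '//a[contains(@id, "tab-")]') should return HTML that already contains the game tabs, and that string can then be passed to BeautifulSoup if you still want to parse it there. The 10 second timeout and 0.5 second poll interval are arbitrary assumptions.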

I have automated many pages whose HTML is generated dynamically on the client side, and have had no issues provided you wait for the HTML to be rendered.

Alexce is correct that there is no need to use BeautifulSoup, but I wanted to make it clear that Selenium is perfectly able to automate JavaScript-generated HTML.


