How to get data from inspect element of a webpage using Python

♀尐吖头ヾ 提交于 2019-12-06 08:05:09

问题


I'd like to get the data from inspect element using Python. I'm able to download the source code using BeautifulSoup but now I need the text from inspect element of a webpage. I'd truly appreciate if you could advise me how to do it.

Edit: By inspect element I mean, in google chrome, right click gives us an option called inspect element which has code related to each element of that particular page. I'd like to extract that code/ just its text strings.


回答1:


If you want to automatically fetch a web page from Python in a way that runs Javascript, you should look into Selenium. It can automatically drive a web browser (even a headless web browser such as PhantomJS, so you don't have to have a window open).

In order to get the HTML, you'll need to evaluate some javascript. Simple sample code, alter to suit:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://google.com")

# This will get the initial html - before javascript
html1 = driver.page_source

# This will get the html after on-load javascript
html2 = driver.execute_script("return document.documentElement.innerHTML;")

Note 1: If you want a specific element or elements, you actually have a couple of options -- parse the HTML in Python, or write more specific JavaScript that returns what you want.

Note 2: if you actually need specific information from Chrome's tools that is not just dynamically generated HTML, you'll need a way to hook into Chrome itself. No way around that.




回答2:


Inspect element shows all the HTML of the page which is the same as fetching the html using urllib

do something like this

import urllib
from bs4 import BeautifulSoup as BS

html = urllib.urlopen(URL).read()

soup = BS(html)

print soup.findAll(tag_name).get_text()



回答3:


I would like to update answer from Jason S. I wasn't able to start phantomjs on OS X

driver = webdriver.PhantomJS()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File     "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 50, in __init__
self.service.start()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/phantomjs/service.py", line 74, in start
raise WebDriverException("Unable to start phantomjs with ghostdriver.", e)
selenium.common.exceptions.WebDriverException: Message: Unable to start phantomjs with ghostdriver.

Resolved by answer here by downloading executables

driver = webdriver.PhantomJS("phantomjs-2.0.0-macosx/bin/phantomjs")



回答4:


BeautifulSoup could be used to parse the html document, and extract anything you want. It's not designed for downloading. You could find the elements you want by it's class and id.



来源:https://stackoverflow.com/questions/25027339/how-to-get-data-from-inspect-element-of-a-webpage-using-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!