Question
Let's say you have some HTML source that has been scraped with Selenium and parsed with BeautifulSoup:
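from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get(url)  # url: address of the page being scraped
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(driver.page_source, "html.parser")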
Is there a way to remove, from the HTML source or from the soup object, all elements that either:
1.) have style=display:none set inline on the tag itself (e.g. <div style='display:none'>...</div>)
or
2.) are given the display:none property by the page's CSS?
Answer 1:
I remember dealing with a website like this. Each IP address was spread across multiple HTML elements: some were hidden with an inline display: none style, and others carried a CSS class that made them invisible. Getting the real IP address out of this mess with BeautifulSoup alone was quite difficult.
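For the inline-style half of the problem, a minimal BeautifulSoup sketch might look like the following. The helper name strip_inline_hidden and the sample markup are hypothetical, made up for illustration. It only catches display:none written directly in a style attribute; elements hidden through a CSS class or an external stylesheet pass straight through, which is exactly the part that makes a pure-BeautifulSoup solution painful.

```python
import re
from bs4 import BeautifulSoup

# Matches "display:none" inside a style attribute, ignoring case and spacing.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


def strip_inline_hidden(html):
    """Parse html and drop every element hidden by an inline display:none.

    Hypothetical helper: it does NOT look at CSS classes or stylesheets.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Re-query after each removal so nested hidden elements that were
    # already destroyed with their parent are never touched again.
    while True:
        tag = soup.find(style=HIDDEN_STYLE)
        if tag is None:
            break
        tag.decompose()
    return soup


# Made-up sample that mimics an IP address padded with hidden decoys.
sample = (
    "<div><span style='display:none'>12</span><span>101</span>"
    "<span style='display: none'>7</span>.26</div>"
)
cleaned = strip_inline_hidden(sample)
print(cleaned.get_text())
```

Because the function returns a regular soup object, the usual find_all and get_text calls work on the result.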
The good news is that Selenium handles this use case for you: the .text of a WebElement returns only the element's visible text, which is exactly what is needed here.
Demo:
In [1]: from selenium import webdriver
In [2]: from selenium.webdriver.common.by import By
In [3]: driver = webdriver.Firefox()
In [4]: driver.get("http://proxylist.hidemyass.com/")
In [5]: for row in driver.find_elements(By.CSS_SELECTOR, "section.proxy-results table#listable tr")[1:]:
   ...:     cells = row.find_elements(By.TAG_NAME, "td")
   ...:     print(cells[1].text.strip())  # .text holds only the visible text
   ...:
101.26.38.162
120.198.236.10
213.85.92.10
...
216.161.239.51
212.200.111.198
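If you want to drop whole elements rather than just read their visible text, one option is to filter on WebElement.is_displayed(), which reports the rendered state and so covers both inline styles and stylesheet rules. The sketch below is an illustration, not part of the original answer, and the helper name visible_texts is hypothetical. It is written against any objects exposing .text and .is_displayed(), so it works on the lists returned by driver.find_elements(...).

```python
def visible_texts(elements):
    """Return the stripped text of each element the browser reports as displayed.

    Hypothetical helper: `elements` is expected to be an iterable of Selenium
    WebElement objects (anything with .text and .is_displayed() works).
    """
    texts = []
    for element in elements:
        # is_displayed() is False for elements hidden via display:none,
        # whether the rule comes from a style attribute or from CSS.
        if element.is_displayed():
            texts.append(element.text.strip())
    return texts
```

With the session above, this could be called as visible_texts(row.find_elements(By.TAG_NAME, "td")) for each table row.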
Source: https://stackoverflow.com/questions/33597616/filtering-out-html-elements-which-have-displaynone-either-as-a-tag-attribute