Getting all visible text from a webpage using Selenium

*爱你&永不变心* 提交于 2019-11-26 10:59:30

问题


I\'ve been googling this all day with out finding the answer, so apologies in advance if this is already answered.

I\'m trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.

After a couple of days of research, I decided that Selenium was my best chance. I\'ve found a way to grab all the text, with Selenium, unfortunately the same text is being grabbed multiple times:

from selenium import webdriver
import codecs

filen = codecs.open(\'outoput.txt\', encoding=\'utf-8\', mode=\'w+\')

driver = webdriver.Firefox()

driver.get(\"http://www.examplepage.com\")

allelements = driver.find_elements_by_xpath(\"//*\")

ferdigtxt = []

for i in allelements:

      if i.text in ferdigtxt:
          pass
  else:
         ferdigtxt.append(i.text)
         filen.writelines(i.text)

filen.close()

driver.quit()

The if condition inside the for loop is an attempt at eliminating the problem of fetching the same text multiple times - it does not however, only work as planned on some webpages. (it also makes the script A LOT slower)

I\'m guessing the reason for my problem is that - when asking for the inner text of an element - I also get the inner text of the elements nested inside the element in question.

Is there any way around this? Is there some sort of master element I grab the inner text of? Or a completely different way that would enable me to reach my goal? Any help would be greatly appreciated as I\'m out of ideas for this one.

Edit: the reason I used Selenium and not Mechanize and Beautiful Soup is because I wanted JavaScript tendered text


回答1:


Using lxml, you might try something like this:

import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean

url="http://www.yahoo.com"
ignore_tags=('script','noscript','style')
with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url) # Load page
    content=browser.page_source
    cleaner=clean.Cleaner()
    content=cleaner.clean_html(content)    
    with open('/tmp/source.html','w') as f:
       f.write(content.encode('utf-8'))
    doc=LH.fromstring(content)
    with open('/tmp/result.txt','w') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags: continue
            text=elt.text or ''
            tail=elt.tail or ''
            words=' '.join((text,tail)).strip()
            if words:
                words=words.encode('utf-8')
                f.write(words+'\n') 

This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes with time (done with javascript and refresh perhaps).




回答2:


Here's a variation on @unutbu's answer:

#!/usr/bin/env python
import sys
from contextlib import closing

import lxml.html as html # pip install 'lxml>=2.3.1'
from lxml.html.clean        import Cleaner
from selenium.webdriver     import Firefox         # pip install selenium
from werkzeug.contrib.cache import FileSystemCache # pip install werkzeug

cache = FileSystemCache('.cachedir', threshold=100000)

url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"


# get page
page_source = cache.get(url)
if page_source is None:
    # use firefox to get page with javascript generated content
    with closing(Firefox()) as browser:
        browser.get(url)
        page_source = browser.page_source
    cache.set(url, page_source, timeout=60*60*24*7) # week in seconds


# extract text
root = html.document_fromstring(page_source)
# remove flash, images, <script>,<style>, etc
Cleaner(kill_tags=['noscript'], style=True)(root) # lxml >= 2.3.1
print root.text_content() # extract text

I've separated your task in two:

  • get page (including elements generated by javascript)
  • extract text

The code is connected only through the cache. You can fetch pages in one process and extract text in another process or defer to do it later using a different algorithm.



来源:https://stackoverflow.com/questions/7947579/getting-all-visible-text-from-a-webpage-using-selenium

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!