Getting all visible text from a webpage using Selenium

日久生厌 2020-11-30 04:16

I've been googling this all day without finding the answer, so apologies in advance if this has already been answered.

I'm trying to get all visible text from a large

2 Answers
  • 2020-11-30 04:47

    Here's a variation on @unutbu's answer:

    #!/usr/bin/env python
    import sys
    from contextlib import closing
    
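    import lxml.html as html # pip install 'lxml>=2.3.1'
    # lxml >= 5.2 ships the cleaner separately: pip install lxml_html_clean
    from lxml.html.clean        import Cleaner
    from selenium.webdriver     import Firefox         # pip install selenium
    # werkzeug.contrib.cache was removed in Werkzeug 1.0; cachelib provides the same FileSystemCache
    from cachelib               import FileSystemCache # pip install cachelib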
    
    cache = FileSystemCache('.cachedir', threshold=100000)
    
    url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"
    
    
    # get page
    page_source = cache.get(url)
    if page_source is None:
        # use firefox to get page with javascript generated content
        with closing(Firefox()) as browser:
            browser.get(url)
            page_source = browser.page_source
        cache.set(url, page_source, timeout=60*60*24*7) # week in seconds
    
    
    # extract text
    root = html.document_fromstring(page_source)
    # remove flash, images, <script>,<style>, etc
    Cleaner(kill_tags=['noscript'], style=True)(root) # lxml >= 2.3.1
    print(root.text_content()) # extract text
    

    I've separated your task into two parts:

    • get page (including elements generated by javascript)
    • extract text

    The two parts are connected only through the cache, so you can fetch pages in one process and extract text in another, or defer extraction and run it later with a different algorithm.
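
    To make the split concrete, here is a minimal sketch of the fetch stage behind a cache, using only the standard library. The names cache_path, get_page and fake_fetch are hypothetical and not part of the answer's code; fake_fetch stands in for the Selenium step (browser.get followed by browser.page_source), and the one-file-per-URL layout is an assumption, not how FileSystemCache stores entries.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_path(cache_dir, url):
    """Map a URL to a file inside cache_dir (hypothetical layout)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / (digest + ".html")


def get_page(url, fetch, cache_dir):
    """Return the page source, calling fetch(url) only on a cache miss."""
    path = cache_path(cache_dir, url)
    if path.exists():
        return path.read_text(encoding="utf-8")
    page_source = fetch(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(page_source, encoding="utf-8")
    return page_source


def fake_fetch(url):
    """Stand-in for the browser step; counts how often it is called."""
    fake_fetch.calls += 1
    return "<html><body><p>Hello from %s</p></body></html>" % url


fake_fetch.calls = 0

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        first = get_page("https://example.com/", fake_fetch, cache_dir)
        second = get_page("https://example.com/", fake_fetch, cache_dir)
        # The second call is served from the cache, so fetch ran once.
        print(first == second, fake_fetch.calls)
```

    Because the extraction stage only needs the cached string, it can live in a separate script that never starts a browser.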

  • 2020-11-30 05:03

    Using lxml, you might try something like this:

    import contextlib
    import selenium.webdriver as webdriver
    import lxml.html as LH
    import lxml.html.clean as clean
    
    url="http://www.yahoo.com"
    ignore_tags=('script','noscript','style')
    with contextlib.closing(webdriver.Firefox()) as browser:
        browser.get(url) # Load page
        content=browser.page_source
        cleaner=clean.Cleaner()
        content=cleaner.clean_html(content)    
        # Save the cleaned HTML; text mode with an explicit encoding (Python 3)
        with open('/tmp/source.html','w',encoding='utf-8') as f:
            f.write(content)
        doc=LH.fromstring(content)
        with open('/tmp/result.txt','w',encoding='utf-8') as f:
            for elt in doc.iterdescendants():
                if elt.tag in ignore_tags: continue
                text=elt.text or ''
                tail=elt.tail or ''
                words=' '.join((text,tail)).strip()
                if words:
                    f.write(words+'\n')
    

    This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes with time (done with javascript and refresh perhaps).
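
    The same tag-skipping idea can be tried without lxml. The sketch below swaps in the standard library's html.parser, so it is a different technique from the code above; TextExtractor and extract_text are hypothetical names. Like the lxml version, it only drops text inside script, noscript and style, and it does not know whether CSS hides an element.

```python
from html.parser import HTMLParser

IGNORE_TAGS = ("script", "noscript", "style")


class TextExtractor(HTMLParser):
    """Collect text chunks, skipping anything nested inside IGNORE_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._ignored_depth = 0  # > 0 while inside an ignored element

    def handle_starttag(self, tag, attrs):
        if tag in IGNORE_TAGS:
            self._ignored_depth += 1

    def handle_endtag(self, tag):
        if tag in IGNORE_TAGS and self._ignored_depth > 0:
            self._ignored_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._ignored_depth == 0:
            self.chunks.append(text)


def extract_text(page_source):
    """Return the text of page_source, one chunk per line."""
    parser = TextExtractor()
    parser.feed(page_source)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        "<html><head><style>p {color: red}</style>"
        "<script>var x = 1;</script></head>"
        "<body><p>Hello <b>world</b></p>"
        "<noscript><p>Enable JS</p></noscript><p>Bye</p></body></html>"
    )
    print(extract_text(sample))
```

    In practice page_source would be the string returned by browser.page_source, or read back from a cache.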
