Dynamic Data Web Scraping with Python, BeautifulSoup

遥遥无期 2020-12-07 04:44

I am trying to extract this data (a number) from the HTML of many pages. The number is different on each page. When I try to use soup.select('span[class="pull-right"]') it ...

2 answers
  • 2020-12-07 05:17

    The page's JavaScript is not executed when you retrieve the page with requests.get, so anything the script fills in is missing from the HTML you parse. Use Selenium instead: it opens the page in a real browser, as a user would, so the JavaScript runs.

    To get started, install Selenium with pip install selenium. Then retrieve the element with the code below:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    
    browser = webdriver.Firefox()
    # List of (page URL, CSS selector of the element to retrieve) pairs.
    wiki_pages = [("https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi",
                   ".summary-column--container .legend-block--pageviews .linear-legend--counts:first-child span.pull-right"),]
    for url, selector in wiki_pages:
        browser.get(url)  # pass the URL string, not the whole tuple
        # find_element_by_css_selector was removed in Selenium 4; use a By locator.
        page_views_count = browser.find_element(By.CSS_SELECTOR, selector)
        print(page_views_count.text)
    browser.quit()
    

    NOTE: If you need to run a headless browser, consider using PyVirtualDisplay (a wrapper for Xvfb) to run headless WebDriver tests; see 'How do I run Selenium in Xvfb?' for more information.
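
The first answer's point about requests.get can be illustrated without a browser. The sketch below is only an illustration under stated assumptions: STATIC_HTML is a made-up stand-in for the raw response (the real page's markup may differ), and it uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra installs. BeautifulSoup would be handed the same unexecuted markup, so a selector can match the span and still find no number in it.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for what requests.get() returns: the span is empty
# in the markup and would only be filled in later by the page's JavaScript.
STATIC_HTML = """
<div class="legend-block--pageviews">
  <span class="pull-right"></span>
  <script>document.querySelector('.pull-right').textContent = '12,345';</script>
</div>
"""


class PullRightText(HTMLParser):
    """Collect the text inside <span class="pull-right"> elements."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "pull-right" in classes:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        # Script source arrives here too, but only after the span has closed.
        if self.inside:
            self.chunks.append(data)


def static_text(html):
    """Return the text a static parser sees inside span.pull-right."""
    parser = PullRightText()
    parser.feed(html)
    return "".join(parser.chunks).strip()


# The script is never run, so the span's text is still empty.
print(repr(static_text(STATIC_HTML)))
```

The parser treats the script as inert text, which is why a tool that actually runs the script is needed to get the value.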

  • 2020-12-07 05:31

    You should try the Python package Selenium. It requires you to download a driver for whichever browser you are using. You can then use Selenium to pull values out of the rendered HTML:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    
    driver = webdriver.Firefox()
    driver.get("https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi")
    # The find_element_by_* helpers were removed in Selenium 4; use By locators.
    element = driver.find_element(By.CLASS_NAME, "pull-right")
    # or one of the following
    # element = driver.find_element(By.NAME, "q")
    # element = driver.find_element(By.ID, "html ID name")
    # element = driver.find_element(By.NAME, "html element name")
    # element = driver.find_element(By.XPATH, "//input[@id='passwd-id']")
    print(element.text)  # .text is the displayed value; printing element shows only the object
    driver.quit()  # quit() ends the whole browser session, not just the window
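
Both answers open a visible Firefox window and read the element right after the page loads, which can be too early if the script has not yet filled it in. A minimal sketch of one alternative, assuming Selenium 4 and a working Firefox/geckodriver setup: run Firefox headless through its own options (instead of Xvfb) and wait explicitly until the element has non-empty text. The names fetch_page_views and parse_count, the 15-second timeout, and the assumption that the counter is displayed as a whole number with separators such as "12,345" are illustrative and not taken from the answers.

```python
import re


def parse_count(text):
    """Turn a displayed counter such as '12,345' into an int.

    Assumes a whole number with optional separators; raises ValueError
    if the text contains no digits.
    """
    digits = re.sub(r"\D", "", text)
    if not digits:
        raise ValueError(f"no digits in {text!r}")
    return int(digits)


def fetch_page_views(url, selector, timeout=15):
    """Load url in headless Firefox and return the counter found at selector."""
    # Imported here so parse_count stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # no visible window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)

        def rendered_text(drv):
            # Return the element's text once it is non-empty; a falsy
            # return value tells WebDriverWait to keep polling.
            for element in drv.find_elements(By.CSS_SELECTOR, selector):
                text = element.text.strip()
                if text:
                    return text
            return False

        # Raises selenium's TimeoutException if nothing shows up in time.
        return parse_count(WebDriverWait(driver, timeout).until(rendered_text))
    finally:
        driver.quit()


# Example call (needs Firefox and geckodriver):
# print(fetch_page_views(url, selector))  # url/selector as in the first answer
```

Keeping the text-to-number step in its own function makes it easy to check without launching a browser, and the try/finally ensures the browser is closed even when the wait times out.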
    