How can scraped HTML be different from the page source?

一个人的身影 2020-12-21 15:19

I'm scraping a list of restaurants from a website (with permission) and I have a problem: the HTML that Python scrapes from the website is different from the HTML in the page source.

2 answers
  •  南笙 (original poster)
    2020-12-21 15:37

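    You can use Selenium for this. It drives a real browser, so the page's JavaScript runs and the page is rendered just as it is when you open it yourself. A plain HTTP request, by contrast, only returns what the server sends before any script has run. Selenium works with Firefox, Chrome, or PhantomJS.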
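    To check whether this is what is happening, one option is to compare the HTML the server sends with the HTML after a browser has run the page's scripts. The sketch below is only illustrative: the function names and the marker string are made up for this example, and it assumes Firefox and geckodriver are installed. Selenium is imported inside the function so the comparison helper can be used without it.

```python
from urllib.request import Request, urlopen


def fetch_raw_html(url):
    """Return the HTML exactly as the server sends it (no JavaScript is run)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fetch_rendered_html(url):
    """Return the DOM as serialized after a real browser has loaded the page."""
    # Imported lazily so the other helpers work without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def added_by_javascript(raw_html, rendered_html, marker):
    """True if `marker` appears only in the rendered HTML."""
    return marker in rendered_html and marker not in raw_html
```

    For example, if `added_by_javascript(fetch_raw_html(url), fetch_rendered_html(url), "restaurant")` returns True, that would suggest the restaurant list is inserted by a script, which is why a plain request never sees it. The marker "restaurant" is a placeholder for whatever text identifies the missing content.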

    Selenium

    Selenium is used here mainly to render the page completely, since most sites today are built with modern JavaScript frameworks. It is commonly used to build crawlers and scrapers that gather data from the pages of a website, and it is also used for web automation.
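    One thing to watch for with such sites: `driver.get()` can return before the framework has finished fetching and inserting its data. Selenium's `WebDriverWait` is the usual tool for that. The hand-rolled loop below is not Selenium's API, just a minimal sketch of the same polling idea; `wait_for_marker` and its parameters are hypothetical names.

```python
import time


def wait_for_marker(get_html, marker, timeout=10.0, poll_interval=0.5):
    """Call get_html() repeatedly until `marker` shows up in the result.

    Returns the HTML that contains the marker, or raises TimeoutError
    if it has not appeared once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("%r did not appear within %s seconds" % (marker, timeout))
        time.sleep(poll_interval)
```

    With a live driver this could be called as `wait_for_marker(lambda: driver.page_source, "restaurant")`, where the marker is whatever text identifies the content you need.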

    For more on Selenium, read the documentation: http://selenium-python.readthedocs.io/
    I also have a blog post on Selenium for beginners: http://blog.hassanmehmood.com/creating-your-first-crawler-in-python/

    Example

    from selenium import webdriver
    
    profile_link = 'http://hassanmehmood.com'
    
    
    class TitleScrapper(object):
    
        def __init__(self):
    
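            # Preferences are set on FirefoxOptions; newer Selenium releases
            # no longer accept the firefox_profile argument.
            options = webdriver.FirefoxOptions()
            options.set_preference("browser.startup.homepage_override.mstone", "ignore")  # Avoid startup screen
            options.set_preference("startup.homepage_welcome_url.additional", "about:blank")
    
            self.driver = webdriver.Firefox(options=options)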
            self.driver.set_window_size(1120, 550)
    
        def scrape_profile(self):
            self.driver.get(profile_link)
            print(self.driver.title)  # print is a function in Python 3
            self.driver.quit()  # End the whole browser session, not just the window
    
        def scrape(self):
            self.scrape_profile()
    
    
    if __name__ == '__main__':
        scraper = TitleScrapper()
        scraper.scrape()
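
    Once the page has been rendered, `self.driver.page_source` holds the HTML you see in the browser, and it can be parsed like any other HTML string. A minimal sketch using only the standard library, assuming (hypothetically) that each restaurant name sits in an element with the class `restaurant-name`; the real class name depends on the site:

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.results = []
        self._depth = 0    # Greater than 0 while inside a matching element
        self._chunks = []  # Text pieces of the element being collected

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1
        elif self.css_class in classes:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.results.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_text_by_class(html, css_class):
    """Return the text of each element in `html` that has `css_class`."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.results
```

    Inside `scrape_profile` this could be used as `extract_text_by_class(self.driver.page_source, "restaurant-name")` before the driver is closed.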
    
