Web scraping a website with dynamic javascript content

Submitted by 可紊 on 2019-12-28 05:56:25

Question


So I'm using Python and beautifulsoup4 (which I'm not tied to) to scrape a website. The problem is that when I use urllib to grab the HTML of a page, I don't get the entire page, because some of it is generated by JavaScript. Is there any way to get around this?


Answer 1:


There are two main options:

  • Use the browser developer tools to see which AJAX requests the page makes while loading, then simulate those requests in your script. You will probably need the json module to load the JSON response string into a Python data structure.
  • Use a tool like Selenium that opens up a real browser. The browser can also be "headless", see Headless Selenium Testing with Python and PhantomJS.

The first option is more difficult to implement and, generally speaking, more fragile, since it breaks whenever the site changes its internal requests. But it doesn't require a real browser and can be faster.
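To make the first option concrete, here is a minimal sketch using only the standard library. The endpoint URL, the `X-Requested-With` header, and the `items`/`title` field names are all assumptions made for illustration; the real ones have to be read off the network tab of the developer tools for the site in question.

```python
import json
import urllib.request


def parse_items(payload):
    """Turn a JSON response string into a list of titles.

    The "items" and "title" keys are hypothetical; replace them with
    whatever structure the real endpoint returns.
    """
    data = json.loads(payload)
    return [item["title"] for item in data.get("items", [])]


def fetch_items(api_url):
    """Request the AJAX endpoint directly instead of the HTML page."""
    request = urllib.request.Request(
        api_url,
        headers={
            # Some endpoints check this header; whether yours does is an assumption.
            "X-Requested-With": "XMLHttpRequest",
            "User-Agent": "Mozilla/5.0",
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = response.read().decode("utf-8")
    return parse_items(payload)
```

You would call it as `fetch_items("https://example.com/api/items?page=1")`, where that URL is only a placeholder. Keeping the parsing in its own function means it can be checked against a saved response without touching the network.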

The second option is better in that you get exactly what a real user gets, and you don't have to worry about how the page was loaded. Selenium is pretty powerful at locating elements on a page, so you may not need BeautifulSoup at all. This option is slower than the first one, though.
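A rough sketch of the second option follows. It assumes the third-party selenium package (version 4 or later) and a local Chrome install, and it uses headless Chrome rather than PhantomJS, which is my substitution and not what the linked article describes. The CSS selector to wait for is something you pick for your own page. The link extraction uses the standard-library `html.parser` only to keep the example self-contained; BeautifulSoup would work on the same string.

```python
from html.parser import HTMLParser


def fetch_rendered_html(url, css_selector, timeout=10):
    """Load the page in headless Chrome and return the HTML after JavaScript ran."""
    # Imported here so the parsing helper below works without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the JavaScript-generated element shows up in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Parse an HTML string and return the link targets in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links
```

Something like `extract_links(fetch_rendered_html(url, "div.results"))` would then give the links from the fully rendered page, with `div.results` standing in for whatever element your page generates.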

Hope that helps.



Source: https://stackoverflow.com/questions/22715036/web-scraping-a-website-with-dynamic-javascript-content
