Scraping an AngularJS application

Submitted by 人盡茶涼 on 2019-12-09 01:54:38

Question


I'm scraping some HTML pages with Rails, using Nokogiri.

I ran into problems when I tried to scrape an AngularJS page, because the gem reads the HTML before the page has been fully rendered.

Is there a way to scrape this type of page? How can I get the page fully rendered before scraping it?


Answer 1:


If you're trying to scrape AngularJS pages in a fully generic fashion, you will likely need something like what @tadman mentioned in the comments (PhantomJS): a headless browser that fully executes the AngularJS JavaScript and then exposes the resulting DOM for inspection.

If you have a specific site or sites that you want to scrape, the path of least resistance is likely to skip the AngularJS frontend entirely and directly query the API from which the Angular code pulls its content. Many AngularJS sites download static JS and HTML templates first, and then make AJAX calls back to a server (either their own or a third-party API) to fetch the content that gets rendered. If you look at their code, you can likely query directly whatever Angular is calling (e.g. via $http, ngResource, or Restangular). The returned data is typically JSON, which is much easier to work with than scraping the rendered HTML.
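As a rough illustration of querying the backing API directly, here is a minimal sketch using only Ruby's standard library. The endpoint URL, the `fetch_json` and `extract_names` helper names, and the assumption that the payload is a JSON array of objects with a "name" key are all hypothetical; the real URL and payload shape have to be read from the site's own requests (for example in the browser's network tab).

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical endpoint: substitute the URL the Angular app actually requests.
API_URL = 'https://example.com/api/items'

# Fetch the raw JSON body from the API that the Angular frontend calls.
def fetch_json(url)
  response = Net::HTTP.get_response(URI(url))
  raise "Request failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  response.body
end

# Extract the fields of interest from the JSON payload.
# Assumes (hypothetically) an array of objects, each with a "name" key.
def extract_names(json_body)
  JSON.parse(json_body).map { |item| item['name'] }
end

# Usage (performs a real HTTP request, so it is left commented out):
# puts extract_names(fetch_json(API_URL))
```

Keeping the parsing step separate from the HTTP call makes it possible to check the extraction logic against a saved response before pointing it at the live site.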




Answer 2:


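You can use:

require 'nokogiri'
require 'phantomjs'
require 'watir'

# Placeholder: replace with the address of the page to scrape.
url = 'http://example.com/'

# Load the page in a headless PhantomJS browser so its JavaScript runs.
b = Watir::Browser.new(:phantomjs)
b.goto url

# Hand the rendered HTML to Nokogiri, then close the browser.
doc = Nokogiri::HTML(b.html)
b.close

Download PhantomJS from http://phantomjs.org/download.html and move the binary to /usr/bin.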



Source: https://stackoverflow.com/questions/27026930/scraping-an-angularjs-application
