Scrape web pages in real time with Node.js

闹比i 2020-11-29 15:43

What's a good way to scrape website content using Node.js? I'd like to build something very, very fast that can execute searches in the style of kayak.com, where one query

8 answers

孤独总比滥情好 (original poster)

2020-11-29 16:19
All of the aforementioned solutions presume running the scraper locally. This means you will be severely limited in performance, because the requests run either in sequence or across a limited set of threads. A better approach, IMHO, is to rely on an existing, albeit commercial, scraping grid.

Here is an example:
```
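// Assumes the Bobik client SDK is already loaded on the page.
var bobik = new Bobik("YOUR_AUTH_TOKEN");
bobik.scrape({
  urls: ['amazon.com', 'zynga.com', 'http://finance.google.com/', 'http://shopping.yahoo.com'],
  // The queries mix XPath expressions, JavaScript snippets, and CSS selectors.
  queries: ["//th", "//img/@src", "return document.title", "return $('script').length", "#logo", ".logo"]
}, function (scraped_data) {
  // Called once, after the remote scrape has finished.
  if (!scraped_data) {
    console.log("Data is unavailable");
    return;
  }
  // The result object is keyed by URL, so iterate over its keys
  // (a for...in over Object.keys() would log array indices instead of URLs).
  Object.keys(scraped_data).forEach(function (url) {
    console.log("Results from " + url + ": " + scraped_data[url]);
  });
});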
```
Here, scraping is performed remotely, and the callback is invoked only when the results are ready (there is also an option to collect results as they become available).

You can download the Bobik client proxy SDK at https://github.com/emirkin/bobik_javascript_sdk
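
For comparison, here is a minimal sketch of the local approach this answer argues against: running the scrape jobs yourself with a fixed cap on how many are in flight at once. It is not part of the Bobik SDK; `mapWithConcurrency`, `limit`, and `worker` are hypothetical names introduced only for illustration.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit`
// tasks in flight at any time. Results keep the order of `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item that has not been claimed yet

  // Each "lane" repeatedly claims the next unprocessed item until none remain.
  async function runLane() {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { ok: true, value: await worker(items[index]) };
      } catch (err) {
        // Record the failure instead of aborting the whole batch.
        results[index] = { ok: false, error: err };
      }
    }
  }

  // Start no more lanes than there are items, and always at least one.
  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, runLane));
  return results;
}
```

On Node.js 18 or later, where `fetch` is available globally, this could be called as `mapWithConcurrency(urls, 4, (url) => fetch(url).then((res) => res.text()))`. Total time then grows with the number of URLs divided by `limit`, which is the bottleneck a remote scraping grid is meant to remove.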