Creating a bot/crawler

Submitted on 2019-12-03 09:04:38
Chris Buckett

There are two parts to this.

  1. Get the page from the remote site.
  2. Read the page into a class that you can parse.

For the first part, if you are planning on running this client-side, you are likely to run into cross-site issues, in that your page, served from server X, cannot request pages from server Y, unless the correct headers are set.

See the questions "CORS with Dart, how do I get it to work?" and "Dart application and cross domain policy" — or the site in question needs to be returning the correct CORS headers.
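For reference, "the correct CORS headers" means server Y must include access-control headers in its responses. A minimal sketch (the header names are standard; the origin value shown is just an illustrative placeholder):

```
Access-Control-Allow-Origin: http://your-app.example.com
Access-Control-Allow-Methods: GET, POST
Access-Control-Allow-Headers: Content-Type
```

Without these, the browser will block your client-side Dart code from reading the response, regardless of what your code does.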

Assuming that you can actually get the pages from the remote site client-side, you can use HttpRequest to retrieve the actual content:

// snippet of code...
new HttpRequest.get("http://www.example.com", (req) {
  // process the req.responseText
});

You can also use HttpRequest.getWithCredentials. If the site has some custom login, then you will probably run into problems, as you will likely have to HTTP POST the username and password from your site into their server.

This is when the second part comes in. You can process your HTML using the DocumentFragment.html(...) constructor, which gives you a nodes collection that you can iterate and recurse through. The example below shows this for a static block of html, but you could use the data returned from the HttpRequest above.

import 'dart:html';

void main() {
  var d = new DocumentFragment.html("""
    <html>
      <head></head>
      <body>Foo</body>
    </html>
  """);

  // print the content of the top-level nodes
  d.nodes.forEach((node) => print(node.text)); // prints "Foo"
  // real-world - use recursion to go down the hierarchy.

}
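The recursion hinted at in the comment above could be sketched like this — `visit` is a hypothetical helper name, not part of dart:html:

```dart
import 'dart:html';

// Recursively walk a node and its children, printing each
// node's text indented by its depth in the tree.
void visit(Node node, int depth) {
  print('${'  ' * depth}${node.text.trim()}');
  for (var child in node.nodes) {
    visit(child, depth + 1);
  }
}

void main() {
  var d = new DocumentFragment.html("<div><p>Foo</p><p>Bar</p></div>");
  visit(d, 0);
}
```

In a real spider you would replace the `print` with whatever extraction logic you need (matching tag names, collecting attributes, and so on).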

I'm guessing (not having written a spider before) that you'd want to pull out specific tags at specific locations / depths to use as your results, and also add the urls in <a> hyperlinks to a queue that your bot will navigate into.
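That queueing idea could be sketched as follows, assuming the HTML has already been loaded into a DocumentFragment as above (the names `queue` and `seen` are just illustrative):

```dart
import 'dart:html';

void main() {
  var queue = <String>[];        // URLs still to be crawled
  var seen = new Set<String>();  // URLs already queued, to avoid duplicates

  var d = new DocumentFragment.html(
      '<a href="http://www.example.com/a">A</a>'
      '<a href="http://www.example.com/b">B</a>');

  // Collect the href of every <a> element into the queue.
  for (var anchor in d.querySelectorAll('a')) {
    var href = anchor.attributes['href'];
    if (href != null && seen.add(href)) {
      queue.add(href);
    }
  }

  print(queue); // the URLs the bot would navigate into next
}
```

The bot's main loop would then pop a URL off the queue, fetch it with HttpRequest, parse the response the same way, and push any newly discovered links back onto the queue.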
