Is Erlang the right choice for a webcrawler?

I am also evaluating Erlang for use in a web crawler, and it looks good so far.

There are lots of helpful existing libraries: an HTML parser, an HTTP client, XPath, regex, and caching.
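For example, here is a minimal fetch-and-parse sketch using OTP's built-in httpc client together with the third-party mochiweb_html parser; the mochiweb dependency and the error handling are assumptions for illustration, not something from the original answer:

```erlang
%% Sketch: fetch a page over HTTP(S) with OTP's built-in httpc and parse the
%% body into a tree. mochiweb_html is a third-party dependency (assumption).
-module(fetch_page).
-export([fetch/1]).

%% Url should be a string (charlist), e.g. "https://example.com/".
fetch(Url) ->
    {ok, _} = application:ensure_all_started(inets),
    {ok, _} = application:ensure_all_started(ssl),
    case httpc:request(get, {Url, []}, [], [{body_format, binary}]) of
        {ok, {{_Version, 200, _Reason}, _Headers, Body}} ->
            %% mochiweb_html:parse/1 returns a {Tag, Attrs, Children} tuple tree.
            {ok, mochiweb_html:parse(Body)};
        {ok, {{_Version, Status, _Reason}, _Headers, _Body}} ->
            {error, {http_status, Status}};
        {error, Reason} ->
            {error, Reason}
    end.
```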

And other people are interested in the same use case, so you can learn from them.

However, if this is just a one-off project, I recommend Python / Ruby / Perl because they will be easier to get started with.

Erlang is fine for this. Its regex library delegates (nearly all of) the work to PCRE, which should be fast enough. But avoid strings and use binaries instead: they use far less memory and are faster to translate to C strings.
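A minimal sketch of that advice: the page body stays a binary from fetch to match, and the pattern is compiled once so PCRE does not recompile it per page. The naive href pattern here is purely an illustrative assumption, not a robust link extractor.

```erlang
%% Sketch: pre-compile a pattern once and run it over a binary body.
%% The href pattern is a naive illustration, not a robust link extractor.
-module(link_regex).
-export([hrefs/1]).

hrefs(Body) when is_binary(Body) ->
    {ok, Pattern} = re:compile(<<"href=\"([^\"]+)\"">>, [caseless]),
    case re:run(Body, Pattern, [global, {capture, all_but_first, binary}]) of
        {match, Matches} -> [Href || [Href] <- Matches];
        nomatch          -> []
    end.
```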

Kiril

If you're familiar and comfortable with Erlang, then I'd stick with it if I were you, although I'm not familiar with Erlang myself. With that noted, I'll give you some pointers:

  1. Don't use regular expressions to parse HTML; use XPath instead.
    HTML, while structured, is still quite difficult to parse in the wild, and regular expressions are slow and unreliable for the job.
  2. Determine what your crawler architecture is going to be and what your re-visit policy is (see the process sketch after this list).
  3. Find the selection policy that best fits your needs and implement it.
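Point 2 is where Erlang's process model helps. Below is a minimal sketch of one possible shape, assuming a hypothetical fetch_and_extract/1 helper that returns the links found on a page (left as a stub here): a coordinator process owns the frontier and the visited set, and each fetch runs in its own short-lived worker process.

```erlang
%% Sketch: coordinator with a URL frontier and visited set, one worker per fetch.
%% fetch_and_extract/1 is an assumed helper (see the sketches above); there is
%% no cap on concurrent workers and no politeness delay, so this is only a shape.
-module(crawler).
-export([start/1]).

start(SeedUrls) ->
    crawl(queue:from_list(SeedUrls), sets:new(), 0).

crawl(Frontier, Visited, Inflight) ->
    case {queue:out(Frontier), Inflight} of
        {{empty, _}, 0} ->
            {done, sets:size(Visited)};
        {{empty, _}, _} ->
            wait(Frontier, Visited, Inflight);
        {{{value, Url}, Rest}, _} ->
            case sets:is_element(Url, Visited) of
                true ->
                    crawl(Rest, Visited, Inflight);
                false ->
                    Parent = self(),
                    spawn_link(fun() ->
                        Links = fetch_and_extract(Url),   %% assumed helper
                        Parent ! {crawled, Url, Links}
                    end),
                    crawl(Rest, sets:add_element(Url, Visited), Inflight + 1)
            end
    end.

wait(Frontier, Visited, Inflight) ->
    receive
        {crawled, _Url, Links} ->
            %% Selection policy: here we simply enqueue every discovered link.
            NewFrontier = lists:foldl(fun queue:in/2, Frontier, Links),
            crawl(NewFrontier, Visited, Inflight - 1)
    end.

%% Placeholder so the module compiles; replace with a real fetcher.
fetch_and_extract(_Url) -> [].
```

A real crawler would bound the number of concurrent workers, respect robots.txt and per-host delays, and persist the frontier, but the coordinator-plus-workers message-passing shape stays the same.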

A web crawler is a fairly complex system to build, and you have to be concerned about speed, performance, scalability, and concurrency. Some of the most notable crawlers are written in C++ and Java, but I have not heard of any crawlers written in Erlang.
