Distributed Web crawling using Apache Spark - Is it Possible?

Asked by 忘掉有多难, 2020-12-24 15:31

An interesting question was asked of me in an interview about web mining: is it possible to crawl websites using Apache Spark?


5 Answers
  •  没有蜡笔的小新
     2020-12-24 15:55

    Spark adds essentially no value to this task.

    Sure, you can do distributed crawling, but good crawling tools already support this out of the box. The data structures provided by Spark, such as RDDs, are pretty much useless here, and if all you want is to launch crawl jobs, you could use YARN, Mesos, etc. directly with less overhead.

    Yes, you could do this on Spark, just as you could build a word processor on Spark, since it is Turing-complete... but it doesn't get any easier.
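    The claim that ordinary distributed tooling suffices can be illustrated without Spark at all. Below is a minimal, hypothetical sketch of a partitioned crawl using only Python's standard thread pool; `fetch` is a stand-in for a real HTTP client (e.g. `urllib`), not a working one, and all names here are illustrative assumptions rather than any particular crawler's API:

    ```python
    from concurrent.futures import ThreadPoolExecutor

    # Stand-in for a real HTTP fetch; a real crawler would use urllib or
    # an async HTTP client here (assumption for illustration only).
    def fetch(url):
        return f"<html>content of {url}</html>"

    def crawl_partition(urls):
        # Each worker processes its own slice of the URL frontier,
        # mirroring what a Spark partition would do -- but with no RDDs.
        return [(u, fetch(u)) for u in urls]

    def crawl(urls, workers=4):
        # Split the frontier into roughly equal partitions.
        partitions = [urls[i::workers] for i in range(workers)]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = pool.map(crawl_partition, partitions)
        # Flatten the per-partition results back into one list.
        return [page for part in results for page in part]

    pages = crawl([f"https://example.com/{i}" for i in range(10)])
    print(len(pages))  # 10
    ```

    The whole "distributed" part is a work queue and a pool of workers, which YARN, Mesos, or a dedicated crawler like Apache Nutch already provide; Spark's abstractions add nothing specific to the crawling problem itself.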
