An interesting question was asked of me in an interview about web mining: is it possible to crawl websites using Apache Spark?
Spark adds essentially no value to this task.
Sure, you can do distributed crawling, but good crawling tools (such as Apache Nutch) already support this out of the box. The data structures Spark provides, such as RDDs, are pretty much useless here; and if all you want is to launch crawl jobs, you could use YARN, Mesos, etc. directly with less overhead.
Sure, you could do this on Spark, just like you could write a word processor on Spark, since it is Turing complete... but it doesn't get any easier.
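To make the point concrete, here is a minimal sketch of what "crawling" on Spark looks like: parallelize a seed list and fetch each URL in a map. The seed URLs and local-mode setup are illustrative assumptions, not a recommended design. Notice that Spark contributes nothing beyond distributing the fetch calls; everything that makes crawling hard (frontier management, politeness, deduplication) is left entirely to you.

```scala
import org.apache.spark.sql.SparkSession
import scala.io.Source

// Minimal sketch: "crawling" a fixed URL list with Spark.
// Spark only distributes the fetch calls across the cluster;
// the RDD abstraction adds nothing that a real crawler needs.
object SparkCrawlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("crawl-sketch")
      .master("local[*]") // assumption: local mode, for demonstration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical seed list; a real crawler maintains a growing frontier,
    // which an immutable RDD does not model well.
    val seeds = Seq("https://example.org/", "https://example.com/")

    val pages = sc.parallelize(seeds)
      .map { url =>
        // Fetch each page; swallow failures for brevity.
        val body =
          try Source.fromURL(url).mkString
          catch { case _: Exception => "" }
        (url, body.length)
      }

    pages.collect().foreach { case (url, len) =>
      println(s"$url -> $len chars")
    }
    spark.stop()
  }
}
```

Even this toy version shows the mismatch: each new batch of discovered links would require building a fresh RDD and rerunning the job, whereas a dedicated crawler simply feeds its queue.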