An interesting question I was asked in an interview about web mining: is it possible to crawl websites using Apache Spark?
YES.
Check out the open source project Sparkler (Spark crawler): https://github.com/USCDataScience/sparkler
Check out Sparkler Internals for a flow/pipeline diagram. (Apologies, it is an SVG image, so I couldn't post it here.)
This project wasn't available when the question was originally posted; however, as of December 2016 it is a very active project.
The following resources may help you understand why someone would ask such a question, and also help you answer it.
[1] http://dl.acm.org/citation.cfm?id=2228301
[2] http://nutch.apache.org/
PS: I am a co-creator of Sparkler, and a committer and PMC member for Apache Nutch.
When I designed Sparkler, I created an RDD that acts as a proxy to Solr/Lucene-based indexed storage. This lets our crawl-database RDD make asynchronous, fine-grained updates to shared state, which RDDs do not support natively.
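To make that concrete, here is a minimal Scala sketch of the pattern, not Sparkler's actual implementation: a custom RDD whose partitions read crawl entries out of a Solr index, plus a helper that writes per-URL status updates straight back to the index via Solr atomic updates. The names here (CrawlDbRDD, the group and status fields, markFetched) are illustrative assumptions.

```scala
// A minimal sketch of an RDD proxying a Solr index -- NOT Sparkler's actual
// code. The RDD itself stays immutable; the mutable shared crawl state lives
// in Solr, which every task can read from and write back to.
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.{SolrDocument, SolrInputDocument}
import scala.collection.JavaConverters._

// One partition per crawl group (e.g. a hostname), so all URLs of a host
// land in the same task and politeness limits are easy to enforce.
case class CrawlDbPartition(index: Int, group: String) extends Partition

class CrawlDbRDD(sc: SparkContext, solrUrl: String, groups: Seq[String])
    extends RDD[SolrDocument](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](groups.length)(i => CrawlDbPartition(i, groups(i)))

  // Each task opens its own Solr client and pulls the unfetched URLs of
  // its group out of the index.
  override def compute(split: Partition, context: TaskContext): Iterator[SolrDocument] = {
    val part   = split.asInstanceOf[CrawlDbPartition]
    val client = new HttpSolrClient.Builder(solrUrl).build()
    val query  = new SolrQuery("status:UNFETCHED")
      .addFilterQuery(s"group:${part.group}")
      .setRows(1000)
    // The result list is fully materialized, so the client can be closed
    // before the iterator is consumed.
    val results = try client.query(query).getResults finally client.close()
    results.iterator().asScala
  }
}

object CrawlDbUpdater {
  // After fetching a URL, a task pushes a per-document status change back
  // to Solr -- the fine-grained write to shared state that a plain RDD
  // cannot express on its own.
  def markFetched(solrUrl: String, url: String): Unit = {
    val client = new HttpSolrClient.Builder(solrUrl).build()
    try {
      val doc = new SolrInputDocument()
      doc.addField("id", url)
      doc.addField("status", Map("set" -> "FETCHED").asJava) // atomic "set" update
      client.add(doc)
      client.commit()
    } finally client.close()
  }
}
```

The trick is the division of labor: Spark handles partitioning and parallel fetching, while the Solr index serves as the single mutable crawl database that every task can query and update independently.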