An interesting question was asked of me at an interview about web mining: is it possible to crawl websites using Apache Spark?
How about this approach:
Your application would get a set of website URLs as input for the crawler. If you were implementing it as a plain (non-Spark) application, you might do it as follows:

1. Split the pages to be crawled into groups of sub-URLs, each small enough for a single thread. For example, to crawl www.example.com/news from 20150301 to 20150401, the split could be: [www.example.com/news/20150301, www.example.com/news/20150302, ..., www.example.com/news/20150401] (see the sketch below).
2. Assign each sub-URL (e.g. www.example.com/news/20150401) to a single thread; the threads are where the actual data fetching happens.
3. Save each thread's results to the file system.
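As an illustration, the date-based split could be computed like this. This is only a minimal sketch: splitByDate is a made-up helper name, and java.time is used for the date arithmetic.

    import java.time.LocalDate
    import java.time.format.DateTimeFormatter

    // Hypothetical helper: expand a base URL and a date range into one URL per day.
    def splitByDate(base: String, from: LocalDate, to: LocalDate): Seq[String] = {
      val fmt = DateTimeFormatter.ofPattern("yyyyMMdd")
      Iterator.iterate(from)(_.plusDays(1))
        .takeWhile(d => !d.isAfter(to))
        .map(d => s"$base/${d.format(fmt)}")
        .toSeq
    }

    // splitByDate("http://www.example.com/news",
    //   LocalDate.of(2015, 3, 1), LocalDate.of(2015, 4, 1))
    // yields .../news/20150301 through .../news/20150401.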
When the application becomes a Spark one, the same procedure applies, just encapsulated in Spark notions; we can write a custom CrawlRDD that does the same job:

1. Split the sites: def getPartitions: Array[Partition] is a good place to do the split.
2. Crawl each split: def compute(part: Partition, context: TaskContext): Iterator[X] is shipped to all the executors of your application and runs in parallel. (X is whatever record type you crawl into; the program below uses String, the raw page content.)
3. Save the resulting RDD, e.g. with saveAsTextFile to HDFS, as the driver below does.

The final program looks like:
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class CrawlPartition(rddId: Int, idx: Int, val baseURL: String) extends Partition {
  // A Partition must expose its index so Spark can identify the split.
  override def index: Int = idx
}

// Each record is the raw content of one crawled page, hence RDD[String].
class CrawlRDD(baseURL: String, sc: SparkContext) extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    val partitions = new ArrayBuffer[Partition]
    // split baseURL into subsets and add one CrawlPartition per subset
    partitions.toArray
  }

  override def compute(part: Partition, context: TaskContext): Iterator[String] = {
    val p = part.asInstanceOf[CrawlPartition]
    val baseUrl = p.baseURL

    new Iterator[String] {
      var nextURL: String = _

      override def hasNext: Boolean = {
        // logic to find the next URL under baseUrl; if there is one,
        // fill in nextURL and return true, else return false
        ???
      }

      override def next(): String = {
        // logic to crawl the web page at nextURL and return its content
        ???
      }
    }
  }
}
object Crawl {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Crawler")
    val sc = new SparkContext(sparkConf)
    // "baseURL" and the HDFS path below are placeholders.
    val crdd = new CrawlRDD("baseURL", sc)
    crdd.saveAsTextFile("hdfs://path_here")
    sc.stop()
  }
}
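For completeness, here is one way the two stubs could be filled in for the date example. It is only a sketch under simple assumptions: it reuses the hypothetical splitByDate helper from above, gives each partition exactly one page, and lets scala.io.Source stand in for a real HTTP client (no timeouts, retries, or robots.txt handling).

    import java.time.LocalDate
    import scala.io.Source

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Sketch: a CrawlRDD variant with the split and fetch logic filled in,
    // one partition per daily URL.
    class DateCrawlRDD(baseURL: String, from: LocalDate, to: LocalDate, sc: SparkContext)
      extends RDD[String](sc, Nil) {

      override protected def getPartitions: Array[Partition] =
        splitByDate(baseURL, from, to).zipWithIndex
          .map { case (url, i) => new CrawlPartition(id, i, url): Partition }
          .toArray

      override def compute(part: Partition, context: TaskContext): Iterator[String] = {
        val url = part.asInstanceOf[CrawlPartition].baseURL
        new Iterator[String] {
          private var fetched = false
          override def hasNext: Boolean = !fetched
          override def next(): String = {
            fetched = true
            // Fetch the page body; a real crawler would use a proper HTTP client.
            Source.fromURL(url, "UTF-8").mkString
          }
        }
      }
    }

One nice property of the Iterator-based compute is that a page is fetched only when the partition is actually consumed, so nothing forces a whole site into memory at once.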