Distributed Web crawling using Apache Spark - Is it Possible?

忘掉有多难 2020-12-24 15:31

An interesting question was asked of me in an interview about web mining: is it possible to crawl websites using Apache Spark?


5 Answers
  •  梦毁少年i
    2020-12-24 15:55

    How about this way:

    Your application would get a set of website URLs as input for the crawler. If you were implementing just a normal (non-Spark) app, you might do it as follows (a sketch appears right after this list):

    1. Split all the web pages to be crawled into a list of separate sub-sites, each small enough to be handled well by a single thread. For example, if you have to crawl www.example.com/news from 20150301 to 20150401, the split could be: [www.example.com/news/20150301, www.example.com/news/20150302, ..., www.example.com/news/20150401].
    2. Assign each base URL (e.g. www.example.com/news/20150401) to a single thread; the actual data fetching happens in these threads.
    3. Save the result of each thread to the file system.
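
    A minimal sketch of that non-Spark variant could look like the following. It assumes a trivial fetch helper based on scala.io.Source, covers one partition per day of March 2015 as an illustration, and writes each page to a local file; error handling, robots.txt, and rate limiting are left out.

    import java.nio.file.{Files, Paths}
    import java.util.concurrent.Executors

    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.io.Source

    object ThreadedCrawl {
      // hypothetical helper: download one page and return its content
      def fetch(url: String): String = Source.fromURL(url, "UTF-8").mkString

      def main(args: Array[String]): Unit = {
        // step 1: the split, here one base URL per day of March 2015
        val baseURLs = (1 to 31).map(d => f"http://www.example.com/news/201503$d%02d")

        // step 2: one task per base URL, executed by a fixed-size thread pool
        implicit val ec: ExecutionContext =
          ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

        val tasks = baseURLs.map { url =>
          Future {
            val content = fetch(url)
            // step 3: save the result of each task to the file system
            Files.write(Paths.get(s"crawl-${url.hashCode}.html"), content.getBytes("UTF-8"))
          }
        }
        Await.result(Future.sequence(tasks), Duration.Inf)
      }
    }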

    When the application becomes a Spark application, the same procedure applies, just encapsulated in Spark concepts: we can write a custom CrawlRDD that does the same work:

    1. Split the sites: def getPartitions: Array[Partition] is a good place to do the splitting.
    2. Crawl each split: def compute(part: Partition, context: TaskContext): Iterator[String] is shipped to the executors of your application and runs in parallel there.
    3. Save the RDD to HDFS.

    The final program looks like this:

    import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    import scala.collection.mutable.ArrayBuffer

    // Each partition corresponds to one base URL; Partition requires an index.
    class CrawlPartition(val rddId: Int, val idx: Int, val baseURL: String) extends Partition {
      override def index: Int = idx
    }

    // RDD[String]: each element is the content of one crawled page.
    class CrawlRDD(baseURL: String, sc: SparkContext) extends RDD[String](sc, Nil) {

      override protected def getPartitions: Array[Partition] = {
        val partitions = new ArrayBuffer[CrawlPartition]
        //split baseURL into subsets and populate the partitions
        partitions.toArray
      }

      override def compute(part: Partition, context: TaskContext): Iterator[String] = {
        val p = part.asInstanceOf[CrawlPartition]
        val baseUrl = p.baseURL

        new Iterator[String] {
          var nextURL: String = _

          override def hasNext: Boolean = {
            //logic to find the next URL under baseUrl; if there is one,
            //fill in nextURL and return true, else return false
            false
          }

          override def next(): String = {
            //logic to crawl the page at nextURL and return its content
            ""
          }
        }
      }
    }

    object Crawl {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("Crawler")
        val sc = new SparkContext(sparkConf)
        val crdd = new CrawlRDD("baseURL", sc)
        crdd.saveAsTextFile("hdfs://path_here")
        sc.stop()
      }
    }
    
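    To make the "split baseURL into subsets" placeholder concrete, here is one possible getPartitions for the date-range example above. It is only a sketch, assuming one partition per day of March 2015 and that the constructor's baseURL points at www.example.com/news; `id` is the RDD's own id provided by Spark.

    override protected def getPartitions: Array[Partition] = {
      // one CrawlPartition per day of the requested range
      val days = (1 to 31).map(d => f"201503$d%02d")
      days.zipWithIndex.map { case (day, i) =>
        new CrawlPartition(id, i, s"$baseURL/$day")
      }.toArray
    }

    The Crawl object is then packaged and submitted like any other Spark application, and each partition is crawled by whichever executor gets its task.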
