Distributed Web crawling using Apache Spark - Is it Possible?


Question


An interesting question was put to me in an interview about web mining: is it possible to crawl websites using Apache Spark?

I guessed that it was possible, because Spark provides distributed processing capacity. After the interview I searched for it, but couldn't find any convincing answer. Is it possible with Spark?


Answer 1:


How about this way:

Your application would get a set of website URLs as input for your crawler. If you were implementing it as just a normal (non-Spark) app, you might do it as follows (a rough sketch follows the list):

  1. split all the web pages to be crawled into a list of separate sub-sites, each small enough for a single thread to handle well. For example: if you have to crawl www.example.com/news from 20150301 to 20150401, the split result could be: [www.example.com/news/20150301, www.example.com/news/20150302, ..., www.example.com/news/20150401]
  2. assign each base URL (e.g. www.example.com/news/20150401) to a single thread; the actual data fetching happens in those threads
  3. save the result of each thread to the file system.
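
Here is a minimal sketch of that plain threaded version, purely for illustration: the per-day URL pattern, the fixed-size thread pool and the fetchPage helper are assumptions of mine, not part of the original answer, and the fetch itself is deliberately naive (no retries, timeouts or politeness).

import java.util.concurrent.Executors

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.io.Source

object ThreadedCrawl {
  // Naive page fetch; a real crawler needs timeouts, retries and robots.txt handling.
  def fetchPage(url: String): String = Source.fromURL(url).mkString

  def main(args: Array[String]): Unit = {
    // One base URL per day, as in the www.example.com/news example above.
    val days = (1 to 31).map(d => f"http://www.example.com/news/201503$d%02d")

    // A fixed pool of worker threads; each base URL becomes one crawl task.
    implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))
    val results = Future.sequence(days.map(url => Future(url -> fetchPage(url))))

    // "Save to the file system" is stubbed out as a println here.
    Await.result(results, Duration.Inf).foreach { case (url, html) =>
      println(s"$url -> ${html.length} characters")
    }
    ec.shutdown()
  }
}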

When the application becomes a Spark one, the same procedure applies, but encapsulated in Spark concepts: we can write a custom CrawlRDD to do the same stuff:

  1. Split sites: def getPartitions: Array[Partition] is a good place to do the split task.
  2. Crawl each split: def compute(part: Partition, context: TaskContext): Iterator[X] is spread across all the executors of your application and runs in parallel.
  3. Save the RDD into HDFS.

The final program looks like:

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// X is a placeholder for the type of the crawled content (e.g. String).

class CrawlPartition(rddId: Int, idx: Int, val baseURL: String) extends Partition {
  override def index: Int = idx
}

class CrawlRDD(baseURL: String, sc: SparkContext) extends RDD[X](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    val partitions = new ArrayBuffer[Partition]
    // split baseURL into subsets and populate partitions with CrawlPartition instances
    partitions.toArray
  }

  override def compute(part: Partition, context: TaskContext): Iterator[X] = {
    val p = part.asInstanceOf[CrawlPartition]
    val baseUrl = p.baseURL

    new Iterator[X] {
      var nextURL: String = _

      override def hasNext: Boolean = {
        // logic to find the next URL: if there is one, fill in nextURL and return true,
        // otherwise return false
      }

      override def next(): X = {
        // logic to crawl the web page at nextURL and return its content as X
      }
    }
  }
}

object Crawl {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Crawler")
    val sc = new SparkContext(sparkConf)
    val crdd = new CrawlRDD("baseURL", sc)
    crdd.saveAsTextFile("hdfs://path_here")
    sc.stop()
  }
}



Answer 2:


Spark adds essentially no value to this task.

Sure, you can do distributed crawling, but good crawling tools already support this out of the box. The data structures provided by Spark, such as RDDs, are pretty much useless here, and just to launch crawl jobs you could use YARN, Mesos, etc. directly with less overhead.

Sure, you could do this on Spark, just like you could write a word processor on Spark, since it is Turing complete... but it doesn't get any easier.




Answer 3:


YES.

Check out the open source project: Sparkler (spark - crawler) https://github.com/USCDataScience/sparkler

Check out Sparkler Internals for a flow/pipeline diagram. (Apologies, it is an SVG image, so I couldn't post it here.)

This project wasn't available when the question was posted; however, as of December 2016 it is a very active project!

Is it possible to crawl websites using Apache Spark?

The following points may help you understand why someone would ask such a question, and also help you answer it.

  • The creators of the Spark framework wrote in the seminal paper [1] that RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler
  • RDDs are key components of Spark. However, you can still create traditional map-reduce style applications with little or no abuse of RDDs
  • There is a very popular distributed web crawler called Nutch [2]. Nutch is built with Hadoop MapReduce (in fact, Hadoop MapReduce was extracted out of the Nutch codebase)
  • If you can do a task with Hadoop MapReduce, you can also do it with Apache Spark (a minimal map-style sketch follows the references below)

[1] http://dl.acm.org/citation.cfm?id=2228301
[2] http://nutch.apache.org/
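
As a small illustration of that last point, here is a minimal, hypothetical map-style crawl in plain Spark (not from Nutch or Sparkler; the seed URLs, output path and fetch helper are assumptions, and the fetch is deliberately naive):

import scala.io.Source

import org.apache.spark.{SparkConf, SparkContext}

object SimpleSparkCrawl {
  // Naive fetch: no politeness or retries; errors are simply swallowed.
  def fetch(url: String): String =
    try Source.fromURL(url).mkString.replaceAll("\\s+", " ")
    catch { case _: Exception => "" }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SimpleSparkCrawl"))
    val seeds = Seq("http://example.com/page1", "http://example.com/page2")

    sc.parallelize(seeds, numSlices = 2)            // distribute the seed list
      .map(url => s"$url\t${fetch(url)}")           // "map": fetch pages on the executors
      .saveAsTextFile("hdfs:///tmp/crawl-output")   // sink: write results like a reduce output
    sc.stop()
  }
}

This is essentially the Hadoop MapReduce pattern expressed with RDD operations; the shared-state concern raised in [1] only appears once you want incremental, fine-grained updates between crawl rounds.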


PS: I am a co-creator of Sparkler and a committer and PMC member for Apache Nutch.


When I designed Sparkler, I created an RDD which is a proxy to Solr/Lucene based indexed storage. It enabled our crawler-database RDD to make asynchronous fine-grained updates to shared state, which is otherwise not natively possible; the sketch below illustrates the general idea.
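
As a rough illustration of that design (not Sparkler's actual code; the CrawlDbClient interface below is entirely made up), an RDD can act as a proxy to an external indexed store by reading pending URLs from it in compute() and writing status updates back, so the mutable crawl state lives outside the RDD:

import org.apache.spark.rdd.RDD
import org.apache.spark.{Partition, SparkContext, TaskContext}

// Hypothetical client for an external indexed store (think Solr/Lucene); not a real API.
trait CrawlDbClient extends Serializable {
  def dueUrls(shard: Int): Iterator[String]        // URLs scheduled for this shard
  def markFetched(url: String, status: Int): Unit  // fine-grained update pushed back to the store
}

class ShardPartition(val shard: Int) extends Partition {
  override def index: Int = shard
}

// The RDD holds no crawl state itself: state is read from and written to the external store.
class CrawlDbRDD(sc: SparkContext, db: CrawlDbClient, shards: Int)
    extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    (0 until shards).map(i => new ShardPartition(i): Partition).toArray

  override def compute(part: Partition, context: TaskContext): Iterator[String] =
    db.dueUrls(part.index).map { url =>
      val content = ""                             // placeholder: fetch the page here
      db.markFetched(url, 200)                     // update shared state as pages are processed
      s"$url\t$content"
    }
}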




Answer 4:


There is a project called SpookyStuff, which is a

"Scalable query engine for web scraping/data mashup/acceptance QA, powered by Apache Spark"

Hope it helps!




Answer 5:


I think the accepted answer is incorrect in one fundamental way: real-life large-scale web extraction is a pull process.

This is because requesting HTTP content is often a far less laborious task than building the response. I have built a small program that is able to crawl 16 million pages a day with four CPU cores and 3 GB of RAM, and it was not even particularly well optimized. For a server, a similar load (~200 requests per second) is not trivial and usually requires many layers of optimization.

Real websites can, for example, have their cache system broken if you crawl them too fast (instead of holding the most popular pages, the cache can get flooded with the long-tail content of the crawl). So in that sense, a good web scraper always respects robots.txt etc.

The real benefit of a distributed crawler doesn't come from splitting the workload of one domain, but from spreading the workload of many domains across a single distributed process, so that the one process can confidently track how many requests the system puts through (see the rough sketch below).
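
As an illustration of that point (made up for this write-up, not part of the original answer), a crawler that owns the whole pull queue can enforce a per-domain politeness delay in one place:

import java.net.URI

import scala.collection.mutable

// One shared pull queue that remembers the last fetch time per domain and only hands
// out a URL when that domain's politeness delay has elapsed.
class PoliteQueue(delayMs: Long) {
  private val lastFetch = mutable.Map[String, Long]()
  private val pending = mutable.Queue[String]()

  def enqueue(url: String): Unit = synchronized { pending.enqueue(url) }

  // Returns the next URL whose domain may be hit right now, if any, and records the hit.
  def nextAllowed(): Option[String] = synchronized {
    val now = System.currentTimeMillis()
    pending.dequeueFirst { url =>
      now - lastFetch.getOrElse(new URI(url).getHost, 0L) >= delayMs
    }.map { url =>
      lastFetch(new URI(url).getHost) = now
      url
    }
  }
}

Workers then pull from this single queue instead of hammering one domain in parallel, and the process as a whole can see and control how fast each domain is being hit.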

Of course, in some cases you want to be the bad boy and ignore the rules; however, in my experience such products don't stay alive long, since website owners like to protect their assets from things that look like DoS attacks.

Golang is very good for building web scrapers, since it has channels as a native data type and they support pull queues very well. Because the HTTP protocol and scraping in general are slow, you can include the extraction pipelines as part of the process, which lowers the amount of data to be stored in the data warehouse system. You can crawl one TB while spending less than $1 worth of resources, and do it fast, when using Golang and Google Cloud (it is probably doable with AWS and Azure as well).

Spark gives you no additional value. Using wget as a client is clever, since it automatically respects robots.txt properly: a parallel, domain-specific pull queue feeding wget is the way to go if you are working professionally.



Source: https://stackoverflow.com/questions/29950299/distributed-web-crawling-using-apache-spark-is-it-possible
