Distributed Web crawling using Apache Spark - Is it Possible?

忘掉有多难 2020-12-24 15:31

An interesting question was asked of me when I attended an interview regarding web mining. The question was: is it possible to crawl websites using Apache Spark?


5 Answers
  •  独厮守ぢ
    2020-12-24 15:48

    YES.

    Check out the open source project Sparkler (Spark crawler): https://github.com/USCDataScience/sparkler

    Check out Sparkler Internals for a flow/pipeline diagram. (Apologies, it is an SVG image, so I couldn't post it here.)

    This project wasn't available when the question was posted; however, as of December 2016 it is one of the most active projects!

    Is it possible to crawl websites using Apache Spark?

    The following points may help you understand why someone would ask such a question and also help you answer it.

    • The creators of the Spark framework wrote in the seminal paper [1] that RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler.
    • RDDs are key components in Spark. However, you can create traditional map-reduce applications (with little or no abuse of RDDs).
    • There is a widely popular distributed web crawler called Nutch [2]. Nutch is built with Hadoop MapReduce (in fact, Hadoop MapReduce was extracted out of the Nutch codebase).
    • If you can do some task in Hadoop MapReduce, you can also do it with Apache Spark (a minimal sketch follows the references below).

    [1] http://dl.acm.org/citation.cfm?id=2228301
    [2] http://nutch.apache.org/
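
    To make that last point concrete, here is a minimal sketch of a Nutch-style batch crawl loop written with plain Spark RDDs. This is an illustration only, not Sparkler's code: the fetch and link-extraction helpers are deliberately naive placeholders, and each round is a coarse-grained batch update, which is exactly what RDDs handle well.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import scala.io.Source
    import scala.util.Try

    // A minimal Nutch-style batch crawl loop on plain RDDs. Each round
    // fetches the current frontier, extracts outlinks, and builds the
    // next frontier as a whole new RDD (coarse-grained batch updates).
    // Extends Serializable so the helper methods can ship in task closures.
    object BatchCrawlSketch extends Serializable {

      // Naive fetch: return the page body, or an empty string on failure.
      def fetch(url: String): String =
        Try(Source.fromURL(url, "UTF-8").mkString).getOrElse("")

      // Naive outlink extraction via regex (a real crawler would use an HTML parser).
      def extractLinks(html: String): Seq[String] =
        "href=\"(https?://[^\"]+)\"".r.findAllMatchIn(html).map(_.group(1)).toSeq

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("batch-crawl-sketch").setMaster("local[*]"))

        var frontier = sc.parallelize(Seq("https://example.org")) // seed URLs
        var visited  = sc.emptyRDD[String]

        for (_ <- 1 to 2) {                                       // crawl two rounds deep
          val pages    = frontier.map(url => (url, fetch(url)))   // "map": fetch pages
          val outlinks = pages.flatMap { case (_, html) => extractLinks(html) }
          visited  = visited.union(frontier).distinct()
          frontier = outlinks.distinct().subtract(visited)        // dedupe the next frontier
        }

        println(s"Discovered ${visited.count()} URLs")
        sc.stop()
      }
    }
    ```

    Note how the crawl state (visited set, frontier) is only updated between rounds, never within one; that is the batch-oriented style Nutch uses on Hadoop, and it maps directly onto Spark.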


    PS: I am a co-creator of Sparkler and a committer and PMC member for Apache Nutch.


    When I designed Sparkler, I created an RDD that is a proxy to Solr/Lucene-based indexed storage. It enables our crawler-database RDD to make asynchronous fine-grained updates to shared state, which is otherwise not natively possible in Spark.
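
    Roughly, the shape of such an RDD looks like the simplified sketch below. This is not the production code: IndexClient, IndexBackedRDD and markFetched are stand-in names used only to show the idea of an RDD whose partitions read from, and whose tasks write back to, an external indexed store.

    ```scala
    import org.apache.spark.rdd.RDD
    import org.apache.spark.{Partition, SparkContext, TaskContext}

    // Illustrative sketch: an RDD that acts as a thin proxy over an external
    // indexed store (e.g. Solr/Lucene). Each partition streams its slice of
    // the crawl frontier from the index, and tasks can push fine-grained
    // status updates back to the store, which an immutable RDD cannot do natively.

    // Stand-in interface for whatever client talks to the index.
    trait IndexClient extends Serializable {
      def query(q: String, shard: Int): Iterator[String] // URLs due for fetching
      def markFetched(url: String): Unit                 // fine-grained update to shared state
    }

    case class ShardPartition(index: Int) extends Partition

    class IndexBackedRDD(sc: SparkContext,
                         client: IndexClient,
                         numShards: Int,
                         query: String) extends RDD[String](sc, Nil) {

      // One Spark partition per index shard.
      override protected def getPartitions: Array[Partition] =
        Array.tabulate[Partition](numShards)(ShardPartition(_))

      // Pull this partition's slice of the frontier straight from the index.
      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        client.query(query, split.index)
    }
    ```

    A fetch task can then consume URLs from such an RDD and report each page's status back to the index as soon as it is processed, instead of waiting for a whole batch to rebuild the crawl database.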
