Does Solr do web crawling?

Happy的楠姐 2020-12-08 08:09

I am interested in doing web crawling, and I have been looking at Solr.

Does Solr do web crawling? If not, what are the steps to do web crawling?

8 answers
  •  夕颜 (original poster)
    2020-12-08 08:56

    Solr 5 started supporting simple web crawling (Java Doc). If you want search, Solr is the tool; if you want to crawl, Nutch or Scrapy is a better fit :)

    For the details of getting it up and running, take a detailed look here. In short, here is how to do it with a single command:

    # -Dc=gettingstarted   collection: gettingstarted
    # -Ddata=web           web crawling and indexing
    # -Drecursive=3        go 3 levels deep
    # -Ddelay=0            for the impatient; use 10+ for production
    # The last argument is the start URL (a test WordPress blog).
    java \
      -classpath /dist/solr-core-5.4.1.jar \
      -Dauto=yes \
      -Dc=gettingstarted \
      -Ddata=web \
      -Drecursive=3 \
      -Ddelay=0 \
      org.apache.solr.util.SimplePostTool \
      http://datafireball.com/
    

    The crawler here is very "naive"; you can find all of its code in the Apache Solr GitHub repo.

    Here is what the response looks like:
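
    To make "naive" concrete, here is a minimal Python sketch of a level-by-level crawl with a depth limit and a delay, in the spirit of the `-Drecursive` and `-Ddelay` options. It is not SimplePostTool's actual code: the names `LinkParser`, `extract_links`, `crawl`, and the caller-supplied `fetch` function are all hypothetical, and the sketch follows every http(s) link it finds without any of the content-type filtering the real tool does.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links from html, with #fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


def crawl(start_url, fetch, max_depth=3, delay=10.0):
    """Visit pages level by level down to max_depth; return visited URLs.

    fetch(url) is supplied by the caller and returns the page HTML, or
    None on failure, so this sketch performs no network I/O itself.
    """
    seen = {start_url}
    visited = []
    current_level = [start_url]
    for _depth in range(max_depth + 1):
        next_level = []
        for url in current_level:
            html = fetch(url)
            if html is None:
                continue
            visited.append(url)  # a real tool would send the page to Solr here
            for link in extract_links(url, html):
                if link not in seen:  # never queue the same URL twice
                    seen.add(link)
                    next_level.append(link)
            if delay:
                time.sleep(delay)  # be polite to the remote site
        current_level = next_level
        if not current_level:
            break
    return visited
```

    Passing `fetch` in from outside keeps the traversal logic separate from HTTP handling, so the same loop can be exercised against an in-memory dictionary of pages before it is pointed at a real site.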

    SimplePostTool version 5.0.0
    Posting web pages to Solr url http://localhost:8983/solr/gettingstarted/update/extract
    Entering auto mode. Indexing pages with content-types corresponding to file endings xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
    SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
    Entering recursive mode, depth=3, delay=0s
    Entering crawl at level 0 (1 links total, 1 new)
    POSTed web resource http://datafireball.com (depth: 0)
    Entering crawl at level 1 (52 links total, 51 new)
    POSTed web resource http://datafireball.com/2015/06 (depth: 1)
    ...
    Entering crawl at level 2 (266 links total, 215 new)
    ...
    POSTed web resource http://datafireball.com/2015/08/18/a-few-functions-about-python-path (depth: 2)
    ...
    Entering crawl at level 3 (846 links total, 656 new)
    POSTed web resource http://datafireball.com/2014/09/06/node-js-web-scraping-using-cheerio (depth: 3)
    SimplePostTool: WARNING: The URL http://datafireball.com/2014/09/06/r-lattice-trellis-another-framework-for-data-visualization/?share=twitter returned a HTTP result status of 302
    423 web pages indexed.
    COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update/extract...
    Time spent: 0:05:55.059
    

    In the end, you can see that all the data has been indexed properly.
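
    One way to check the result yourself is to ask the collection how many documents it holds. The sketch below is an assumption-laden example rather than part of the original answer: it assumes Solr is reachable at `http://localhost:8983` (as in the log above), that the collection is `gettingstarted`, and that the standard `/select` handler returns JSON with a `response.numFound` field. The helper names `select_url`, `parse_num_found`, and `count_documents` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def select_url(base_url, collection, query="*:*", rows=0):
    """Build a /select URL; rows=0 asks for the count only, no documents."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/solr/{collection}/select?{params}"


def parse_num_found(payload):
    """Extract the total hit count from a Solr JSON response body."""
    return json.loads(payload)["response"]["numFound"]


def count_documents(base_url="http://localhost:8983", collection="gettingstarted"):
    """Query a running Solr instance and return how many documents match *:*."""
    with urlopen(select_url(base_url, collection)) as response:
        return parse_num_found(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running Solr instance; host and collection are assumptions.
    print(count_documents())
```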
