stormcrawler

Crawl URLs based on their priorities in StormCrawler

一曲冷凌霜 submitted on 2021-02-17 06:52:04
Question: I am working on a crawler based on the StormCrawler project. I have a requirement to crawl URLs according to their priority. For example, I have two priority levels, HIGH and LOW, and I want HIGH-priority URLs to be crawled as soon as possible, before the LOW ones. How can I handle this requirement in Apache Storm and StormCrawler?

Answer 1: With Elasticsearch as a backend, you can configure the spouts to sort the URLs within a bucket by whichever
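A minimal sketch of that approach (the class, metadata key and host rule below are assumptions for illustration, not taken from the question): give every URL a priority value in its metadata, list that key under metadata.persist so it is stored in the status index, and point the Elasticsearch spout's sort field (es.status.bucket.sort.field) at it so higher-priority URLs are emitted first.

    package com.example.priority; // hypothetical package

    import org.w3c.dom.DocumentFragment;

    import com.digitalpebble.stormcrawler.Metadata;
    import com.digitalpebble.stormcrawler.parse.Outlink;
    import com.digitalpebble.stormcrawler.parse.ParseFilter;
    import com.digitalpebble.stormcrawler.parse.ParseResult;

    // Hypothetical ParseFilter tagging outlinks with a "priority" metadata value
    // ("0" sorts before "1" when the spout sorts ascending).
    public class PriorityTagger extends ParseFilter {

        @Override
        public void filter(String URL, byte[] content, DocumentFragment doc, ParseResult parse) {
            for (Outlink outlink : parse.getOutlinks()) {
                Metadata md = outlink.getMetadata();
                // Made-up rule: anything on important.example.com is HIGH priority
                String priority = outlink.getTargetURL().contains("important.example.com") ? "0" : "1";
                md.addValue("priority", priority);
            }
        }
    }

Seeds can carry the same key directly in the seed file. Whether the spout can sort on a metadata field out of the box depends on the StormCrawler and Elasticsearch versions in use, so treat this as a starting point rather than a guaranteed recipe.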

Storm Crawler with Java 11

谁说我不能喝 submitted on 2021-01-28 20:18:00
Question: I am trying to move from Java 8 to Java 11 to compile and run StormCrawler. Is StormCrawler supported on Java 11? After I update the Java version in my POM, the project builds successfully, but when I run it I get the following error while running the InjectorTopology:

560 [main] INFO c.a.h.c.InjectorTopology - ####### The Injector Topology Started #######
563 [main] INFO c.a.h.c.u.PropertyFileReader -

Can I store the HTML content of a webpage in StormCrawler?

眉间皱痕 submitted on 2020-01-13 06:57:09
Question: I am using storm-crawler-elastic. I can see the fetched URLs and their status. Changing the configuration in the ES_IndexInit.sh file gives only url, title, host and text. Can I store the entire HTML content, with the HTML tags?

Answer 1: The ES IndexerBolt gets the content of pages from the ParseFilter but does not do anything with it. One option would be to modify the code so that it pulls the content field from the incoming tuples and indexes it. Alternatively, you could implement a custom
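A sketch of that custom-filter route (class name, metadata key and the UTF-8 assumption are illustrative choices, not from the answer): a ParseFilter copies the raw page bytes into the parse metadata, and the indexer is then told to index that key via the indexer.md.mapping section of the configuration.

    package com.example.parse; // hypothetical package

    import java.nio.charset.StandardCharsets;

    import org.w3c.dom.DocumentFragment;

    import com.digitalpebble.stormcrawler.parse.ParseFilter;
    import com.digitalpebble.stormcrawler.parse.ParseResult;

    // Hypothetical ParseFilter keeping the raw markup so it can be indexed later.
    public class RawHTMLKeeper extends ParseFilter {

        @Override
        public void filter(String URL, byte[] content, DocumentFragment doc, ParseResult parse) {
            if (content == null) {
                return;
            }
            // Assume UTF-8 here; real pages may declare a different charset.
            String html = new String(content, StandardCharsets.UTF_8);
            parse.get(URL).getMetadata().addValue("raw.html", html);
        }
    }

The filter would be declared in parsefilters.json, and an entry such as raw.html=html under indexer.md.mapping would expose it as a field in the Elasticsearch documents. Keep in mind that storing full HTML can grow the index considerably.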

Excluding special characters from crawling

。_饼干妹妹 submitted on 2020-01-06 12:25:33
Question: I am working with StormCrawler 1.13 and Elasticsearch 6.5.2. How can I restrict the crawler so that it does not crawl/index the special characters � � � � � ��� �� � •?

Answer 1: An easy way to do this is to write a ParseFilter along the lines of:

ParseData pd = parse.get(URL);
String text = pd.getText();
// remove chars
pd.setText(text);

This will get called on documents parsed by JSoup or Tika. Have a look at the parse filters in the repository for examples.

Source: https://stackoverflow.com/questions/54096045/explicit-special
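Fleshed out into a compilable class, the filter sketched in the answer above could look like the following (package, class name and the exact characters removed are assumptions for illustration):

    package com.example.filtering; // hypothetical package

    import org.w3c.dom.DocumentFragment;

    import com.digitalpebble.stormcrawler.parse.ParseData;
    import com.digitalpebble.stormcrawler.parse.ParseFilter;
    import com.digitalpebble.stormcrawler.parse.ParseResult;

    // Hypothetical ParseFilter removing the Unicode replacement character and
    // bullet characters from the extracted text before it reaches the indexer.
    public class SpecialCharFilter extends ParseFilter {

        @Override
        public void filter(String URL, byte[] content, DocumentFragment doc, ParseResult parse) {
            ParseData pd = parse.get(URL);
            String text = pd.getText();
            if (text == null || text.isEmpty()) {
                return;
            }
            String cleaned = text.replaceAll("[\uFFFD\u2022]+", " ").replaceAll("\\s+", " ").trim();
            pd.setText(cleaned);
        }
    }

Like the other parse filters, it would need to be registered in parsefilters.json to take effect.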

StormCrawler cannot connect to ElasticSearch

一世执手 submitted on 2020-01-06 06:49:27
Question: While running the command

storm jar target/crawlIndexer-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-injector.flux --sleep 86400000

I get an error saying:

8710 [Thread-26-status-executor[4 4]] ERROR c.d.s.e.p.StatusUpdaterBolt - Can't connect to ElasticSearch

When I open http://localhost:9200/ in a browser, Elasticsearch loads up successfully, and Kibana also connects to it. So it must be the connection from StormCrawler to Elasticsearch. What could be the issue? Snippet of the full error: 8710

StormCrawler with SQL external module gets a ParseFilters exception at the crawl stage

浪尽此生 submitted on 2020-01-06 05:42:39
Question: I use StormCrawler with the SQL external module. I have updated my pom.xml with:

<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler-sql</artifactId>
    <version>1.8</version>
</dependency>

I use a similar injector/crawl procedure as with the ES setup:

storm jar target/stromcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local sql-injector.flux --sleep 864000

I have created a MySQL database crawl and a table urls, and successfully injected my URLs into it. For

StormCrawler DISCOVERs and FETCHes a website but nothing gets saved in docs

坚强是说给别人听的谎言 submitted on 2019-12-24 23:16:28
Question: There is a website that I'm trying to crawl. The crawler DISCOVERs and FETCHes the URLs, but nothing ends up in docs. The website is https://cactussara.ir. Where is the problem? This is the robots.txt of the website:

User-agent: *
Disallow: /

And this is my urlfilters.json:

{
  "com.digitalpebble.stormcrawler.filtering.URLFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter",
      "name": "BasicURLFilter",
      "params": {
        "maxPathRepetition": 8,
        "maxLength":

StormCrawler: Timeout waiting for connection from pool

三世轮回 submitted on 2019-12-11 16:14:00
Question: We consistently get the following error when we increase either the number of threads or the number of executors for the Fetcher bolt.

org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286) ~[stormjar.jar:?]
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263) ~

Stormcrawler not indexing content with Elasticsearch

泄露秘密 submitted on 2019-12-08 12:35:30
Question: When using StormCrawler it indexes to Elasticsearch, but not the content. StormCrawler is up to date with 'origin/master' of https://github.com/DigitalPebble/storm-crawler.git and I am using elasticsearch-5.6.4. crawler-conf.yaml has:

indexer.url.fieldname: "url"
indexer.text.fieldname: "content"
indexer.canonical.name: "canonical"

The url and title fields are indexed, but not content. I have been trying to get this working by following Julien's tutorial at https://www.youtube.com/watch?v=xMCuWpPh-4A

Run StormCrawler in local mode or install Apache Storm?

最后都变了- submitted on 2019-12-08 09:13:06
Question: So I'm trying to figure out how to install and set up Storm/StormCrawler with ES and Kibana as described here. I never installed Storm on my local machine, because when I worked with Nutch before I never had to install Hadoop locally... I thought it might be the same with Storm (maybe not?). I'd like to start crawling with StormCrawler instead of Nutch now. It seems that if I just download a release and add the /bin to my PATH, I can only talk to a remote cluster. It seems like I need to set up a
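For comparison, the local-mode invocations quoted in the other questions above have this shape (the jar and flux file names here are placeholders, not from this question):

    storm jar target/my-crawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local crawler.flux --sleep 86400000

With --local, Flux runs the topology in an in-process LocalCluster for roughly the number of milliseconds given by --sleep, which is usually enough for development and small crawls; submitting to a separately installed Storm cluster is typically only needed for distributed or long-running crawls.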