Configuring parallelism in Storm

Submitted by 喜欢而已 on 2019-12-22 01:10:03

Question


I am new to Apache Storm, and I am trying to figure out how to configure Storm parallelism. There is a great article, "Understanding the Parallelism of a Storm Topology", but it only raises more questions.

When you have a multi-node Storm cluster, each topology is distributed as a whole according to the TOPOLOGY_WORKERS configuration parameter. So if you have 5 workers, you have 5 copies of the spout (1 per worker), and the same goes for the bolts.

How do you deal with situations like the following inside a Storm cluster (preferably without creating external services)?

  1. I need exactly one spout used by all instances of the topology, for example when input data is pushed to the cluster via a network folder that is scanned for new files.
  2. A similar issue with a specific type of bolt, for example when data is processed by a licensed third-party library that is locked to a specific physical machine.

Answer 1:


First, the basics:

  1. Workers - Run executors; each worker has its own JVM
  2. Executors - Run tasks; Storm distributes executors across the workers
  3. Tasks - Instances running your spout/bolt code

Second, a correction... having 5 workers does NOT mean you will automatically have 5 copies of your spout. Having 5 workers means you have 5 separate JVMs where Storm can assign executors to run (think of these as 5 buckets).
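For context, here is a minimal sketch of where the worker count itself is set; it is a topology-level configuration, separate from any spout or bolt parallelism. The values are illustrative, and the package name assumes Storm 1.x or later (older releases used backtype.storm):

import org.apache.storm.Config;

Config conf = new Config();
conf.setNumWorkers(5); // request 5 worker JVMs (the 5 "buckets") for this topology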

The number of instances of your spout is configured when you first create and submit your topology:

TopologyBuilder builder = new TopologyBuilder();
// parallelism hint = number of executors; setNumTasks = number of task instances
builder.setSpout("0-spout", new MySpout(), spoutParallelism).setNumTasks(spoutTasks);

Since you want only one spout for the entire cluster, you'd set both spoutParallelism and spoutTasks to 1.
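Putting the pieces together, here is a hedged sketch of what a submission for case 1 might look like, with exactly one spout instance; MyBolt, the component names, the parallelism values, and the topology name are placeholders rather than part of the original answer:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SingleSpoutTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // One executor and one task: Storm places this single spout task in one worker JVM.
        builder.setSpout("0-spout", new MySpout(), 1).setNumTasks(1);

        // Downstream bolts can still scale out across the other workers.
        builder.setBolt("my-bolt", new MyBolt(), 4).shuffleGrouping("0-spout");

        Config conf = new Config();
        conf.setNumWorkers(5);
        StormSubmitter.submitTopology("single-spout-topology", conf, builder.createTopology());
    }
}

Note that Storm still decides which worker ends up hosting that single spout task; the setting above only guarantees there is exactly one of it.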



Source: https://stackoverflow.com/questions/27756085/configuring-parallelism-in-storm
