I have just started with Apache Storm. I read the tutorial and looked at the examples. My problem is that all the examples work with very simple tuples (often a single field holding a string). The tuples are created inline (using new Values(...)). In my case, I have tuples with many fields (5 to 100). So my question is: how do I implement such a tuple with a name and a type (all primitive) for each field?
Are there any examples? (I think implementing "Tuple" directly isn't a good idea.)
Thanks.
An alternative to creating the tuple with all of the fields as a value is to just create a bean and pass that inside the tuple.
Given the following class:
import java.io.Serializable;

public class DataBean implements Serializable {
    private static final long serialVersionUID = 1L;

    // add more properties as necessary
    int id;
    String word;

    public DataBean(int id, String word) {
        setId(id);
        setWord(word);
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getWord() {
        return word;
    }

    public void setWord(String word) {
        this.word = word;
    }
}
Create and emit the DataBean in one bolt:
collector.emit(new Values(bean));
Get the DataBean in the destination bolt:
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
    try {
        // the bean is the first (and only) value of the incoming tuple
        DataBean bean = (DataBean) tuple.getValue(0);
        // do your bolt processing with the bean
    } catch (Exception e) {
        LOG.error("WordCountBolt error", e);
        collector.reportError(e);
    }
}
Don't forget to make your bean serializable and to register it when you set up your topology:
Config stormConfig = new Config();
// register the bean class with Storm's serialization
stormConfig.registerSerialization(DataBean.class);
// more stuff
StormSubmitter.submitTopology("MyTopologyName", stormConfig, builder.createTopology());
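Before submitting the topology, it can be useful to check that the bean actually survives a serialization round trip. The following is a minimal sketch using plain Java serialization; Storm serializes tuples with Kryo, so this only verifies the Serializable contract and is not a test of Storm's own path. The helper class SerializationCheck is hypothetical, and DataBean is a trimmed copy of the class above so that the sketch compiles on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Trimmed copy of the DataBean above, repeated so the sketch is self-contained.
class DataBean implements Serializable {
    private static final long serialVersionUID = 1L;
    int id;
    String word;

    DataBean(int id, String word) {
        this.id = id;
        this.word = word;
    }
}

// Hypothetical helper, not part of Storm.
final class SerializationCheck {
    private SerializationCheck() {}

    /** Serializes the object to bytes and reads it back as an independent copy. */
    static <T extends Serializable> T roundTrip(T original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            // A NotSerializableException (e.g. from a non-serializable field) ends up here.
            throw new IllegalStateException("Object did not survive serialization", e);
        }
    }
}
```

Calling SerializationCheck.roundTrip(new DataBean(7, "storm")) should return a copy with the same id and word; a bean holding a non-serializable field would fail with an IllegalStateException instead.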
Disclaimer: Beans will work fine for shuffle grouping. If you need to do a fieldsGrouping, you should still emit the grouping key as a separate primitive field. For example, in the Word Count scenario, you need to group by word, so you might emit:
collector.emit(new Values(word, bean));
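To see why a separate key field helps, here is a simplified model of hash-based routing. It assumes that the target task is chosen from the hash of the grouping values, which is a sketch of the idea and not Storm's actual code. PlainBean and GroupingSketch are hypothetical names used only for illustration.

```java
import java.util.List;

// A bean without equals()/hashCode(): two instances with the same content
// are not equal and hash by object identity.
class PlainBean {
    final int id;
    final String word;

    PlainBean(int id, String word) {
        this.id = id;
        this.word = word;
    }
}

// Hypothetical, simplified model of fields grouping; not Storm's implementation.
final class GroupingSketch {
    private GroupingSketch() {}

    /** Picks a task index in [0, numTasks) from the hash of the grouping values. */
    static int chooseTask(List<?> groupingValues, int numTasks) {
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }
}
```

In this model, two lists that each contain the string "storm" always map to the same task index, because String defines a content-based hashCode(). Two PlainBean instances with identical fields are not equal, so nothing guarantees that they end up on the same task.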
I would implement a custom tuple/value type as follows: instead of using member variables to store the data, each attribute is mapped to a fixed index in the object list of the inherited Values type. This approach avoids the "fields grouping" problem of a regular bean:
- it is not required to add additional attributes for fields grouping (which is quite unnatural)
- data duplication is avoided (reducing the number of shipped bytes)
- it preserves the advantage of the beans pattern
An example for word count would be something like this:
// Storm 1.0 and later; older releases use the backtype.storm.tuple package
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class WordCountTuple extends Values {
    private static final long serialVersionUID = -4386109322233754497L;

    // attribute indexes
    /** The index of the word attribute. */
    public static final int WRD_IDX = 0;
    /** The index of the count attribute. */
    public static final int CNT_IDX = 1;

    // attribute names
    /** The name of the word attribute. */
    public static final String WRD_ATT = "word";
    /** The name of the count attribute. */
    public static final String CNT_ATT = "count";

    // required for serialization
    public WordCountTuple() {}

    public WordCountTuple(String word, int count) {
        super.add(WRD_IDX, word);
        super.add(CNT_IDX, count);
    }

    public String getWord() {
        return (String) super.get(WRD_IDX);
    }

    public void setWord(String word) {
        super.set(WRD_IDX, word);
    }

    public int getCount() {
        return (Integer) super.get(CNT_IDX);
    }

    public void setCount(int count) {
        super.set(CNT_IDX, count);
    }

    public static Fields getSchema() {
        return new Fields(WRD_ATT, CNT_ATT);
    }
}
To avoid inconsistencies, final static variables are used for the "word" and "count" attributes. Furthermore, a method getSchema() returns the implemented schema, which is used to declare output streams in the Spout/Bolt method declareOutputFields(...).
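For tuples with many fields, maintaining one index constant and one name constant per attribute by hand gets tedious. One possible variation, sketched here under assumptions, is to let an enum define both the index (its ordinal) and the attribute name. The names SensorField and SensorTuple are hypothetical, and the class extends ArrayList<Object> (the superclass of Values) instead of Values so that the sketch compiles without Storm on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical field list; the declaration order defines each field's index.
enum SensorField {
    ID, TEMPERATURE, HUMIDITY;

    /** Attribute name used in the schema, e.g. "temperature". */
    String attributeName() {
        return name().toLowerCase(Locale.ROOT);
    }
}

// Stand-in for a class extending Values; ArrayList<Object> keeps the sketch Storm-free.
class SensorTuple extends ArrayList<Object> {
    private static final long serialVersionUID = 1L;

    SensorTuple(int id, double temperature, double humidity) {
        // Reserve one slot per field, then fill each slot by its enum index.
        for (int i = 0; i < SensorField.values().length; i++) {
            add(null);
        }
        set(SensorField.ID.ordinal(), id);
        set(SensorField.TEMPERATURE.ordinal(), temperature);
        set(SensorField.HUMIDITY.ordinal(), humidity);
    }

    int getId() {
        return (Integer) get(SensorField.ID.ordinal());
    }

    double getTemperature() {
        return (Double) get(SensorField.TEMPERATURE.ordinal());
    }

    double getHumidity() {
        return (Double) get(SensorField.HUMIDITY.ordinal());
    }

    /** Field names in index order; in Storm these would be wrapped in a Fields object. */
    static List<String> schemaNames() {
        List<String> names = new ArrayList<>();
        for (SensorField field : SensorField.values()) {
            names.add(field.attributeName());
        }
        return names;
    }
}
```

Adding a field then means adding one enum constant plus its typed getter; the schema follows automatically. Reordering the enum constants changes the indexes, so the order should be treated as part of the tuple's contract.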
For output tuples, this type can be used in a straightforward way:
public class MyOutBolt implements IRichBolt {
    @Override
    public void execute(Tuple tuple) {
        // some more processing
        String word = ...
        int cnt = ...
        collector.emit(new WordCountTuple(word, cnt));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // the schema comes from the tuple type, so names stay consistent
        declarer.declare(WordCountTuple.getSchema());
    }

    // other methods omitted
}
For input tuples, I would suggest the following pattern:
public class MyInBolt implements IRichBolt {
    // use a single instance to avoid GC thrashing
    private final WordCountTuple input = new WordCountTuple();

    @Override
    public void execute(Tuple tuple) {
        // replace the previous content with the values of the current tuple
        this.input.clear();
        this.input.addAll(tuple.getValues());
        String word = input.getWord();
        int count = input.getCount();
        // do further processing
    }

    // other methods omitted
}
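The clear-and-refill step can be tried out without a running topology. The sketch below is a Storm-free stand-in: WordCountView plays the role of WordCountTuple (with ArrayList<Object> in place of Values), and CountSummer is a hypothetical consumer that reuses one view for every input, the way MyInBolt does.

```java
import java.util.ArrayList;
import java.util.List;

// Storm-free stand-in for WordCountTuple; ArrayList<Object> plays the role of Values.
class WordCountView extends ArrayList<Object> {
    private static final long serialVersionUID = 1L;
    static final int WRD_IDX = 0;
    static final int CNT_IDX = 1;

    String getWord() {
        return (String) get(WRD_IDX);
    }

    int getCount() {
        return (Integer) get(CNT_IDX);
    }

    /** Replaces the current content with the values of the next incoming tuple. */
    void load(List<Object> values) {
        clear();
        addAll(values);
    }
}

// Hypothetical consumer that reuses a single view instance for all inputs.
class CountSummer {
    private final WordCountView input = new WordCountView();
    private long total;

    void accept(List<Object> values) {
        input.load(values);
        total += input.getCount();
    }

    long total() {
        return total;
    }
}
```

Note that a freshly constructed view is empty, so the getters throw an IndexOutOfBoundsException until the first load. The values are also copied on every call, which trades a small copy for not allocating a new wrapper per tuple.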
MyOutBolt and MyInBolt can be connected as follows:
TopologyBuilder b = ...
b.setBolt("out", new MyOutBolt());
// fieldsGrouping expects a Fields object, so the attribute name is wrapped
b.setBolt("in", new MyInBolt()).fieldsGrouping("out", new Fields(WordCountTuple.WRD_ATT));
Using fields grouping is straightforward, because WordCountTuple allows each attribute to be accessed individually.
Source: https://stackoverflow.com/questions/32053795/how-to-use-apache-storm-tuple