Distributed caching in Storm

Question

How can temporary data be stored in Apache Storm?

In a Storm topology, a bolt needs to access previously processed data.

E.g.: the bolt processes variable1 with a result of 20 at 10:00 AM.

If variable1 is then received as 50 at 10:15 AM, the result should be 30 (50-20).

Later, if variable1 is received as 70 at 10:30, the result should be 20 (70-50).

How can this be achieved?

Answer 1:

In short, you want to do micro-batch-style calculations across the tuples flowing through Storm. First, define or find a key in the tuple. Use fields grouping (not shuffle grouping) between the bolts on that key. This guarantees that tuples with the same key are always sent to the same task of the downstream bolt. Then keep an instance-level collection (a List or Map) in the bolt to hold the old values, and add each new value to it for the calculation. You don't need to worry about thread safety: each task has its own bolt instance, so the collection is not shared between executors of the same bolt.
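To make the routing guarantee concrete, here is a minimal plain-Java model of it. It is not Storm's actual implementation: the class name `FieldsGroupingSketch`, the hash-modulo rule in `taskFor`, and the per-task maps are all assumptions chosen only to show why per-task state is sufficient when the same key always reaches the same task.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of fields grouping: NOT Storm's real implementation.
class FieldsGroupingSketch {
    // One private state map per simulated bolt task.
    private final List<Map<String, Integer>> taskState = new ArrayList<>();

    FieldsGroupingSketch(int numTasks) {
        for (int i = 0; i < numTasks; i++) {
            taskState.add(new HashMap<>());
        }
    }

    // Hypothetical routing rule: hash the key, then map it onto a task index.
    int taskFor(String key) {
        return Math.floorMod(key.hashCode(), taskState.size());
    }

    // Deliver a (key, value) tuple to the task that owns the key; return that task's index.
    int route(String key, int value) {
        int task = taskFor(key);
        taskState.get(task).put(key, value);
        return task;
    }

    // Last value the given task stored for the key, or null if that task never saw it.
    Integer lastValueAt(int task, String key) {
        return taskState.get(task).get(key);
    }

    public static void main(String[] args) {
        FieldsGroupingSketch topology = new FieldsGroupingSketch(4);
        int first = topology.route("variable1", 20);
        int second = topology.route("variable1", 50);
        System.out.println(first == second); // true: same key, same task
    }
}
```

With shuffle grouping the two tuples could land on different tasks, and neither task would hold the full history for variable1.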

Answer 2:

I'm afraid there is no such built-in functionality as of today. But you can use any kind of distributed cache, like memcached or Redis. Those caching solutions are really easy to use.
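As a rough sketch of what that could look like, the bolt's logic can sit behind a small store interface. Everything named here is hypothetical: `LastValueStore`, `InMemoryLastValueStore` and `ExternalStateDiff` are illustration-only names, and the in-memory class merely stands in for a real Redis or memcached client so the example runs without a server.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical abstraction over an external cache such as Redis or memcached.
interface LastValueStore {
    Integer get(String key);          // null when the key is unknown
    void put(String key, int value);
}

// In-memory stand-in so the sketch runs without a cache server.
class InMemoryLastValueStore implements LastValueStore {
    private final Map<String, Integer> data = new ConcurrentHashMap<>();
    public Integer get(String key) { return data.get(key); }
    public void put(String key, int value) { data.put(key, value); }
}

// Logic a bolt could delegate to; it keeps no state of its own.
class ExternalStateDiff {
    private final LastValueStore store;

    ExternalStateDiff(LastValueStore store) { this.store = store; }

    // Returns current minus the previously stored value (or current if none), then stores current.
    int diff(String variable, int current) {
        Integer last = store.get(variable);
        store.put(variable, current);
        return last == null ? current : current - last;
    }

    public static void main(String[] args) {
        LastValueStore shared = new InMemoryLastValueStore();
        ExternalStateDiff taskA = new ExternalStateDiff(shared);
        ExternalStateDiff taskB = new ExternalStateDiff(shared); // e.g. a restarted task
        System.out.println(taskA.diff("variable1", 20)); // 20
        System.out.println(taskB.diff("variable1", 50)); // 30
    }
}
```

Because the state lives outside the bolt, a second or restarted task sees the same last value. Note that the get-then-put pair is not atomic, so tuples for the same key should still be processed by one task at a time.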

Answer 3:

There are a couple of approaches, but the right one depends on your system requirements, your team's skills and your infrastructure.

You could use Apache Cassandra to store your events and pass the row's key in the tuple so that the next bolt can retrieve the row.

If your data is time series in nature, you may want to have a look at OpenTSDB or InfluxDB.

You could of course fall back to something like Software Transactional Memory, but I think that would need a good amount of crafting.

Answer 4:

You can use Guava's CacheBuilder to remember your data within your class that extends BaseRichBolt (put this in the prepare method):

// init your cache.
this.cache = CacheBuilder.newBuilder()
                         .maximumSize(maximumCacheSize)
                         .expireAfterWrite(expireAfterWrite, TimeUnit.SECONDS)
                         .build();

Then, in execute, you can use the cache to check whether you have already seen that key. From there you can add your business logic:

// if we haven't seen it before, we can emit it.
if(this.cache.getIfPresent(key) == null) {
    cache.put(key, nearlyEmptyList);
    this.collector.emit(input, input.getValues());
}

this.collector.ack(input);
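If you only need the size bound and not the time-based expiry, the same "have we seen this key" check can be sketched with the JDK alone. This swaps CacheBuilder for an access-ordered LinkedHashMap; the class name `SeenKeys` and its methods are made up for illustration, and the map is not thread safe, so it is meant to be owned by a single bolt task.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Size-bounded "seen keys" cache built on the JDK only; no time-based expiry.
class SeenKeys {
    private final Map<String, Boolean> seen;

    SeenKeys(final int maximumSize) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                // Evict the least recently used key once the bound is exceeded.
                return size() > maximumSize;
            }
        };
    }

    // Returns true the first time a key is seen (the case where the bolt would emit).
    boolean firstTime(String key) {
        return seen.put(key, Boolean.TRUE) == null;
    }

    // Number of keys currently remembered.
    int trackedKeys() {
        return seen.size();
    }

    public static void main(String[] args) {
        SeenKeys cache = new SeenKeys(2);
        System.out.println(cache.firstTime("a")); // true
        System.out.println(cache.firstTime("a")); // false
    }
}
```

In execute, the bolt would emit only when `firstTime(key)` returns true and ack the tuple either way.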

Answer 5:

This question is a good candidate to demonstrate Apache Spark's in-memory computation over micro-batches. However, your use case is trivial to implement in Storm.

1) Make sure the bolt uses fields grouping. It consistently hashes each incoming tuple to the same bolt task, so no task misses a tuple for its key.

2) Maintain a Map in the bolt's local cache. This map keeps the last known value of each "variable".

class CumulativeDiffBolt extends InstrumentedBolt {

    // Last value seen for each variable; local to this bolt task.
    Map<String, Integer> lastKnownVariableValue;

    @Override
    public void prepare() {
        this.lastKnownVariableValue = new HashMap<>();
        ....
    }

    @Override
    public void instrumentedNextTuple(Tuple tuple, Collector collector) {
        .... extract variable from tuple
        .... extract current value from tuple
        Integer lastValue = lastKnownVariableValue.getOrDefault(variable, 0);
        Integer newValue = currValue - lastValue;

        // Store the current value (not the difference) so the next tuple is diffed against it.
        lastKnownVariableValue.put(variable, currValue);
        emit(new Values(variable, newValue));
        ...
    }
}
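For reference, the per-variable difference logic can be checked outside Storm with a small plain-Java sketch, using the numbers from the question. `DeltaTracker` and `update` are hypothetical names, and the sketch assumes that a variable with no earlier value is diffed against 0. The important detail is that the map must hold the current value, not the computed difference, or the third result would come out wrong.

```java
import java.util.HashMap;
import java.util.Map;

// Per-variable "difference from the previous value" logic, independent of Storm.
class DeltaTracker {
    private final Map<String, Integer> lastKnownValue = new HashMap<>();

    // Returns currentValue minus the previous value for this variable (0 if none yet).
    int update(String variable, int currentValue) {
        int lastValue = lastKnownValue.getOrDefault(variable, 0);
        // Remember the raw current value for the next call.
        lastKnownValue.put(variable, currentValue);
        return currentValue - lastValue;
    }

    public static void main(String[] args) {
        DeltaTracker tracker = new DeltaTracker();
        System.out.println(tracker.update("variable1", 20)); // 20 (no earlier value, so 20 - 0)
        System.out.println(tracker.update("variable1", 50)); // 30
        System.out.println(tracker.update("variable1", 70)); // 20
    }
}
```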

Source: https://stackoverflow.com/questions/28249388/distributed-caching-in-storm

Tags

apache-storm