Hazelcast - OperationTimeoutException

问题

I am using Hazelcast version 3.3.1.
I have a 9 node cluster running on aws using c3.2xlarge servers.
I am using a distributed executor service and a distributed map.
Distributed executor service uses a single thread. Distributed map is configured with no replication and no near-cache and stores about 1 million objects of size 1-2kb using Kryo serializer.
My use case goes as follow:

All 9 nodes constantly execute a synchronous remote operation on the distributed executor service and generate about 20k hits per second (about ~2k per node).
Invocations are executed using Hazelcast API: com.hazelcast.core.IExecutorService#executeOnKeyOwner.
Each operation accesses the distributed map on the node owning the partition, does some calculation using the stored object and stores the object in to the map. (for that I use the get and set API of the IMap object).

Every once in a while Hazelcast encounters a timeout exceptions such as:
com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. Aborting invocation! BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:impl:mapService', op=GetOperation{}, partitionId=212, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[172.31.44.2]:5701, backupsExpected=0, backupsCompleted=0}, response=null, done=false} No response has been received! backups-expected:0 backups-completed: 0

In some cases I see map partitions start to migrate which makes thing even worse, nodes constantly leave and re-join the cluster and the only way I can overcome the problem is by restarting the entire cluster.

I am wondering what may cause Hazelcast to block a map-get operation for 120 seconds?
I am pretty sure it's not network related since other services on the same servers operate just fine. Also note that the servers are mostly idle (~70%).

Any feedbacks on my use case will be highly appreciated.

回答1:

Why don't you make use of an entry processor? This is also send to the right machine owning the partition and the load, modify, store is done automatically and atomically. So no race problems. It will probably outperform the current approach significantly since there is less remoting involved.

The fact that the map.get is not returning for 120 seconds is indeed very confusing. If you switch to Hazelcast 3.5 we added some logging/debugging stuff for this using the slow operation detector (executing side) and slow invocation detector (caller side) and should give you some insights what is happening.

Do you see any Health monitor logs being printed?

来源：https://stackoverflow.com/questions/28570609/hazelcast-operationtimeoutexception

标签

hazelcast

distributed-cache